Preparing Data at Scale: The Data Pipeline for AuroraGPT
AuroraGPT seeks to test the hypothesis that a model trained on
additional scientific data and text will improve performance on
scientific tasks. Existing models contain relatively little such data:
PaLM, the predecessor to Google's Gemini model family, was trained on
770B tokens, of which only ∼1.9% was scientific text. To meet our
goal, we seek to incorporate substantially more scientific text.

In this presentation, we will share the recent progress of the
AuroraGPT Data Team: how we contribute to building AuroraGPT as a
science-focused LLM, how we collaborate with the other teams, and
which topics we see as open questions. The data team is responsible
for identifying, preparing, and deduplicating scientific data and
text. We will discuss the systems and data-quality challenges our
team tackles in turning terabytes of raw scientific data and text
into high-quality training data.
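
Deduplication at this scale often relies on approximate matching. As a minimal sketch of what that can look like, the Python example below implements MinHash-based near-duplicate detection, a common technique for large text corpora; the shingle size, signature length, and sample documents are assumptions for illustration, not details of the AuroraGPT pipeline.

```python
# Minimal sketch of MinHash-based near-duplicate detection, a common
# approximate technique for deduplicating large text corpora. Illustrative
# only: the shingle size and signature length are assumed parameters, and
# this is not necessarily the approach the AuroraGPT Data Team uses.
import hashlib

NUM_PERM = 64  # signature length: one slot per seeded hash function

def shingles(text, n=5):
    """Yield word n-grams ("shingles") from a document."""
    words = text.lower().split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash value seen
    over all shingles; the resulting vector is the MinHash signature."""
    sig = [float("inf")] * NUM_PERM
    for sh in shingles(text):
        for seed in range(NUM_PERM):
            h = hashlib.blake2b(sh.encode("utf-8"), digest_size=8,
                                salt=seed.to_bytes(8, "little"))
            value = int.from_bytes(h.digest(), "little")
            if value < sig[seed]:
                sig[seed] = value
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity
    of the two documents' shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

# Usage: flag document pairs whose estimated similarity crosses a threshold.
doc_a = "the quick brown fox jumps over the lazy dog every single day"
doc_b = "the quick brown fox jumps over the lazy dog every other day"
sim = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated Jaccard similarity: {sim:.2f}")  # high -> near-duplicates
```

At terabyte scale, such signatures are typically bucketed with locality-sensitive hashing so that only candidate pairs are compared, rather than every pair of documents.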