An archival perspective on pretraining data

Alongside an explosion in research and development related to large language models, there has been a concomitant rise in the creation of pretraining datasets: massive collections of text, typically scraped from the web. Drawing on the field of archival studies...
