An archival perspective on pretraining data

Alongside an explosion in research and development related to large language models, there has been a concomitant rise in the creation of pretraining datasets: massive collections of text, typically scraped from the web. Drawing on the field of archival studies... ...