Alongside an explosion in research and development related to large language models, there has been a concomitant rise in the creation of pretraining datasets: massive collections of text, typically scraped from the web. Drawing on the field of archival studies... ...