The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior

Standard offline evaluations for language models fail to capture how these models actually behave in practice, where personalization fundamentally alters model behavior. In this work, we provide empirical evidence of this phenomenon by comparing offlin...