The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior

Standard offline evaluations for language models fail to capture how these models actually behave in practice, where personalization fundamentally alters model behavior. In this work, we provide empirical evidence showcasing this phenomenon by comparing offlin...
