
Multi-turn, Long-context Benchmark Papers 4

이게될까 2026. 2. 4. 02:51

https://aclanthology.org/2024.emnlp-main.811/

 

LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History

Akash Gupta, Ivaxi Sheth, Vyas Raina, Mark Gales, Mario Fritz. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.


 

 
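The paper's manipulation is simple: prepend conversation turns from one task, then switch to a different task and measure how much the stale history hurts. Below is a minimal sketch of that setup, assuming a generic chat-message format; the sentiment/arithmetic task pair is my own illustration, not necessarily a pair the paper uses.

```python
# Sketch of a task-switch probe: same target query, with and without a
# conversational history from an unrelated prior task.

def build_history(prior_task_turns, switch_prompt):
    """Prepend turns from a different task before the target query."""
    history = []
    for user_msg, assistant_msg in prior_task_turns:
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": assistant_msg})
    history.append({"role": "user", "content": switch_prompt})
    return history

# Illustrative prior task (sentiment classification), then a switch to arithmetic.
prior_turns = [
    ("Classify the sentiment: 'I loved the movie.'", "positive"),
    ("Classify the sentiment: 'The food was awful.'", "negative"),
]
with_switch = build_history(prior_turns, "What is 17 * 23?")
no_switch = [{"role": "user", "content": "What is 17 * 23?"}]
# Feeding both to the same model over many samples and comparing accuracy
# estimates the interference effect of the task switch.
```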

https://arxiv.org/abs/2502.05167

 

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). …


 

 
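As a reference point, a vanilla NIAH probe looks roughly like the sketch below: drop one relevant sentence at a controlled depth inside long filler text and ask for it back. NoLiMa's argument is that when the needle and the question share literal words (as they do here), models can win by surface matching, so its needles are written to minimize that overlap. `ask_model`, the filler, and the needle are all made up for illustration.

```python
# Sketch of a needle-in-a-haystack (NIAH) evaluation prompt.

def make_niah_prompt(haystack_sentences, needle, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack_sentences) * depth)
    doc = haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]
    return " ".join(doc)

filler = [f"Filler sentence number {i}." for i in range(5000)]
needle = "The secret passcode for the vault is 7421."
question = "What is the secret passcode for the vault?"

for depth in (0.1, 0.5, 0.9):
    context = make_niah_prompt(filler, needle, depth)
    prompt = f"{context}\n\nQuestion: {question}"
    # answer = ask_model(prompt)  # score: does "7421" appear in the answer?
```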

https://arxiv.org/abs/2501.17399

 

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations …


 
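A rough sketch of how one instance of a benchmark like this can be scored automatically: replay a fixed multi-turn history, elicit only the final response, and check it against a per-instance criterion. The schema and the `generate`/`judge` callables are my assumptions, not MultiChallenge's actual format.

```python
# Sketch of scoring one multi-turn benchmark instance.

example = {
    "history": [
        {"role": "user", "content": "Answer in exactly one sentence from now on."},
        {"role": "assistant", "content": "Understood."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    "criterion": "The reply is one sentence and mentions Rayleigh scattering.",
}

def score_instance(example, generate, judge):
    """generate(history) -> model reply; judge(reply, criterion) -> bool."""
    reply = generate(example["history"])
    return judge(reply, example["criterion"])
```

The first user turn plants an instruction the model must still honor at the final turn, which is exactly the kind of retention failure multi-turn benchmarks probe.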

https://arxiv.org/abs/2505.17123

 

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute it to the …


 

 
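Evaluating interactive reasoning needs an environment that can answer the model's intermediate queries without a human in the loop. Here is a toy loop in that spirit; the guessing game is an illustrative stand-in, not an actual MTR-Bench task.

```python
import random

# Sketch of an automated multi-turn interaction episode.

def run_episode(model_step, secret=None, low=1, high=100, max_turns=10):
    """model_step(transcript) -> next integer guess; env replies higher/lower."""
    secret = secret if secret is not None else random.randint(low, high)
    transcript = [f"Guess the number between {low} and {high}."]
    for _ in range(max_turns):
        guess = model_step(transcript)
        if guess == secret:
            return True, transcript
        hint = "higher" if guess < secret else "lower"
        transcript.append(f"Guess {guess}: go {hint}.")
    return False, transcript
```

Because the environment scores every episode programmatically, this style of evaluation scales without human annotation and is hard to contaminate with memorized answers.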

https://arxiv.org/abs/2403.06447

 

CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail Recommendation

The long-tail recommendation is a challenging task for traditional recommender systems, due to data sparsity and data imbalance issues. The recent development of large language models (LLMs) has shown their abilities in complex reasoning, which can help to …

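The core idea as I read it: retrieve collaborative evidence (similar users' interactions) into the prompt so the LLM is not reasoning from the target user's sparse history alone. Below is a minimal sketch with a naive overlap-based retriever; CoRAL itself uses a learned retrieval policy rather than a fixed heuristic, which this skips.

```python
# Sketch of collaborative retrieval-augmented prompting for recommendation.

def retrieve_similar_users(target_items, all_users, k=2):
    """Rank other users by item overlap with the target user's history."""
    scored = [(len(target_items & items), uid, items)
              for uid, items in all_users.items()]
    scored.sort(reverse=True)
    return scored[:k]

def build_prompt(target_items, candidates, neighbors):
    lines = [f"Target user liked: {sorted(target_items)}"]
    for _, uid, items in neighbors:
        lines.append(f"Similar user {uid} liked: {sorted(items)}")
    lines.append(f"Which of {candidates} should we recommend? Explain briefly.")
    return "\n".join(lines)

users = {"u1": {"A", "B", "C"}, "u2": {"B", "C", "D"}, "u3": {"X", "Y"}}
target = {"B", "C"}
prompt = build_prompt(target, ["D", "X"], retrieve_similar_users(target, users))
```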