Mitigating the lost-in-the-middle phenomenon: a context re-ranking strategy for improving retrieval-augmented generation performance

Saveliev Dmitry

Аннотация статьи

Retrieval-Augmented Generation (RAG) systems have been able to achieve significant performance gains for large language models by leveraging knowledge from external sources. However, such systems suffer from a critical drawback known as the "missing middle" phenomenon, where relevant information in the second half of the retrieved context suffers from poor sequential decision making compared to information at the beginning or end. Since such positional bias inevitably degrades the performance and robustness of the RAG system, we present a novel context reranking method that addresses this drawback using a lightweight transformer-based model trained using contrastive learning. Our method restructures retrieved documents to maximize positional stance before being processed by the language model. Large-scale experiments on the Natural Questions and HotpotQA datasets show consistent performance gains across all positional scenarios, with accuracy improving by 12–15% on middle-stance examples. The suggestedReader reranker operates with low overhead and computational efficiency while outperforming the baseline RAG models. Our ablation experiment demonstrates that the 4-layer transformer configuration results in an ideal tradeoff between performance gain and computational overhead, which is the empirical "sweet spot" of the implementation in practice.

1. Positional Bias: Standard information retrieval systems often assume that documents are uniform in quality and a good ranking system is one that produces a good set of top documents. This means that the rank order is the most crucial aspect of a retrieval system. In contrast, RAG systems rely on the top document only. The retriever does not have positional bias, and this turns out to be very important for understanding why RAG systems work so well.

2. Context Re-ranking: It is often said that large language models are really just ordinary models that have been trained on a huge amount of data. However, when trying to use large language models to rank documents, using the same training objective that we would for a smaller model leads to poor results. We show that this is because the large model makes much better use of context. A re-ranking system that uses a large language model as its scoring function works surprisingly well.

Текст статьи

1. Introduction

The large language model (LLM) development has rapidly revolutionized natural language processing. However, these models and their applications are still affected by the age-old problems of knowledge obsolescence, falsifiability (Are you sure that what you said is correct?), and a tendency to make things up (as in a novel or work of art). We're still working on that. One interesting potential remedy is what's called Retrieval-Augmented Generation (RAG). In this new architecture, you combine a language model with a search engine and get back stuff anchored in reality as you're doing the generation part. So, is RAG the answer to our LLM recipe for success? Well, yes and no. If you put together a 2023-era LLM with a 2023-era search engine, you get something that still suffers from a pretty basic performance limitation. The problem in question – a newly named effect that occurs when the RAG system gets lost in the middle – is worse than it seems. The hype is mostly unfounded. The 10,000-foot overview of the situation we've just given delves into the nitty-gritty in key parts and includes a context re-ranking algorithm that – without needing groundbreaking new techniques – offers notable performance upping on relevant downstream tasks.

2. Related Work

2.1. Retrieval-Augmented Generation Systems

The work by Lewis et al. (2020) on RAG was pioneering in establishing the blend of contextual knowledge that parametric (in the form of model weights) and non-parametric (in the form of external text accessed via retrieval) knowledge provide. The most significant advances since then have focused on making the retrieval itself better [7, p. 6769-6781], improving the generation quality of what is output at the end [16, p. 3784-3803], and generally optimizing the system architecture for the kind of domain-specific tasks that we are especially interested in [5, p. 874-880]. More recently, RAG systems have been asked to perform multi-hop reasoning [17, p. 539-554], to do better iterative retrieval [6, p. 7969-7992], and to adapt well to specific tasks [21, p. 15395-15434]. However, these enhancements have neglected to consider the challenge of handling positionally biased contexts, which is a real issue for RAG systems and a big research gap.

2.2. Positional Bias in Language Models

Positional bias in transformer models has been found to be a universal tendency across not only model scales but also varying architectures [11, p. 157-173]. These authors showed that attention patterns in language models are U-shaped – that is, with middle positions in a lengthy context receiving the least amount of attention. This effect has been replicated in several follow-up studies that have examined not just the original model family (e.g., BERT, GPT-2, etc.) but also other model families [8; 19, p. 4195-4205]. Mechanistic interpretability research has even attempted to explain why this phenomenon occurs, with some studies attributing it to attention head specialization (which makes sense given the nature of the task) and others pointing out that it might be a by-product of how poorly the position encodings and model keys work together [15, p. 842-866; 18, p. 63-76]. However, none of this work has really helped us figure out what we should do about these effects, should we desire not to have them in our production systems. To be fair, the problem is probably under-researched in general.

2.3. Re-ranking in Information Retrieval

Information retrieval systems have a long-standing relationship with re-ranking methods. These systems, when utilizing second-stage models, perform re-ranking on the initial retrieval results. This is done, generally speaking, to make the information obtained from the retrieval system more relevant to the user's query. In recent memory, the re-ranking methods commonly used in machine learning were applied to the context of deep learning. These more modern approaches utilize transformer architectures. They achieve state-of-the-art performances via making use of deeper semantic understanding of the retrieved texts. Extending re-ranking to RAG systems is what we do in this work, but with a twist (that is, without a major premise of irrelevance to the user's query).

3. Methodology

3.1 Problem Formulation

For a user query q and a set of retrieved documents D = {d₁, d₂, ..., dₙ}, conventional RAG systems concatenate the documents in their original retrieval order to create context C = [d₁; d₂; ...; dₙ]. The language model then generates a response r conditioned on this context: r = LLM(q, C). Our method adds a re-ranking function R(q, D) → D' that reorders documents to maximize positional placement before conditioning the language model on the context. The result is a better RAG pipeline: r = LLM(q, R(q, D)).

3.2. Context Re-ranker Architecture

Our re-ranking model is a lightweight transformer aimed at computational efficiency without losing advanced semantic comprehension. We have a model with 2-4 transformer layers with multi-head attention and fine-tuning for query-document relevance scoring. The architecture passes query-document pairs through shared embedding layers, topped by transformer blocks that calculate attention-weighted representations. A final classifier produces relevance scores for document reordering. We have a trade-off between expressiveness and computational tractability that makes our model feasible for production.

3.3. Contrastive Learning Training

Render all the other documents in the batch even less relevant than they appeared before.

In training, we use a batch size of 128, and the number of positive examples ranges from 4 to 6, indicating which of the 128 documents are more relevant to the 128 queries in the batch. The model has 512 output dimensions from the second-to-last layer, which produces an embedding for a document. We compute the positive relevance score as the inner product between the embedding for the document that has a correct answer and the embedding for the query. For the contrastive framework, the model learns to push the positive score higher and the scores for the negatives lower. to distinguish between relevant and irrelevant content.

3.4. Training Data Construction

We construct training datasets from the Natural Questions [9, p. 453-466] and HotpotQA [20, p. 2369-2380] benchmarks. There, we form query-document pairs with known relevance labels. The documents that contain ground-truth answers are our positive examples. The negative examples are randomly sampled documents from the corpus. To make our training resemble a real-world retrieval scenario, we ensure that the training batches contain documents of different relevance levels. So, we have some documents that are partially relevant to the query, some that are totally irrelevant, and our positive documents that are spot-on. This training strategy is useful because it directly helps the model generalize and become more robust.

4. Experimental Setup

4.1. Datasets and Evaluation

We assess our approach on two widely used question-answering benchmarks, Natural Questions and HotpotQA. Natural Questions is derived from genuine searches made by users in Google, accompanied by the potential to access relevant Wikipedia articles, while HotpotQA aims to do something quite different: interrogate the reasoning capabilities of state-of-the-art models in situations where several documents are necessary to arrive at the right answer. We design three evaluation settings for each benchmark to look at the same kind of effect through three different lenses. This is not a story about how well or poorly the models perform. Instead, it is a controlled narrative about how retrieved document position affects answer quality.

4.2. Baseline Systems

We compare our re-ranking method against several baseline systems. These include the default RAG run without re-ranking, a randomly permuted document run, and relevance-based re-ranking that does not optimize for the positions of the top retrieved documents. These baselines make for a thorough performance perspective and allow us to examine what our re-ranker is adding in terms of individual contributions.

4.3. Evaluation Metrics

The principal factor is the accuracy of the answers given by the model, which we test in a variety of positional setups. We score the models using exact match criteria and F1 scoring, using the ground-truth answers as our scoring gold standard. Two other factors are important in evaluating the model. One is how hard or easy the model is to run, which we express in terms of the number of floating-point operations needed to do a forward pass through the model. The other is how fast the model runs in inference mode.

5. Results

5.1. Performance Analysis

Our experimental results show consistent and substantial improvements in every evaluation setting. The re-ranked RAG system delivers a large performance boost over the baseline systems. Our largest and most significant gains are for the middle-position cases.

Performance Comparison: Baseline vs. Re-ranked RAG.

In this comparison, we observe the performance of two approaches. The first is a baseline model that retrieves documents directly from a database to answer questions. The second approach is a re-ranked model, which first retrieves documents and then re-ranks them based on their relevance to the question being asked. The y-axis shows the performance level, while the x-axis represents different instances or cases being evaluated.

Table 1

Performance comparison between the baseline RAG model and the RAG model after re-ranking the output over different positional configurations on the Natural Questions dataset

Position	Baseline RAG (%)	Re-ranked RAG (%)	Improvement
Beginning	75.2 ± 3.1	81.4 ± 2.8	+6.2%
Middle	48.7 ± 4.2	63.1 ± 3.7	+12.6%
End	63.1 ± 3.7	67.9 ± 3.2	+4.8%

It is clear that our method is quite successful in reducing the "lost-in-the-middle" effect.

5.2. Ablation Studies

We conducted in-depth ablation studies to understand how different architectural setups affect performance and computational cost. Our studies show that model complexity has a very limited effect on improving performance once the number of transformer layers exceeds four. In fact, the best configuration we found has only four layers.

Ablation Study: Performance vs. Computational Cost.

Table 2

Results of the ablation study on the performance vs. computational cost trade-offs for various re-ranker architectures. The 4-layer setup stands out as the best option, with dramatic performance gains and tolerable computational demands. The 4-layer setup has an accuracy of 81.1% with acceptable computational overhead, the practical sweet spot for production use

Configuration	Accuracy (%)	FLOPs (10)	Latency (ms)
2-layer	72.3	1.2	15.3
4-layer	81.1	2.4	23.7
6-layer	85.6	4.8	41.2
8-layer	87.2	9.6	78.5

5.3. Computational Efficiency Analysis

Our re-ranking method imposes only a slight increase in computation on the existing RAG pipeline. The lightweight transformer architecture contained within can still process its query-document pairs with nearly the same efficiency. Overall, we see an approximately 8–12% increase in inference time that, nonetheless, corresponds to sizable performance lifts and, we believe, substantial and more importantly, very affordable improvements for production systems. Indeed, the extra computational expense is usually balanced out by the efficiency gains stemming from our method's better document selection.

6. Discussion

6.1. Theoretical Implications

Our results provide empirical confirmation of positional bias theories for transformer architectures, along with practical techniques for mitigating this bias. The reason we have confidence in this being more than just a task-specific optimization is because we see consistent gains in performance across all our evaluation settings. It's worth noting that the follow-up paper to this one also reports consistent improvements when using position-sensitive techniques. Factor in the success that our context re-ranker has had when using contrastive learning to optimize it, and one begins to see a pretty clear picture.

6.2. Practical Considerations

Conducting context re-ranking in RAG models requires a careful balance of competing demands: resources, and latency. Similar to all ablation studies, ours gives very direct and clear guidance about which parameters to ablate and which not to ablate, when re-ranking is performed in a production RAG model. Consequently, it would also be a reasonable conjecture to assert that the ablation analysis results could guide toward the working implementation of such a production RAG model.

6.3. Limitations and Future Work

Despite the considerable gains our method shows, some limitations need to be brought up. First, our present evaluation is on English-language question-answering tasks, and our method's generalization to other languages and other domains needs exploring. Second, the computational overhead, though small, can be important in very high-throughput applications. So, our present work needs extending in a number of future directions: generalizing the method to simple multi-modal settings and (more ambitiously) to dynamic, query-dependent re-ranking methods; and combining our method with other RAG improvement methods to see if our method's positional optimization effect works with (as it might) iterative retrieval, multi-hop reasoning, or other methods not using simple query dependency.

7. Conclusion

We have created a comprehensive solution to the end-in-the-middle problem affecting RAG systems in the form of intelligent context re-ranking. We unite theoretical knowledge of positional bias with practical advances in architecture, all to achieve substantial performance improvements–and we did so with computational efficiency in mind. The gains show up across a panoply of eval settings, but they are especially striking in mid-position cases, which is where RAG systems tend to fall down.

We offer up the context re-ranking advance as an element of RAG systems, which will obviously increase their accuracy and efficiency in performing KALGs.

Список литературы

Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Amodei D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, P. 1877-1901.
Burges C., Shaked T., Renshaw E., Lazier A., Deeds M., Hamilton N., Hullender G. (2005). Learning to rank using gradient descent. Proceedings of the 22nd International Conference on Machine Learning, P. 89-96.
Cao Z., Qin T., Liu T.Y., Tsai M.F., Li H. (2007). Learning to rank: from pairwise approach to listwise approach. Proceedings of the 24th International Conference on Machine Learning, P. 129-136.
Gao L., Ma X., Lin J., Callan J. (2023). Precise zero-shot dense retrieval without relevance labels. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, P. 1762-1777.
Izacard G., Grave E. (2021). Leveraging passage retrieval with generative models for open domain question answering. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, P. 874-880.
Jiang Z., Xu F.F., Gao L., Sun Z., Liu Q., Dwivedi-Yu J., Neubig G. (2023). Active retrieval augmented generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, P. 7969-7992.
Karpukhin V., Oguz B., Min S., Lewis P., Wu L., Edunov S., Yih W.T. (2020). Dense passage retrieval for open-domain question answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, P. 6769-6781.
Kumar S., Sumers T.R., Agarwal T., Wiegreffe S., Bills S., Marasović A., Griffiths T.L. (2023). Exploring model architecture and training data contributions to the lost-in-the-middle effect. arXiv preprint arXiv:2309.08493.
Kwiatkowski T., Palomaki J., Redfield O., Collins M., Parikh A., Alberti C., Petrov S. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, P. 453-466.
Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Kiela D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, P. 9459-9474.
Liu N.F., Lin K., Hewitt J., Paranjape A., Bevilacqua M., Petroni F., Liang P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, P. 157-173.
MacAvaney S., Yates A., Cohan A., Goharian N. (2019). CEDR: Contextualized embeddings for document ranking. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, P. 1101-1104.
Nogueira R., Cho K. (2019). Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
Qu Y., Ding Y., Liu J., Liu K., Ren R., Zhao W.X., Wen J.R. (2021). RocketQA: An optimized training method to dense passage retrieval for open-domain question answering. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, P. 5835-5847.
Rogers A., Kovaleva O., Rumshisky A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, P. 842-866.
Shuster K., Poff S., Chen M., Kiela D., Weston J. (2021). Retrieval augmentation reduces hallucination in conversation. Findings of the Association for Computational Linguistics: EMNLP 2021, P. 3784-3803.
Trivedi H., Balasubramanian N., Khot T., Sabharwal A. (2022). MuSiQue: Multihop questions through single-hop question composition. Transactions of the Association for Computational Linguistics, 10, P. 539-554.
Vig J., Belinkov Y. (2019). Analyzing the structure of attention in a transformer language model. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, P. 63-76.
Wang S., Liu Y., Xu Y., Zhu C., Zeng M. (2023). Want to reduce labeling cost? GPT-3 can help. Findings of the Association for Computational Linguistics: EMNLP 2021, P. 4195-4205.
Yang Z., Qi P., Zhang S., Bengio Y., Cohen W.W., Salakhutdinov R., Manning C.D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, P. 2369-2380.
Yoran O., Wolfson T., Bogin B., Katz U., Deutch D., Berant J. (2023). Answering questions by meta-reasoning over multiple chains of thought. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, P. 15395-15434.
Zhang S., Roller S., Goyal N., Artetxe M., Chen M., Chen S., Zettlemoyer L. (2023). OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Mitigating the lost-in-the-middle phenomenon: a context re-ranking strategy for improving retrieval-augmented generation performance

Похожие статьи

Другие статьи из раздела «Информационные технологии»