Why do RAG pipelines fail? Advanced RAG Patterns — Part 1

Ozgur Guler
6 min read · Oct 16, 2023

The failures in RAG pipelines can be attributed to a cascade of challenges spanning the retrieval of data, the augmentation of this information, and the subsequent generation process. Beyond the technical intricacies, external factors like data biases, ever-evolving domains, and the dynamic nature of language further complicate the landscape. This article delves into the myriad reasons behind these failures, offering a holistic view of the hurdles faced by RAG implementations.

In this first post of the “Advanced RAG Pipelines” series, we will go over why RAG pipelines fail by dividing the potential problems into (a) retrieval problems, (b) augmentation problems, and (c) generation problems.

a) Retrieval problems

Usually, when a RAG system is not performing well, it is because the retrieval step is struggling to find the right context to use for generation.

When using vector search to return similarity-based results, discrepancies can arise for several reasons:

1. Semantic Ambiguity: Vector representations, such as word embeddings, may not capture nuanced differences between concepts. For instance, the word “apple” might refer to the fruit or the tech company. Embeddings may conflate these meanings, leading to irrelevant results.

2. Magnitude vs. Direction: Cosine similarity, the most common measure, compares only the direction of vectors, not their magnitude. Passages that point in a similar direction in embedding space can therefore score as close matches even when they are semantically distant.

3. Granularity Mismatch: Your query vector may represent a specific concept, but if your dataset has only broader topics, you may retrieve broader results than desired.

4. Vector Space Density: In high-dimensional spaces, the difference in distances between closely related and unrelated items might be very small. This can lead to seemingly unrelated results being considered relevant.

5. Global vs. Local Similarities: Most vector search mechanisms identify global similarities. Sometimes, you might be interested in local or contextual similarities which are not captured.

6. Sparse Retrieval Challenges: Retrieval mechanisms might struggle to identify the right passages in vast datasets, especially if the information required is spread thinly across multiple documents.

For example, when querying a niche topic like “Graph Machine Learning use-cases in cloud security”, if the retrieval system fetches generic articles on “cloud security” without specifics on Graph ML, the subsequent generation will miss the mark.
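
The sketch below makes this kind of failure concrete. It is a minimal sketch, assuming the sentence-transformers library and the off-the-shelf all-MiniLM-L6-v2 model; the mini-corpus and query are hypothetical.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

# Hypothetical mini-corpus: one specific passage, one broader passage, one unrelated.
corpus = [
    "Graph Machine Learning can flag lateral movement in cloud security logs.",
    "Cloud security covers identity management, encryption, and compliance.",
    "Apple released a new iPhone this autumn.",
]
query = "Graph Machine Learning use-cases in cloud security"

# Embed everything and rank by cosine similarity (direction only; magnitude is ignored).
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]

for score, passage in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{score:.3f}  {passage}")
# If the broad "cloud security" passage scores close to the specific Graph ML one,
# that is the granularity mismatch described above.
```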

As generative AI and transformers become more popular, dynamic embeddings and context-aware searches might reduce some of these discrepancies. Additionally, hybrid models combining symbolic and sub-symbolic AI elements might offer better precision.
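
To illustrate the hybrid idea, here is a minimal sketch that blends lexical BM25 scores with dense cosine scores, assuming the rank_bm25 library alongside sentence-transformers; the 50/50 blend weight is an arbitrary illustration, not a tuned value.

```python
# pip install rank-bm25 sentence-transformers
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Graph Machine Learning can flag lateral movement in cloud security logs.",
    "Cloud security covers identity management, encryption, and compliance.",
]
query = "Graph Machine Learning use-cases in cloud security"

# Lexical side: BM25 rewards exact term overlap (the "symbolic" signal).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
lexical = list(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity over embeddings (the "sub-symbolic" signal).
model = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(
    model.encode(query, convert_to_tensor=True),
    model.encode(corpus, convert_to_tensor=True),
)[0].tolist()

def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + 1e-9) for x in xs]

# Blend the two normalized score lists with an illustrative 50/50 weight.
hybrid = [0.5 * l + 0.5 * d for l, d in zip(minmax(lexical), minmax(dense))]
for score, passage in sorted(zip(hybrid, corpus), reverse=True):
    print(f"{score:.3f}  {passage}")
```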

b) Augmentation Problems

1. Integration of Context: The challenge here is smoothly integrating the context of retrieved passages with the current generation task. If not done well, the output might appear disjointed or lack coherence.

Example: If a retrieved passage provides in-depth information about “Python’s history” and the generation task is to elaborate on “Python’s applications”, the output might overemphasize the history at the expense of applications.

2. Redundancy and Repetition: If multiple retrieved passages contain similar information, the generation step might produce repetitive content.

Example: If three retrieved articles all mention “PyTorch’s dynamic computation graph”, the generated content might redundantly emphasize this point multiple times.

3. Ranking and Priority: Deciding the importance or relevance of multiple retrieved passages for the generation task can be challenging. The augmentation process must weigh the value of each passage appropriately (a deduplicate-and-rerank sketch follows this list).

Example: For a query on “cloud security best practices”, if a retrieved passage about “two-factor authentication” is ranked lower than a less crucial point, the final output might misrepresent the importance of two-factor authentication.

4. Mismatched Styles or Tones: Retrieved content might come from sources with diverse writing styles or tones. The augmentation process needs to harmonize these differences to ensure a consistent output.

Example: If one retrieved passage is written in a casual tone while another is more formal, the final generation might oscillate between these styles, leading to a less cohesive response.

5. Over-reliance on Retrieved Content: The generation model might lean too heavily on the augmented information, leading to outputs that parrot the retrieved content rather than adding value or providing synthesis.

Example: If the retrieved passages offer multiple perspectives on “Graph Machine Learning Technologies”, but the generated output only reiterates these views without synthesizing them or providing additional insight, the augmentation hasn’t added substantial value.
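
Two of the augmentation problems above, redundancy (2) and ranking (3), are often mitigated before the prompt is even assembled: drop near-duplicate passages, then rerank the survivors with a cross-encoder that scores each (query, passage) pair jointly. A minimal sketch, again assuming sentence-transformers; the 0.9 duplicate threshold and the example passages are arbitrary illustrations.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "cloud security best practices"
passages = [
    "Enable two-factor authentication on all cloud accounts.",
    "Two-factor authentication should be switched on everywhere.",  # near-duplicate
    "Pick a descriptive name for your storage buckets.",
]

# 1) Deduplicate: keep a passage only if it is not nearly identical
#    (cosine similarity >= an illustrative 0.9) to one already kept.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode(passages, convert_to_tensor=True)
kept = []
for i in range(len(passages)):
    if all(util.cos_sim(emb[i], emb[j]).item() < 0.9 for j in kept):
        kept.append(i)

# 2) Rerank the survivors with a cross-encoder, which is usually more
#    accurate than raw embedding similarity for ordering passages.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passages[i]) for i in kept])
for score, passage in sorted(zip(scores.tolist(), [passages[i] for i in kept]), reverse=True):
    print(f"{score:.2f}  {passage}")
```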

c) Generation problems

LLM-related problems

  • Hallucinations — the LLM makes up non-factual content.
  • Misinformation — factuality problems where the training data itself contains incorrect information.
  • Limited context lengths — you are bounded by the context window of your chosen LLM; generally, the larger the context window, the more retrieved context you can submit along with the prompt (a token-budgeting sketch follows this list).
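
As a practical consequence of the last point, retrieved chunks have to fit a token budget before they reach the model. A minimal sketch, assuming OpenAI’s tiktoken tokenizer; the 3,000-token budget is an arbitrary illustration, and the chunks are assumed to arrive already ranked.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models

def pack_context(chunks: list[str], budget: int = 3000) -> str:
    """Greedily keep ranked chunks until the illustrative token budget is spent."""
    packed, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break  # leave the remaining budget for the prompt and the answer
        packed.append(chunk)
        used += n
    return "\n\n".join(packed)
```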

(For a more comprehensive take on the inherent problems with LLMs, refer to my earlier post, “Taming the Wild — Enhancing LLM Reliability”, here.)

Below is a list of further generation problems that may undermine the performance of RAG pipelines…

1. Coherence and Consistency: Ensuring the generated output is logically coherent and maintains a consistent narrative, especially when integrating retrieved information, can be challenging.

Example: The output might start discussing “Python’s efficiency in machine learning” and abruptly switch to “Python’s use in web development” without a clear transition.

2. Verbose or Redundant Outputs: The generation model might produce unnecessarily lengthy responses or repeat certain points.

Example: In elaborating on “advantages of PyTorch”, the model might mention “dynamic computation graph” multiple times in different phrasings.

3. Over-generalization: The model might provide generic answers instead of specific, detailed responses tailored to the query.

Example: A query about “differences between PyTorch and TensorFlow” might receive a broad response about the importance of deep learning frameworks without addressing the specific differences.

4. Lack of Depth or Insight: Even with relevant retrieved information, the generated response might not delve deep enough or provide insightful synthesis.

Example: When asked about “potential applications of Graph Machine Learning in cloud technologies”, the model might list general applications without elaborating or providing unique insights.

5. Error Propagation from Retrieval: Mistakes or biases in the retrieved data can be carried forward and amplified in the generation.

Example: If a retrieved passage inaccurately claims “Next.js is a backend framework”, the generated content might expand on this incorrect premise.

6. Stylistic Inconsistencies: The generated content might not maintain a consistent style, especially when trying to blend information from diverse retrieved sources.

Example: Mixing formal technical explanations with casual anecdotal content within the same response.

7. Failure to Address Contradictions: If retrieved passages contain contradictory information, the generation model might struggle to reconcile these differences or might even reproduce the contradictions in the output (a prompt-level mitigation is sketched after this list).

Example: If one retrieved source says “PyTorch is primarily for research” and another says “PyTorch is widely used in production”, the generated response might confusingly state both without clarification.

8. Context Ignorance: The generated response might sometimes miss or misinterpret the broader context or intent behind a query.

Example: In response to “Tell me a fun fact about machine learning,” the model might produce a highly technical point rather than something light and interesting for a general audience.
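
Several of these problems, notably error propagation (5), unaddressed contradictions (7), and context ignorance (8), can be partially mitigated at the prompt level by telling the model to stay inside the retrieved context and to surface disagreements instead of papering over them. A minimal sketch, assuming the current openai Python SDK; the model name and the system-prompt wording are illustrative, not a recommendation.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. "
    "If the context does not contain the answer, say you don't know. "
    "If the passages contradict each other, point out the contradiction explicitly."
)

def grounded_answer(query: str, context: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask for an answer that stays grounded in the retrieved context."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0,  # lowers variance; it does not eliminate hallucination
    )
    return response.choices[0].message.content
```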

These generation problems highlight the complexities of producing accurate, relevant, and high-quality content, even when augmented with retrieved data. They underscore the need for iterative refinement, feedback loops, and possibly even domain-specific tuning to optimize the generation process, especially in specialized fields like cloud technologies and machine learning.

Next, we will look at how we can solve these problems…

