Advanced chunking: Moving beyond arbitrary token chunking

Introduction

Chunking is a critical step in RAG pipelines, where the text is divided into smaller, manageable segments for embedding and retrieval. The way text is chunked can significantly impact the system's overall performance, influencing both retrieval precision and the quality of generated responses.

While naive chunking methods, such as splitting text by a fixed number of tokens or characters, are easy to implement, they often disrupt the logical flow of information. This can result in irrelevant or incomplete retrievals, which hinder the model's ability to generate coherent and contextually accurate responses.

In this chapter, we will explore advanced chunking techniques that aim to address these limitations. From semantic chunking, which preserves contextual coherence, to innovative approaches like late chunking, we will demonstrate how these methods can enhance the performance of RAG systems. By tailoring chunking to the needs of both retrieval and generation, these approaches ensure more accurate information retrieval and improved output quality.

Naive chunking & data

Naive chunking divides text into sections based on fixed rules, such as a set number of tokens, characters, or lines. This simple approach is easy to implement but often disrupts the logical flow of the text, splitting sentences or paragraphs inappropriately, which can compromise retrieval and downstream processing.

For this chapter, we’ll use Alice’s Adventures in Wonderland by Lewis Carroll. This classic tale, known for its whimsical narrative and imaginative scenes, provides an engaging dataset to demonstrate the impact of various chunking methods. Starting with naive chunking, we’ll highlight its shortcomings and use this as a foundation to explore more advanced techniques for creating meaningful text segments.

Code example

#!pip install llama-index llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader
# Download the text from Project Gutenberg
url = "https://www.gutenberg.org/files/11/11-0.txt"

documents = SimpleWebPageReader(html_to_text=True).load_data(
    [url]
)

from llama_index.core.node_parser import SentenceSplitter
# Baseline splitter: sentence-aware chunks of up to 512 tokens
base_splitter = SentenceSplitter(chunk_size=512)

base_nodes = base_splitter.get_nodes_from_documents(documents)

print(base_nodes[0].get_content())
print(base_nodes[1].get_content())

First two chunks

[Chunk 1 and Chunk 2 output omitted]

Semantic chunking

Semantic chunking is an approach to dividing text into contextually meaningful sections rather than using fixed token or character limits. This technique ensures that each chunk captures a coherent unit of information, improving its relevance for RAG systems.

Key Features:

  1. Dynamic Splitting: Semantic chunking identifies meaningful boundaries, such as paragraphs, sections, or sentences, instead of following arbitrary length rules.
  2. Contextual Coherence: Chunks are designed to preserve logical flow, making each unit self-contained and informative when retrieved.
  3. Improved Retrieval: By encapsulating complete ideas or topics, semantic chunks are more likely to produce accurate and relevant results during retrieval.

Use Case:

This method is particularly useful for documents where sections naturally represent distinct concepts, such as FAQs, technical documentation, or legal texts. It enables retrieval systems to fetch and process well-defined, contextually relevant content more effectively.

Code example

This example was adapted from our friends at LlamaIndex: Google Colab

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
import os

os.environ["OPENAI_API_KEY"] = ""  # set your OpenAI API key here

embed_model = OpenAIEmbedding()

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

semantic_nodes = splitter.get_nodes_from_documents(documents)

First three chunks

[Chunk 1, Chunk 2 and Chunk 3 output omitted]

I have included a third chunk here, because the semantic splitter gets a bit confused on the listing of chapters. You can compare the third chunk to what was created above using naive chunking. It could be argued that the semantic chunk is a logical part of the story, instead of an arbitrary chunk of X tokens.
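For intuition, here is roughly what the semantic splitter does under the hood. This is a simplified sketch of the idea rather than LlamaIndex's exact implementation (it ignores buffer_size, for instance): embed each sentence, measure the distance between neighbouring sentences, and place a breakpoint wherever that distance exceeds the chosen percentile, which is what breakpoint_percentile_threshold=95 controls in the code above.

import numpy as np
from llama_index.embeddings.openai import OpenAIEmbedding

def semantic_split(sentences, percentile=95):
    # Assumes a list of at least two sentences
    embed_model = OpenAIEmbedding()
    embeddings = np.array(embed_model.get_text_embedding_batch(sentences))
    # Cosine distance between each sentence and the next one
    norms = np.linalg.norm(embeddings, axis=1)
    sims = np.sum(embeddings[:-1] * embeddings[1:], axis=1) / (norms[:-1] * norms[1:])
    distances = 1 - sims
    # A breakpoint is any gap whose distance is unusually large
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks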

Late chunking

Late chunking is an advanced approach to embedding and chunking text that shifts the traditional paradigm. Unlike conventional methods, where documents are split into smaller chunks before being embedded, late chunking defers the splitting process until after generating token-level embeddings for the entire document. This approach allows for more contextually informed chunking and enables embedding strategies that better capture the relationships between tokens within and across chunks.

How Late Chunking Works

1. Generate Token-Level Embeddings for the Entire Document

The process begins with the entire document being fed into the embedding model as a single input. Instead of splitting the document upfront, every individual token is embedded, resulting in a detailed token-level embedding matrix. This matrix captures the semantic relationships between all tokens in the document.

  • Why do this?
    • It retains the full context of the document when generating embeddings.
    • Token embeddings are aware of their position and relationship to other tokens in the entire document, providing richer representations.
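A minimal sketch of this first step, assuming a long-context embedding model. LlamaIndex does not expose token-level embeddings directly, so the sketch uses Hugging Face transformers; jinaai/jina-embeddings-v2-small-en is just one example of a model whose context window (8192 tokens) is large enough for long inputs.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-small-en"  # example long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Feed the whole document in as a single input (truncated to the model's limit)
text = documents[0].text
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    outputs = model(**inputs)

# One embedding per token: shape (num_tokens, hidden_dim)
token_embeddings = outputs.last_hidden_state.squeeze(0)
print(token_embeddings.shape)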

2. Group Tokens into Chunks Post-Embedding

Once token-level embeddings are generated, they are grouped into chunks. Unlike naive chunking, where chunks are defined by fixed-length windows, late chunking allows for more informed grouping based on the semantic or contextual relationships between tokens. Key methods for grouping include:

  • Fixed-Length Grouping: Tokens are divided into chunks of a fixed size (e.g., 200 tokens per chunk). This is straightforward but doesn't account for semantic boundaries.
  • Semantic Grouping: Tokens are grouped based on logical or contextual markers, such as paragraph breaks, sentences, or topics. For example:
    • Tokens belonging to the same paragraph or semantic idea are grouped into a single chunk.
    • Sentence boundaries or special tokens like [SEP] can guide the segmentation.
  • Sliding Windows with Overlap: To ensure continuity, overlapping chunks are created. For example, a window might group tokens 0–199 as one chunk, and tokens 100–299 as another.

This grouping ensures that chunks are coherent and contextually meaningful, enhancing their utility in retrieval tasks.
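The grouping itself does not need the embedding model at all; it only has to produce token index ranges. Below is a sketch of two of the strategies listed above: sliding windows with overlap, and a rough sentence-based grouping that maps character-level sentence boundaries to token positions via the tokenizer's offset mapping (the helper names and parameters are illustrative).

def fixed_spans(num_tokens, chunk_size=200, overlap=100):
    # Sliding windows with overlap, e.g. (0, 200), (100, 300), ...
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, num_tokens))
            for start in range(0, num_tokens, step)]

def sentence_spans(text, tokenizer, max_length=8192):
    # Map (very rough) sentence boundaries to token index ranges using the
    # tokenizer's offset mapping; real code would use a proper sentence splitter
    offsets = tokenizer(text, return_offsets_mapping=True,
                        truncation=True, max_length=max_length)["offset_mapping"]
    sentence_ends = [i + 1 for i, ch in enumerate(text) if ch in ".!?"]
    spans, start_token = [], 0
    for end_char in sentence_ends:
        # First token that starts at or after the end of this sentence
        end_token = next((i for i, (s, _) in enumerate(offsets) if s >= end_char),
                         len(offsets))
        if end_token > start_token:
            spans.append((start_token, end_token))
            start_token = end_token
    if start_token < len(offsets):
        spans.append((start_token, len(offsets)))
    return spans

# e.g. spans = sentence_spans(documents[0].text, tokenizer), reusing the tokenizer above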

3. Conditionally Embed Chunks

After grouping, embeddings for individual chunks are created using the token embeddings. A pooling operation (e.g., average pooling or max pooling) is typically applied to reduce the grouped token embeddings into a single vector representation for each chunk. These chunk embeddings are no longer independent; they carry contextual signals from the surrounding tokens, making them more representative of the entire document's semantics.

  • Why is this better?
    • Chunk embeddings are not treated as isolated units; instead, they are conditioned on the entire document context.
    • This avoids the loss of coherence that can occur with traditional pre-split chunking methods.
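A sketch of the pooling step. The random matrix stands in for the token_embeddings produced in step 1 and the spans for the output of step 2, so the snippet runs on its own; in a real pipeline you would plug in the actual values.

import torch

token_embeddings = torch.randn(1200, 512)   # placeholder for step 1's (num_tokens, hidden_dim)
spans = [(0, 200), (100, 300), (300, 520)]  # placeholder chunk spans from step 2

# Average pooling: one vector per chunk, computed from token states that
# attended to the entire document
chunk_embeddings = torch.stack([
    token_embeddings[start:end].mean(dim=0)
    for start, end in spans
])
print(chunk_embeddings.shape)  # (num_chunks, hidden_dim)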

Key Advantages

  • Context-Aware Chunking: Chunks are formed with full awareness of the document's overall structure and relationships between tokens.
  • Improved Embedding Quality: Late chunking captures more global relationships, which enhances the utility of chunk embeddings in downstream tasks like retrieval or generation.
  • Flexibility: By deferring chunking until after embedding, you can dynamically adapt chunking strategies based on the task (e.g., retrieval, summarization).

Code example

The full example can be found here: Google Colab. I adapted the code to our Alice in Wonderland use case, but late chunking didn't outperform traditional chunking. There can be several reasons for this, and it is important not to throw the baby out with the bathwater: this is a new and experimental technique, so performance will likely vary depending on the use case. There is no mature package or complete end-to-end example yet, so keep your eyes open for news regarding late chunking; the math behind it makes sense!

Other notable techniques

Sentence-window splitting

The Sentence-Window Splitter adjusts chunk sizes for different stages of the RAG pipeline. For retrieval, it processes individual sentences to improve precision, while for generation, it adds surrounding sentences to provide the LLM with more context. This approach aims to balance the requirements of both retrieval and generation stages.

from llama_index.core.node_parser import SentenceWindowNodeParser
sentence_window_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # number of surrounding sentences to include on each side
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

sentence_window_nodes = sentence_window_parser.get_nodes_from_documents(documents)

[Example output omitted: a retrieved sentence and its surrounding window context]
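At query time, the trick is to retrieve over the single sentences but hand the LLM the stored window instead. In LlamaIndex this is typically done with the MetadataReplacementPostProcessor; the sketch below assumes the default OpenAI embedding model and LLM (so OPENAI_API_KEY must be set), and the query is just an example.

from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

sentence_index = VectorStoreIndex(sentence_window_nodes)

query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    # Replace each retrieved sentence with its surrounding window
    # before it is passed to the LLM for generation
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

response = query_engine.query("Why does Alice follow the White Rabbit?")
print(response)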

Auto-Merging Index

The Auto-Merging Index operates on a similar principle to the Sentence-Window Splitter but uses token length as the primary criterion instead of sentences. During retrieval, smaller, contextually related text segments are merged dynamically into a larger context that provides more information to the generator LLM.

By focusing on token length, the Auto-Merging Index provides a more flexible method for handling text that might not naturally divide into clean sentences. Like the Sentence-Window Splitter, it aims to optimize retrieval by maintaining small chunks for precision, while ensuring sufficient context for generation by aggregating related text.
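A sketch of the usual LlamaIndex pattern for this setup, using HierarchicalNodeParser together with AutoMergingRetriever; the chunk sizes, top-k and query below are illustrative rather than tuned values.

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Parse the text into a hierarchy of chunks (parents contain their children)
hierarchical_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = hierarchical_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# The docstore must hold all nodes so parents can be looked up when merging
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Only the small leaf chunks are embedded and retrieved for precision...
leaf_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# ...and the retriever merges them back into their parent chunk when enough
# siblings are retrieved together, giving the LLM a bigger context
retriever = AutoMergingRetriever(
    leaf_index.as_retriever(similarity_top_k=6),
    storage_context,
    verbose=True,
)
merged_nodes = retriever.retrieve("What happens at the Mad Hatter's tea party?")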


Document Summary Index

The Document Summary Index uses summaries as a high-level retrieval mechanism while linking them to the underlying chunks they represent. In this approach, retrieval is performed based on summaries, and once a relevant summary is identified, the associated chunks are provided for detailed context. A summary can map to a single chunk or combine insights from multiple chunks, making this method effective for large or dense documents where summaries guide retrieval to more detailed content.
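A sketch of this pattern in LlamaIndex using DocumentSummaryIndex. Note that building the index calls the LLM once per document to write the summary, so on a large corpus this gets costly; with our single Alice in Wonderland document it produces just one summary, and the query is only an example.

from llama_index.core import DocumentSummaryIndex, get_response_synthesizer

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

# Retrieval is performed against the summaries; the matched document's chunks
# are then used to answer in detail
query_engine = doc_summary_index.as_query_engine(response_mode="tree_summarize")
print(query_engine.query("Who does Alice meet at the tea party?"))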


So what?

Well, you guessed it: run your own experiments. I have not found a comprehensive overview of these advanced chunking methods, so it is up to you to run one for your specific use case. However, these experiments can get very costly, since they require you to re-index your whole dataset. Therefore, I recommend running them only on a manageable subset of your data.

My personal recommendation would be to go with semantic chunking or sentence-window splitting. Sentences and paragraphs maintain the logical separation of the text, which intuitively makes sense. However, always rely on hard data over intuition! If you are wondering how to run these experiments, refer to the module [LINK], where we provide a deep dive on experimenting with your own use case.

Conclusion

Effective chunking is essential for optimizing the performance of RAG systems. While naive chunking provides a simple starting point, advanced techniques like semantic chunking, late chunking or sentence-window splitting offer potential improvements to your RAG system.

These methods can enhance both retrieval precision and generation quality, making them valuable tools for tackling complex queries and improving overall system reliability. By understanding and applying these advanced chunking strategies, you can unlock the full potential of your RAG pipeline and ensure better outcomes for your use cases.

Next on the menu: fine-tuning embedding models. This topic is close to my heart, since it comes close to treating the whole RAG system as a process to be optimized using gradient descent. 📉

Jupyter notebook: Google Colab