
Could Multi-Billion Token Prompts Make Retrieval-Augmentation Obsolete?

The ability of large language models to understand and generate relevant responses hinges on the amount of context they can process at once: the "prompt window" size. For a model like GPT-4, this is limited to roughly 8,000 tokens in its base configuration.

To compensate, techniques like Retrieval-Augmented Generation (RAG) have been developed. A RAG pipeline first retrieves relevant documents or passages from a corpus based on the input, then conditions the language model on this retrieved knowledge along with the original prompt. This lets the model go beyond its limited context window and draw on a broader knowledge base, yielding impressive gains on knowledge-intensive tasks.
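Here is a minimal sketch of that retrieve-then-prompt flow. The toy corpus, the bag-of-words similarity scoring, and the prompt template are all illustrative assumptions; a real system would use an embedding index and then send the assembled prompt to an actual language model.

```python
from collections import Counter
import math

# Toy corpus standing in for a real document store or vector index.
CORPUS = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Retrieval-Augmented Generation combines a retriever with a language model.",
    "GPT-4 originally shipped with an 8K-token context window.",
]

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = bag_of_words(query)
    return sorted(CORPUS, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)[:k]

def build_rag_prompt(query: str) -> str:
    """Condition the model on retrieved passages plus the original question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Use the context to answer.\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_rag_prompt("When was the Eiffel Tower finished?"))
```

The key point is that only the top-k passages, not the whole corpus, ever reach the model's prompt window.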

But what if we could massively increase prompt window limits from thousands to billions of tokens? For reference, Gemini 1.5 already offers a 1-million-token window. Models could then ingest entire relevant datasets alongside the input prompt at inference time.

Just imagine: instead of retrieving snippets of relevant knowledge, you could simply prompt the model with the full text of Wikipedia, online databases, research papers, and more, concatenated with your input query. The model could attend directly to all of that material at once within its expanded multi-billion-token prompt window.
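A sketch of what that "stuff everything into the prompt" approach might look like, assuming a local directory of text files and a hypothetical billion-token limit; the 4-characters-per-token heuristic is a rough approximation, not an exact tokenizer.

```python
from pathlib import Path

MAX_PROMPT_TOKENS = 1_000_000_000  # hypothetical multi-billion-token window

def rough_token_count(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return len(text) // 4

def build_full_context_prompt(query: str, corpus_dir: str) -> str:
    """Concatenate an entire local corpus ahead of the user query."""
    docs = [p.read_text(errors="ignore") for p in Path(corpus_dir).glob("**/*.txt")]
    prompt = "\n\n".join(docs) + f"\n\nQuestion: {query}\nAnswer:"
    if rough_token_count(prompt) > MAX_PROMPT_TOKENS:
        raise ValueError("Corpus exceeds even a billion-token prompt window.")
    return prompt
```

There is no retrieval step at all here: selection of relevant material is left entirely to the model's attention over the full corpus.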

In such a scenario, approaches like RAG, which stitch together retrieved knowledge, could become obsolete. The model would already have the entire relevant corpus ingested in the prompt to begin with.

Of course, achieving prompts of this scale presents extraordinary computational and engineering challenges. We must also carefully consider the speed, maintenance, and cost implications.

While directly ingesting colossal knowledge prompts could be extremely powerful, it may be considerably slower than RAG’s retrieval step at inference time. There are also questions around efficiently updating and maintaining such massive prompts as knowledge evolves.

And the compute and financial costs of multi-billion-token inference could be staggering, potentially making it impractical outside the highest-stakes usage scenarios.
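A back-of-the-envelope calculation illustrates the scale; the per-token price below is a hypothetical placeholder, not a quoted rate from any provider.

```python
# Rough cost of a single multi-billion-token inference call.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00   # assumed: $1 per 1M input tokens (hypothetical)
PROMPT_TOKENS = 2_000_000_000           # a two-billion-token prompt

cost_per_call = PROMPT_TOKENS / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"${cost_per_call:,.0f} per inference call")  # -> $2,000 per call
```

Even under that generous assumption, every single query costs thousands of dollars before accounting for latency or output tokens, whereas a RAG call pays only for the handful of retrieved passages.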

Techniques like RAG aim for more targeted knowledge retrieval to balance performance, efficiency, and cost. While they may become antiquated for some use cases, their judicious retrieval could remain compelling for many applications where loading the entire relevant data universe is overkill.

The future evolution of large language models and their ability to process vast amounts of information hinges on finding this balance—leveraging both the massive potential of expanded prompt windows and the strategic precision of retrieval-augmentation.