Retrieval Methods
In RAG (retrieval augmented generation), it's critical to actually retrieve the necessary context to answer a user query. Here are my favorite AI-forward ideas for finding the "needle in the haystack" of relevant documents for a user query.
In order of increasing complexity and novelty:
By default (as of May 2024), UIUC.chat uses standard vector retrieval. It compares an embedding of the user query with embeddings of all the documents in the user's project, and the top 80 document chunks are passed to the final LLM call to answer the user query.
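As a rough illustration (not the production code), here is a minimal sketch of that lookup as cosine-similarity KNN over precomputed embeddings; the in-memory arrays stand in for whatever embedding model and vector store a project actually uses.

```python
import numpy as np

def top_k_chunks(query_embedding: np.ndarray,
                 chunk_embeddings: np.ndarray,
                 chunks: list[str],
                 k: int = 80) -> list[str]:
    """Return the k chunks whose embeddings are most cosine-similar to the query."""
    # Normalize both sides so dot products become cosine similarities.
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    docs = chunk_embeddings / (np.linalg.norm(chunk_embeddings, axis=1, keepdims=True) + 1e-9)
    sims = docs @ q
    # Indices of the k highest-similarity chunks, best first.
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]
```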
"Parent document retrieval" refers to expanding the context around the retrieved document chunks. In our case, for each of the top 5 chunks retrieved, we additionally retrieve the 2 preceding and 2 subsequent chunks to the retrieved chunk. In effect, for the closest matches from standard RAG, we grab a little more "above and below" the most relevant chunks/paragraphs we retrieved.
Intuitively, we say this solves the "off by one" problem. For example, users ask questions like "What is the solution to the Bernoulli equation?"
Say we have a textbook in our document corpus. Naively, this query embedding matches the problem setup in the textbook, but not the problem solution that follows in the subsequent paragraphs. Therefore, we expand the context to ensure we capture both the problem description and the solution in our relevant contexts. This works particularly well for textbook questions.
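A minimal sketch of that expansion, assuming chunks are stored in document order and addressable by index; the window of 2 mirrors the description above.

```python
def expand_with_neighbors(hit_indices: list[int], chunks: list[str], window: int = 2) -> list[str]:
    """For each retrieved chunk index, also pull the `window` chunks before and after it."""
    keep: set[int] = set()
    for i in hit_indices:
        lo = max(0, i - window)
        hi = min(len(chunks), i + window + 1)
        keep.update(range(lo, hi))
    # Return the expanded context in document order, deduplicated.
    return [chunks[i] for i in sorted(keep)]

# e.g. expand_with_neighbors(top_5_indices, textbook_chunks) grabs the problem
# setup *and* the solution paragraphs that follow it.
```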
Key idea: use an LLM to diversify your queries, then use an LLM to filter out irrelevant passages before sending them to a big LLM, e.g. GPT-4 (many times more expensive), for final answer generation.
The retrieval data flow:
User query ->
LLM generates multiple similar queries w/ more keywords ->
Vector KNN Retrieval & reranking ->
Filter out truly bad/irrelevant passages w/ small LLM ->
Get parent docs (see above) for top 5 passages ->
A set of documents/passages for 'final answer' generation w/ large LLM.
Challenge: LLM filtering takes 1-5 seconds at minimum, even when fully parallelized, which is a dramatic slowdown. This will improve quickly as small LLMs get smarter and even faster.
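Stitched together, the flow looks roughly like the sketch below. Every helper is a hypothetical stand-in for its stage (query diversification, retrieval plus reranking, small-LLM filtering, final generation), not the actual UIUC.chat code; expand_with_neighbors is the parent-document sketch from earlier.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for each stage of the flow above.
def generate_query_variants(query: str) -> list[str]: ...
def retrieve_and_rerank(queries: list[str], chunks: list[str]) -> list[int]: ...
def is_relevant(query: str, chunk: str) -> bool: ...
def answer_with_large_llm(query: str, context: list[str]) -> str: ...

def answer_query(user_query: str, chunks: list[str]) -> str:
    # 1. A small LLM rewrites the query into several keyword-rich variants.
    variants = generate_query_variants(user_query)

    # 2. Vector KNN retrieval + reranking over all variants -> candidate chunk indices.
    candidate_ids = retrieve_and_rerank(variants, chunks)

    # 3. Filter out clearly irrelevant passages with a small, cheap LLM, in parallel.
    with ThreadPoolExecutor() as pool:
        flags = list(pool.map(lambda i: is_relevant(user_query, chunks[i]), candidate_ids))
    survivors = [i for i, keep in zip(candidate_ids, flags) if keep]

    # 4. Expand the top 5 survivors with neighboring chunks ("parent document retrieval").
    context = expand_with_neighbors(survivors[:5], chunks)  # sketched earlier

    # 5. Final answer generation with the large (expensive) LLM.
    return answer_with_large_llm(user_query, context)
```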
Key idea: It's retrieval with function calling to explore the documents.
Mirroring the human research process, we let the LLM decide if the retrieved context is relevant and, more importantly, where to look next. The LLM chooses from a set of options like "next page" or "previous page" to explore the document and find the best passages to answer our target question.
LLM-guided retrieval thrives on structured data.
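One way to wire this up is with OpenAI-style tool definitions that the LLM can call while reading; the tool names and parameters below are illustrative assumptions, not the exact set UIUC.chat exposes.

```python
# Illustrative OpenAI-style tool schema for letting the LLM explore a document.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "next_page",
            "description": "Fetch the page after the currently viewed page.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "previous_page",
            "description": "Fetch the page before the currently viewed page.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_section",
            "description": "Fetch a section of the paper by its outline number, e.g. '1.1'.",
            "parameters": {
                "type": "object",
                "properties": {"section_number": {"type": "string"}},
                "required": ["section_number"],
            },
        },
    },
]
```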
TL;DR:
Start with Grobid. Excellent at parsing sections and references.
Re-run with Unstructured. Replace all figures and tables from Grobid with Unstructured's output; we find their YOLOX model is best at parsing tables accurately.
For math (LaTeX), use Nougat.
Based on our empirical testing and conversations with domain experts from NCSA and Argonne, this is our PDF parsing pipeline for PubMed, arXiv, and typical scientific PDFs.
Grobid is the best at creating outlines from scientific PDFs. The Full-Text module properly segments articles into sections, like 1. introduction, 1.1 background on LLMs, 2. methods... etc.
Precise outlines are crucial for LLM-guided retrieval, so the LLM can properly request other sections of the paper.
We highly recommend the doc2json wrapper around Grobid to make it easier to use the outputs.
Unstructured is the best at parsing tables. In our experiments with tricky PDFs, YOLOX is slightly superior to Detectron2.
Nougat is the best at parsing mathematical symbols. Excellent at parsing rendered LaTeX symbols back into raw LaTeX code. This method uses an encoder-decoder Transformer model, so realistically it requires a GPU to run.
Storage infra
Store PDFs in an object store, like Minio (a self-hosted S3 alternative).
Store processed text in SQLite, a phenomenal database for this purpose.
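A minimal sketch of that split, assuming a local MinIO instance and a papers.db SQLite file; the bucket name, object path, and contexts columns are illustrative, not the production schema.

```python
import sqlite3
from minio import Minio

# Raw PDFs go to object storage (self-hosted, S3-compatible MinIO).
client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)
if not client.bucket_exists("papers"):
    client.make_bucket("papers")
client.fput_object("papers", "arxiv/2401.00001.pdf", "/tmp/2401.00001.pdf")  # illustrative path

# Processed text goes to SQLite.
conn = sqlite3.connect("papers.db")
conn.execute("CREATE TABLE IF NOT EXISTS contexts (id TEXT PRIMARY KEY, text TEXT, num_tokens INTEGER)")
conn.execute("INSERT OR REPLACE INTO contexts VALUES (?, ?, ?)", ("ctx_abc123", "Some parsed paragraph...", 42))
conn.commit()
conn.close()
```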
Processing Infra
Python main process
Use a Queue of PDFs to process.
Use a ProcessPoolExecutor to parallelize processing (see the sketch after this list).
Use tempfile objects to keep the machine's disk from filling up.
Grobid - host an endpoint on a capable server; a GPU is recommended but not critical.
Unstructured - create a Flask/FastAPI endpoint on a capable server.
Nougat - create a Flask/FastAPI endpoint on a capable server.
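Putting the main-process pieces together, here is a hedged sketch; parse_pdf is a placeholder for calls to the Grobid/Unstructured/Nougat endpoints, and the worker count is arbitrary.

```python
import tempfile
from concurrent.futures import ProcessPoolExecutor
from queue import Queue

def parse_pdf(pdf_bytes: bytes) -> str:
    """Placeholder for calling the Grobid/Unstructured/Nougat endpoints (assumption)."""
    # Write to a temporary file so large PDFs never accumulate on disk.
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        return f"parsed {tmp.name}"  # call the parsing endpoints here

def main(pdf_queue: "Queue[bytes]") -> None:
    jobs = []
    # Fan work out across CPU cores; each worker handles one PDF at a time.
    with ProcessPoolExecutor(max_workers=8) as pool:
        while not pdf_queue.empty():
            jobs.append(pool.submit(parse_pdf, pdf_queue.get()))
        for job in jobs:
            print(job.result())

if __name__ == "__main__":
    q: Queue = Queue()
    q.put(b"%PDF-1.7 ...")  # toy placeholder bytes
    main(q)
```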
Our goal is to parse and store academic PDFs with maximally structured information so that LLMs can adeptly explore the documents by "asking" to view PDF sections on demand.
3 tables to store academic PDFs:
Article
Section
Contexts
We use FastnanoID to quickly generate short, unique random IDs for entries.
Papers
"A paper has sections and references"
Sections (including references)
"A section has contexts"
Contexts
The base unit of text. Each context must fit within an LLM embedding model's context window (typically 8k tokens, or more precisely 2^13 = 8,192 tokens).
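For illustration, a hedged sketch of enforcing that limit with tiktoken; the cl100k_base encoding is an assumption, since the real token count depends on whichever embedding model is in use.

```python
import tiktoken

MAX_TOKENS = 8192  # 2**13, the embedding model's context window

def split_into_contexts(section_text: str) -> list[str]:
    """Split a section into contexts that each fit in the embedding window."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: the actual tokenizer may differ
    tokens = enc.encode(section_text)
    # Slice the token stream into windows of at most MAX_TOKENS and decode each back to text.
    return [enc.decode(tokens[i:i + MAX_TOKENS]) for i in range(0, len(tokens), MAX_TOKENS)]
```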
Here's a full SQLite database you can download to explore the final form of our documents. I recommend using DB Browser for SQLite to view the tables.
SQLite, and most SQL implementations, don't allow for a single field to point to an array of foreign keys, so we use the Junction table pattern for our one-to-many relationships.
Junction tables simply allow one article to have many sections, and one section to have many contexts.
Article_Sections table
Section_Contexts table
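A minimal SQLite sketch of the junction-table pattern, using the Article_Sections and Section_Contexts names above; the non-ID columns are abbreviated for illustration.

```python
import sqlite3

conn = sqlite3.connect("papers.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Article  (id TEXT PRIMARY KEY, title TEXT, num_tokens INTEGER);
CREATE TABLE IF NOT EXISTS Section  (id TEXT PRIMARY KEY, section_title TEXT, section_number TEXT);
CREATE TABLE IF NOT EXISTS Contexts (id TEXT PRIMARY KEY, text TEXT, num_tokens INTEGER);

-- One article has many sections.
CREATE TABLE IF NOT EXISTS Article_Sections (
    article_id TEXT REFERENCES Article(id),
    section_id TEXT REFERENCES Section(id)
);

-- One section has many contexts.
CREATE TABLE IF NOT EXISTS Section_Contexts (
    section_id TEXT REFERENCES Section(id),
    context_id TEXT REFERENCES Contexts(id)
);
""")
conn.commit()
conn.close()
```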
We will publish our SQLite files here for PubMed and other academic datasets when available.
Article table

| NanoID | Num tokens | Title | Date published | Journal | Authors | Sections |
|---|---|---|---|---|---|---|
|  |  |  |  |  |  | [array of pointers to Section objects] |

Section table

| NanoID | Num tokens | Section title | Section number | Contexts |
|---|---|---|---|---|
|  |  |  | "ref" if it's a reference, otherwise the section number | [array of pointers to Context objects] |

Contexts table

| NanoID | Text | Section title | Section number | Num tokens | embedding-nomic_1.5 | Page number | Stop reason |
|---|---|---|---|---|---|---|---|
|  | <Raw text> |  |  |  |  |  | "Section", or "Token limit" if the section is larger than our embedding model's context window |