# Duplication in Ingested Documents

There are 2 pathways to ingest new documents into your project - direct file upload and web scrape. We have a content-based matching logic in place to check if the incoming document is already present in the system.&#x20;

The check if performed after the text extraction step in the pipeline and is as follows:&#x20;

* First, Supabase is queried based on either `s3_path` (if direct upload) or `url` (if web scrape).
* If the query doesn't return anything, the incoming document is brand new and is ingested into the database.
* If the query returns some documents, we check for an exact filename or URL match among the documents.
  * If there's no exact filename/URL match, the incoming document is new and is ingested.
  * If there is an exact filename/URL match, we compare the contents of the incoming document to the existing document.
    * If the document contents match, the incoming document is considered a duplicate and is **not** ingested.
    * If the contents do not match, the incoming document is treated as an updated version of the existing older document. The older document is removed from the database and the incoming document is ingested.\ <br>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.uiuc.chat/features/duplication-in-ingested-documents.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
