Web Crawling Details

Types of Crawls

Limit Web Crawl Options:

  1. Equal and Below: This option restricts the crawl to pages whose URLs begin with the specified starting URL, including any pages nested under that path. For example, choosing nasa.gov/blogs will target all blog entries (like nasa.gov/blogs/new-rocket) but will not include unrelated paths such as nasa.gov/events. It's like following a single branch of a tree without jumping to a different branch.

  2. Same Subdomain: When you select this option, the scraper will focus on a specific subdomain, collecting data from all the pages within it. For instance, if you choose docs.nasa.gov, it will explore all the pages under this subdomain exclusively, ignoring pages on nasa.gov or other subdomains like api.nasa.gov. Imagine this as confining the scraping to a single section of a library.

  3. Entire Domain: Opting for this allows the scraper to access all content under the main domain, including its subdomains. Selecting nasa.gov means it can traverse docs.nasa.gov, api.nasa.gov, and any other subdomains present. Think of it as having a pass to explore every room in a building.

  4. All: This is the most extensive scraping option, where the scraper begins at your specified URL and ventures out to any linked pages, potentially going beyond the initial domain. It's akin to setting out on a web expedition with no specific boundary.

Recommendation: Starting with the "Equal and Below" option is advisable for a focused and manageable scrape. If your needs expand, you can re-run the process with broader options as required.
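
To make these scopes concrete, below is a minimal TypeScript sketch of how each option could be checked against a candidate link. It is a hypothetical helper (inScope, baseDomain, and CrawlScope are illustrative names, not UIUC.chat's actual code), and baseDomain naively treats the last two host labels as the registrable domain; a production crawler would consult a public-suffix list.

```typescript
type CrawlScope = 'equal-and-below' | 'same-subdomain' | 'entire-domain' | 'all';

// Naive registrable-domain check: keeps the last two labels (e.g. "nasa.gov").
// Real code should use a public-suffix list to handle hosts like "gov.uk".
function baseDomain(hostname: string): string {
  return hostname.split('.').slice(-2).join('.');
}

function inScope(startUrl: string, candidateUrl: string, scope: CrawlScope): boolean {
  const start = new URL(startUrl);
  const candidate = new URL(candidateUrl);
  switch (scope) {
    case 'equal-and-below':
      // Same host, and the candidate path must extend the starting path.
      // (A robust check would also require a '/' boundary so that
      // /blogs does not match /blogsfoo.)
      return candidate.hostname === start.hostname &&
             candidate.pathname.startsWith(start.pathname);
    case 'same-subdomain':
      // Exactly the same host, e.g. docs.nasa.gov only.
      return candidate.hostname === start.hostname;
    case 'entire-domain':
      // Any host under the same registrable domain, e.g. *.nasa.gov.
      return baseDomain(candidate.hostname) === baseDomain(start.hostname);
    case 'all':
      // No boundary: follow every discovered link.
      return true;
  }
}

// Matches the nasa.gov/blogs example above:
console.log(inScope('https://nasa.gov/blogs', 'https://nasa.gov/blogs/new-rocket', 'equal-and-below')); // true
console.log(inScope('https://nasa.gov/blogs', 'https://nasa.gov/events', 'equal-and-below'));           // false
```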

Backend Code

Web crawling is powered by Crawlee.dev. Our implementation is open source on GitHub.

Happily, I've seen web scraping run at 10 Gbps using 6 cores of parallel JavaScript. It's a performant option even with basic hosting on Railway.app. The baseline 100 MB of memory usage costs $1/mo on Railway, pretty nifty.
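
For illustration only, here is a tiny crawler sketch using Crawlee's public API (the CheerioCrawler class and the enqueueLinks strategy names come from Crawlee's documentation; this is not the UIUC.chat implementation). The strategy values map loosely onto the options above: 'same-hostname' resembles Same Subdomain, 'same-domain' resembles Entire Domain, 'all' resembles All, and a globs pattern can approximate Equal and Below.

```typescript
// Sketch of a Crawlee crawler (npm install crawlee). Not UIUC.chat's code.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 50, // safety cap for a demo run
  async requestHandler({ request, $, enqueueLinks }) {
    console.log(`${request.url}: ${$('title').text()}`);
    // 'same-domain' follows links anywhere under the start URL's domain,
    // roughly the "Entire Domain" option. Alternatives: 'same-hostname'
    // ("Same Subdomain"), 'all' ("All"), or globs for "Equal and Below",
    // e.g. enqueueLinks({ globs: ['https://nasa.gov/blogs/**'] }).
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://crawlee.dev']);
```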