🕷️ Web Crawler
● Beginner · 20–30 min
Problem
Build a scalable web crawler that discovers, downloads, and indexes web pages across the internet. It must crawl billions of pages while respecting robots.txt, avoiding duplicate URLs, and storing crawled content for a search index.
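At its core the crawl is a breadth-first traversal of the link graph: pop a URL from a frontier queue, download it, extract links, and enqueue any URL not seen before. A minimal single-process sketch of that loop is below; `fetch` is a hypothetical callable returning `(html, links)` (real link extraction via an HTML parser, politeness delays, and error handling are elided):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch, max_pages=100):
    """BFS crawl sketch: pop a URL, fetch it, enqueue unseen links.

    `fetch(url)` is assumed to return (html, links), where links may be
    relative; they are resolved against the page URL with urljoin.
    """
    frontier = deque(seeds)
    seen = set(seeds)          # at billions of URLs this becomes a shared store
    pages = {}                 # stand-in for object storage
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html, links = fetch(url)
        pages[url] = html
        for link in links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

In the distributed version, `frontier` becomes a partitioned queue service and `seen` a deduplication store, but the loop shape is the same.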
Functional Requirements
- Seed with initial URLs and discover new links from crawled pages
- Respect robots.txt and crawl-delay directives
- Deduplicate URLs — never crawl the same page twice
- Store raw HTML and extracted metadata for downstream indexing
- Support priority crawling (important pages crawled more frequently)
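The deduplication requirement hinges on URL canonicalization: `HTTP://Example.com:80/a/` and `http://example.com/a` should map to one frontier entry. A minimal normalizer using only the standard library is sketched below; the exact rule set (which ports to drop, how to treat trailing slashes) is an assumption, not a standard:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so trivially different spellings dedupe to one key:
    lowercase scheme and host, drop the fragment, strip default ports and
    trailing slashes on the path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only when it is not the scheme's default.
    if parts.port and not ((scheme == "http" and parts.port == 80) or
                           (scheme == "https" and parts.port == 443)):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # "" drops fragment
```

At the 10B-page scale, an exact in-memory `seen` set does not fit on one machine; the usual trade-off is a Bloom filter (tiny, but its false positives skip some pages) or a URL store partitioned by host hash (exact, at the cost of a network lookup).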
Non-Functional
- 10B pages crawled over 30 days (~4K pages/sec)
- Content stored in object storage (petabyte scale)
- < 5% duplicate crawl rate
- Horizontal scaling — add workers without downtime
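The throughput target above is worth sanity-checking with back-of-envelope arithmetic. Assuming a 100 KB average page (an assumption; the brief does not state page size), the numbers work out as follows:

```python
# Back-of-envelope check of the stated targets.
pages = 10_000_000_000              # 10B pages
seconds = 30 * 24 * 3600            # 30-day crawl window
rate = pages / seconds              # ~3,858 pages/sec -> matches "~4K/sec"

avg_page_bytes = 100 * 1024         # assumed 100 KB of HTML per page
storage_pb = pages * avg_page_bytes / 1024**5   # ~0.9 PB of raw HTML
```

So sustaining the target with workers handling, say, 50 concurrent fetches each at ~1 s per fetch implies on the order of 80 workers, and raw HTML alone approaches a petabyte, consistent with the object-storage requirement.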
Prerequisites
Queues · Distributed systems basics · URL deduplication