Crawling, or spidering, is how a web crawler browses the internet automatically. It starts with a single webpage (a seed URL), collects the links on it, and follows those links to new pages, repeating the process to gather information.
There are two primary types of crawling strategies:
Breadth-First Crawling
Breadth-first crawling explores a website's breadth first: it crawls all links on the seed page before moving on to the next level of links. This helps build a broad view of the site's structure and content.
flowchart LR
A(Seed URL) --> B(Page 2)
B --> C(Page 4)
B --> D(Page 5)
A(Seed URL) --> E(Page 3)
E --> F(Page 6)
E --> G(Page 7)
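To make the breadth-first order concrete, here is a minimal Python sketch of the idea. It is illustrative only and assumes the third-party requests and beautifulsoup4 libraries are installed; the seed URL and the max_pages limit are placeholders, and a real crawler would also respect robots.txt and rate limits.

# Minimal breadth-first crawler sketch (illustrative, not production code).
# Assumes: pip install requests beautifulsoup4; max_pages is an arbitrary safety cap.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed_url, max_pages=50):
    visited = set()
    queue = deque([seed_url])               # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()                # oldest (shallowest) URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # stay on the seed's host and skip pages already crawled
            if urlparse(link).netloc == urlparse(seed_url).netloc and link not in visited:
                queue.append(link)
    return visited

print(bfs_crawl("https://example.com"))

Because every link found on a page is queued before any of those links are fetched, pages are visited level by level, matching the order shown in the diagram above.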
Depth-First Crawling
Depth-first crawling goes deep first: it follows a single chain of links as far as possible before backtracking to explore other paths. This helps find specific content or reach pages buried deep within a website.
flowchart LR
A(Seed URL) --> B(Page 2)
B --> C(Page 3)
C --> D(Page 4)
D --> E(Page 5)
E --> A
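The depth-first variant differs only in the data structure: replacing the queue with a stack makes the crawler always follow the newest link it has found, so one path is exhausted before it backtracks. The sketch below carries the same assumptions as the breadth-first one (requests and beautifulsoup4 installed, placeholder seed URL and page limit).

# Minimal depth-first crawler sketch (illustrative, not production code).
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def dfs_crawl(seed_url, max_pages=50):
    visited = set()
    stack = [seed_url]                       # LIFO stack gives depth-first order
    while stack and len(visited) < max_pages:
        url = stack.pop()                    # most recently discovered URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == urlparse(seed_url).netloc and link not in visited:
                stack.append(link)           # newest links are explored before older ones
    return visited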
Hakrawler
Hakrawler is a fast web crawler written in Go for gathering URLs and JavaScript file locations.
# Single URL
echo https://example.com | hakrawler
# Multiple URLs
cat urls.txt | hakrawler
# Timeout for each line of stdin after 5 seconds
cat urls.txt | hakrawler -timeout 5
# Send all requests through a proxy
cat urls.txt | hakrawler -proxy http://localhost:8080
# Include subdomains
echo https://example.com | hakrawler -subs
Katana
Katana is a fast crawler designed to run in automation pipelines, offering both headless and non-headless crawling.
# Single URL
katana -u https://example.com
echo https://example.com | katana
# Multiple URLs
katana -u https://example.com,https://example.net
katana -list urls.txt
# Maximum crawl depth of 5
katana -u https://example.com -d 5
# Parse JavaScript files for endpoints
katana -u https://example.com -jc