Crawling, or spidering, is how a web crawler browses the internet automatically. It starts with a single webpage (a seed URL), collects the links on it, and follows those links to new pages, repeating the process to gather information.
There are two primary types of crawling strategies:
Breadth-First Crawling
Breadth-first crawling explores a website's breadth first: it crawls all links on the seed page before moving on to the next level of links. This helps build a broad view of the site's structure and content.
flowchart LR
A(Seed URL) --> B(Page 2)
B --> C(Page 4)
B --> D(Page 5)
A(Seed URL) --> E(Page 3)
E --> F(Page 6)
E --> G(Page 7)
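To make the breadth-first order concrete, here is a minimal Python sketch of the idea. It is illustrative only and assumes the third-party requests and beautifulsoup4 libraries are installed; the seed URL and the max_pages limit are placeholders, and a real crawler would also respect robots.txt and rate limits.

# Minimal breadth-first crawler sketch (illustrative, not production code).
# Assumes: pip install requests beautifulsoup4; max_pages is an arbitrary safety cap.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed_url, max_pages=50):
    visited = set()
    queue = deque([seed_url])               # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()                # oldest (shallowest) URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # stay on the seed's host and skip pages already crawled
            if urlparse(link).netloc == urlparse(seed_url).netloc and link not in visited:
                queue.append(link)
    return visited

print(bfs_crawl("https://example.com"))

Because every link found on a page is queued before any of those links are fetched, pages are visited level by level, matching the order shown in the diagram above.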
Depth-First Crawling
Depth-first crawling goes deep first: it follows a single chain of links as far as possible before backtracking to explore other paths. This helps find specific content or reach pages buried deep within a website.
flowchart LR
A(Seed URL) --> B(Page 2)
B --> C(Page 3)
C --> D(Page 4)
D --> E(Page 5)
E --> A
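The depth-first variant differs only in the data structure: replacing the queue with a stack makes the crawler always follow the newest link it has found, so one path is exhausted before it backtracks. The sketch below carries the same assumptions as the breadth-first one (requests and beautifulsoup4 installed, placeholder seed URL and page limit).

# Minimal depth-first crawler sketch (illustrative, not production code).
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def dfs_crawl(seed_url, max_pages=50):
    visited = set()
    stack = [seed_url]                       # LIFO stack gives depth-first order
    while stack and len(visited) < max_pages:
        url = stack.pop()                    # most recently discovered URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == urlparse(seed_url).netloc and link not in visited:
                stack.append(link)           # newest links are explored before older ones
    return visited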
Hakrawler
Hakrawler is a fast web crawler written in Go for gathering URLs and JavaScript file locations.
# Single URL
echo https://example.com | hakrawler
# Multiple URLs
cat urls.txt | hakrawler
# Timeout for each line of stdin after 5 seconds
cat urls.txt | hakrawler -timeout 5
# Send all requests through a proxy
cat urls.txt | hakrawler -proxy http://localhost:8080
# Include subdomains
echo https://example.com | hakrawler -subs
Katana
Katana is a fast crawler designed to run in automation pipelines, offering both headless and non-headless crawling.
# Single URL
katana -u https://example.com
echo https://example.com | katana
# Multiple URLs
katana -u https://example.com,https://example.net
katana -list urls.txt
# Maximum crawl depth of 5
katana -u https://example.com -d 5
# Parse JavaScript files for endpoints
katana -u https://example.com -jc