Show HN: Partial content web crawling using HTTP/2 and Go
altayakkus.substack.com

Hi, I wrote a low-level HTTP/2 web crawler in Go which can scrape partial content to save traffic.
Tl;dr: the HTML of a YouTube video page, for example, contains the video description, views, likes, etc. in its first 600 KB; the remaining 900 KB are of no use to me, but I have to pay my proxies by the gigabyte.
My crawler receives the response packet by packet, and once I have everything I need it resets the stream, so I only pay for what I actually crawled.
This is also potentially useful for large-scale crawling operations where duplicates matter: you could compute a simHash on the fly and reset the stream before downloading the entire document (again).