Show HN: Partial content web crawling using HTTP/2 and Go
altayakkus.substack.com

Hi, I wrote a low-level HTTP/2 web crawler in Go which can scrape partial content to save traffic.
Tl;dr: the HTML of a YouTube video page, for example, contains the video description, views, likes, etc. in its first 600 KB; the remaining 900 KB are of no use to me, but I have to pay my proxies by the gigabyte.
My crawler receives the response packet by packet, and once I have everything I need it resets the stream, so I only pay for what I actually crawled.
This is also potentially useful for large-scale crawling operations where duplicates matter: you could compute a simHash on the fly and reset the stream before downloading the entire document (again).