everforward 1 year ago

I'm curious if there's a point where a crawler is misconfigured so badly that it becomes a violation of the CFAA by nature of recklessness.

They say a single crawler downloaded 73TB of zipped HTML files in a month. That averages out to ~29 MB/s of traffic, sustained for an entire month.

Averaging 30 megabytes a second of traffic for a month is crossing into reckless territory. I don't think any sane engineer would call that normal or healthy for scraping a site like ReadTheDocs; Twitter/Facebook/LinkedIn/etc, sure, but not ReadTheDocs.
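A quick back-of-the-envelope check of that figure (a sketch assuming decimal terabytes and a 30-day month):

```python
# Sanity check: 73 TB served in one month, expressed as a sustained rate.
# Assumes decimal units (1 TB = 10^12 bytes) and a 30-day month; with
# binary units (TiB) the figure lands closer to 31 MB/s.
TB = 10**12
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds

avg_bytes_per_second = 73 * TB / SECONDS_PER_MONTH
avg_mb_per_second = avg_bytes_per_second / 10**6

print(f"{avg_mb_per_second:.1f} MB/s sustained")  # ~28.2 MB/s
```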

To me, this crosses into "recklessly negligent" territory, and I think should come with government fines for the company that did it. Scraping is totally fine to me, but it needs to be done either a) at a pace that will not impact the provider (read: slowly), or b) with some kind of prior agreement that the provider is accepting responsibility to provide enough capacity.

While I agree that putting content out into the public means it can be scraped, I don't think that necessarily implies scrapers can do their thing at whatever rate they want. As a provider, there's very little difference to me between getting DDoSed and getting scraped to death; both ruin the experience for users.

simonw 1 year ago

"One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."

Wow. That's some seriously disrespectful crawling.

  • exe34 1 year ago

    at that scale, they could just ship some usb disks and a prepaid return postage.

  • mschuster91 1 year ago

    The more interesting thing for me is: the crawler didn't detect on its own that it had racked up 10TB from one site in one day.

    If I were designing a crawler, I'd keep at least some basic form of tracking, if only to check for people deliberately trolling me by serving me an infinite chain of garbage.

    • bastawhiz 1 year ago

      I've been running guthib.mattbasta.workers.dev for years, and in the past few months it's hit the free limit for requests every day. Some AI company is filling their corpus with exactly that: infinite garbage.

  • stusmall 1 year ago

    Respect to them for not naming names, that's the classy move, but I wouldn't blame them if they did.

    • ryandrake 1 year ago

      Why is it "classy" to not name names when a business (which likely holds itself out as reputable) behaves badly, especially when it behaves in a way that costs you money? Everyone is so vague and coy. These companies are being abusive and reckless. Name and shame!

      • WhyCause 1 year ago

        In the article, they mention that they are working with the crawling company to be reimbursed for the download costs.

        Naming and shaming the company while you're trying to work with them is a real good way to not get what you want.

    • OtherShrezzing 1 year ago

      Naming names isn't really required. The hosts have a $5,000 bandwidth fee, but so do the consumers. There's maybe 10 companies with the financial & compute resources to let a $5,000-per-month-per-website bug run rampant before taking the harvesting service offline.

      Meta/Google/Whoever may benefit from economies of scale, so they're not seeing the full $5,000 their side, but they're hitting tens of thousands of sites with that crawler.

      • energy123 1 year ago

        I thought data egress is much more expensive than downloading it?

      • echoangle 1 year ago

        You know you can hit the data rate they were complaining about by using a residential fiber connection, right? 10 TB per day is about 1 Gigabit continuous if I’m not mistaken. There are probably millions of people who could do this if they wanted to.
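        The arithmetic checks out (a quick sketch, assuming decimal units):

```python
# 10 TB per day expressed as a continuous bit rate.
# Decimal units assumed: 1 TB = 10^12 bytes, 1 Gbit = 10^9 bits.
bits_per_day = 10 * 10**12 * 8
seconds_per_day = 24 * 3600  # 86,400

gbit_per_second = bits_per_day / seconds_per_day / 10**9
print(f"{gbit_per_second:.2f} Gbit/s")  # just under one saturated gigabit link
```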

        • OtherShrezzing 1 year ago

          There's millions of people who could do that to an individual website. There are remarkably few organisations who could do that simultaneously across the top 100,000 or so sites on the internet, which is how readthedocs has encountered this issue.

    • Joel_Mckay 1 year ago

      It is 100% likely a cloud provider IP range.

      They are a persistent source of spam email servers, scrapers, and bot probes.

      The simple reason is the operators quickly dump a host, and the next user is left wondering why their legitimate site is instantly spelunking spam ban lists.

      It is the degenerative nature of cloud services... and unsurprisingly we end up often banning most parts of Digital Ocean, Amazon, Azure, Google, and Baidu.

      Have a wonderful day, =3

      • immibis 1 year ago

        It's the degenerative nature of assuming an IP corresponds to a user. They have not corresponded to users for over a decade. I once discovered I'm banned on my mobile phone connection from at least one app which doesn't know that CGNAT exists (a very poor assumption for mobile phone apps in particular). If you must block IPs, do it as a last resort, make it based on some observable behavior, quickly instated when that behavior occurs, and quickly uninstated when it does not.

        • Joel_Mckay 1 year ago

          Really depends on the use-case, but yeah the response happens in a proportional manner.

          We also follow the tit-for-tat forgiveness policy to ensure old bans are given a second chance. Mostly, we want the nuisance to sometimes randomly work, as it wastes more of their time fixing bugs.

          And note, if a server is compromised and persistently causing a problem... we won't hesitate to black hole an entire country along with the active Tor exit nodes and known proxies lists (the hidden feature in context cookies).

          Have a great day friend, =3

          • immibis 1 year ago

            What I'm getting from this is that you hate having users almost as much as Reddit (which enshittified their website and banned all non-shit mobile apps and all search engines other than Google).

            • Joel_Mckay 1 year ago

              Imagine a world, where people walk into your business with a mask over their face saying horribly abusive things... while pretending they are your neighbors... And poof... they automatically vanish along with their garbage content.

              They may visit again, but are less likely to mess with the platform. Note, cons never buy anything... ever... it is against their temperament.

              I find it interesting that several of the cons on YC are upset by someone else's administrative policies. Reddit should enforce these policies too, or at least drop a country or pirate flag icon beside nasty posts...

              Have a great day, and don't fear the ban hammer friend =3

              • thephyber 1 year ago

                What is the term “cons” you use here?

                • Joel_Mckay 1 year ago

                  In this context, a few users are scraping YC profile information to attempt various scam/wire-fraud schemes against community members.

                  1. masquerading as YC associated organizations

                  2. attempting to extract banking information with classic 419'er tactics

                  3. spamming member contact routes with several phone and email Grifts

                  These folks appear to be operating on a 4 month cycle, and redistributing target leads to other fraudsters.

                  Annoying, but hardly a problem limited to YC if that is your concern.

                  Best of luck, =3

  • Joel_Mckay 1 year ago

    Try page rate-limiting (6 hits a minute is plenty for a human), and then pop up a captcha.

    If they keep hitting the limit 4+ times within an hour, then have fail2ban block the IP for 2 days.
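    A rough sketch of that escalation in application terms (a token bucket at 6 requests/minute, with repeat offenders banned for 2 days). In practice this lives in the firewall or fail2ban rather than Python, and every name and threshold here is illustrative:

```python
import time
from collections import defaultdict

RATE = 6 / 60.0              # refill: 6 tokens per minute
BURST = 6                    # bucket size: 6 page hits before throttling
STRIKE_WINDOW = 3600         # strikes expire after an hour
STRIKES_TO_BAN = 4           # 4+ violations in an hour -> ban
BAN_SECONDS = 2 * 24 * 3600  # 2-day IP ban

buckets = defaultdict(lambda: {"tokens": float(BURST), "last": 0.0})
strikes = defaultdict(list)  # ip -> timestamps of rate-limit violations
banned_until = {}            # ip -> time the ban lifts

def allow(ip, now=None):
    """Return True if this request should be served, False if throttled/banned."""
    now = time.monotonic() if now is None else now
    if banned_until.get(ip, 0.0) > now:
        return False
    bucket = buckets[ip]
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    # Over the limit: record a strike; repeat offenders get a 2-day ban.
    strikes[ip] = [t for t in strikes[ip] if now - t < STRIKE_WINDOW]
    strikes[ip].append(now)
    if len(strikes[ip]) >= STRIKES_TO_BAN:
        banned_until[ip] = now + BAN_SECONDS
    return False
```

    A production version also needs shared state (or kernel-level rules) so the limit holds across server workers.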

    73TB is a fair amount to have on a cloud... usually at >30TiB firms must diversify with un-metered server racks, and CDN providers (traditional for large media files etc.)

    Good luck =3

    • DamonHD 1 year ago

      When the bots come out of cloud services then the IPs are all over the place: it's much harder to do right these days.

      • Joel_Mckay 1 year ago

        Rate limiting firewalls and spider traps also work well...

        There is page referral monitoring, context cookies, and dynamically created link chaff with depth-charge rules.

        One can dance all day friend, but a black-hole is coming for the entire IP block shortly. And unlike many people, some never remove a providers leased block and route until it is re-sold. =3

        • immibis 1 year ago

          I mean, blocking random IP ranges is your prerogative if you don't want to have customers. The scrapers will find ways around, while actual users will be unable to use your site. Residential proxies are something like $5 for 1000.

          • Joel_Mckay 1 year ago

            True, note domestic ISP IP ranges are published, and unless one deals internationally... don't bother serving people that will never buy anything from your firm anyways.

            Domestic "Users" functioning as proxies will be tripping usage limits, and getting temporarily banned. Google does this by the way, try hammering their services and find out what happens.

            Context cookies also immediately flag egregious multi-user routes, and if it is an ISP IP you know it's a problem user. If it is over 15 users an hour per IP, then you can be 100% sure it's a Tor proxy.

            We ban over 243,000 IPs, and have seen zero impact to our bottom line.

            Have a nice day, =)

            • DamonHD 1 year ago

              Those of us running sites for public information rather than sales cannot make the simple cut-off that you do.

              And again, none of this is simple. It has taken me a few weeks to establish a usage mechanism that does catch the worst feed pullers, but it still can hurt legit new users. That is an opportunity cost.

              • Joel_Mckay 1 year ago

                One must assume most user IP edge proxies are compromised hosts. If someone paid for that list they were almost certainly conned, as the black hats regularly publish that content on their forums. These folks want as many users as possible in order to hide their nuisance traffic origin in the traffic noise.

                Allowing users known to have an active RAT or their "proxy friends" on a commercial site is not helping anyone.... especially the victims.

                https://www.youtube.com/watch?v=aCbfMkh940Q

                Worth studying the problem from time to time when you get bored of the antics.

                These folks are generally uninterested in positively contributing to any community, but rather show up to cause trouble for fun and profit.

                User API quotas are popular for a reason. =3

          • EVa5I7bHFq9mnYK 1 year ago

            Exactly, I'm banned or captchaed from half of all web sites these days, because of the AI.

            • Joel_Mckay 1 year ago

              Try updating your web browser, as sites often flag outdated user agent strings hard-coded in many bots/spiders.

              Have a nice day, =3

              • EVa5I7bHFq9mnYK 1 year ago

                I am sure that bots/spiders use the very best and latest and least suspicious user agent strings.

                • DamonHD 1 year ago

                  Some bad actors do indeed use ancient UA strings.

                  • immibis 1 year ago

                    So do some users. Thanks for banning me.

                    • Joel_Mckay 1 year ago

                      The captcha trigger events on many sites will often keep nagging/blocking people till they update.

                      Don't take this trend personally, friend; if we see a fake iPhone sporting >1300 Mbps of bandwidth... then the host is getting permanently banned anyway.

                      Have a wonderful day, =3

    • pests 1 year ago

      6 hits a minute? Are you joking?

      Click around a few times on any of your sites and it looks like I'll be banned?

      Multi tasking? Opening multiple interesting links?

      Like what.

      • Joel_Mckay 1 year ago

        Generally for sites:

        1. Gets incrementally slower until the firewall's user rate-limiting tokens refill the bucket (chokes >6MiB/min bandwidth use, and enforces abnormal-traffic ban rules.)

        2. Pauses serving a page if you spider through 6+ pages a minute (chokes speculative downloading)

        3. If you violate site usage rules 4+ times in the past hour, then you get a 2-day IP ban

        4. If you trip a spider trap, then you get a 5-day ban

        5. If you are issued more than 5 context cookies, then the IP will get spammed with a captcha on every page for 5 days

        6. If you violate any number of additional signatures (Shodan etc.), then your IP block and route get permanently banned. There is only one exception to this rule, and we don't share that with anyone.

        7. The site content navigation is programmatically generated in JavaScript

        8. The legal notice is very real for some people

        Have a nice day friend, =)

        • croemer 1 year ago

          Who is "we"?

          • Joel_Mckay 1 year ago

            Plural of the deployment team admins.

            Don't worry about it friend =3

    • xcv123 1 year ago

      > 73TB is a fair amount to have on a cloud

      From the article:

      "This was a bug in their crawler that was causing it to download the same files over and over again."

      • Joel_Mckay 1 year ago

        Must be a CDN choice issue, as most sites limit per IP daily file downloads, or have an account login with a quota limit.

        People hitting the same files again sounds like a developer testing their code.

  • Faaak 1 year ago

    $5,000 for 73TiB seems excessive, though? Some EU cloud providers would price it 10x cheaper.

    • immibis 1 year ago

      Hetzner is $1 per TB and that's only if they decide your overall usage is excessive and needs to be billed for.

  • immibis 1 year ago

    That's also $73 worth of bandwidth on another server host. Please stop using extremely expensive hosts and then blaming other people for the consequences of your decision.

  • gnfargbl 1 year ago

    10TB/day is, roughly, a single saturated 1Gbit link. In technical bandwidth terms, that is the square root of fuck all.

    The crazy thing here is that the target site is content to pay ~$700 for the volume of traffic that you can move through a single teeny-tiny included-at-no-extra-charge cat5 Ethernet link in one single day. And apparently, they're going to continue doing so.

    • Saris 1 year ago

      Yeah that's an insane price for bandwidth, they need to move providers ASAP if that's the kind of fees they get.

      • bastawhiz 1 year ago

        Hosting documentation shouldn't need that much bandwidth. It's text and zip files full of text. Without bots, that's a very small cost even if the bytes are relatively costly.

DamonHD 1 year ago

Not just AI: here is my current side-quest: https://www.earth.org.uk/RSS-efficiency.html

Over 99% of the bandwidth (and CPU) taken by the biggest podcast / music services simply on polling feeds is completely unnecessary. But ofc pointing this out to them gets some sort of "oh this is normal, we don't care" response because they are big enough to know that eg podcasters need them.

  • immibis 1 year ago

    Can you insert a podcast item that says "you are spamming our server - please stop"?

    • DamonHD 1 year ago

      Not easily on a static server and not without the risk of annoying an actual real human listener!

      If you look at the "Defences" section: https://www.earth.org.uk/RSS-efficiency.html#Hints you'll see there are some things that can be done, such as randomly rejecting a large fraction of requests that don't allow compression (gzip is madly effective on many feed files: it's rude for a client not to allow it). But all these measures take effort to set up, and don't stop the bad bots making the request 100s of times too often. Just responding to each stupid request forces a flurry of packets and wakes up and uses CPU...
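      Two of those defences can be sketched in a few lines (the numbers here are made up, not what any real site uses): repetitive feed XML compresses extremely well, and non-compressing clients can be rejected probabilistically rather than always, so a real human caught in the net usually gets through on retry:

```python
import gzip
import random

# How well gzip does on repetitive feed XML: a fake feed of 500 near-identical items.
feed = (
    "<item><title>Episode</title>"
    "<enclosure url='https://example.org/ep.mp3'/></item>" * 500
).encode()
ratio = len(gzip.compress(feed)) / len(feed)
print(f"compressed to {ratio:.1%} of original size")

REJECT_PROBABILITY = 0.75  # illustrative, not a recommendation

def should_reject(headers, rng=random.random):
    """Randomly reject requests that refuse compression; always serve gzip-capable clients."""
    accepts_gzip = "gzip" in headers.get("Accept-Encoding", "")
    return (not accepts_gzip) and rng() < REJECT_PROBABILITY
```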

  • bastawhiz 1 year ago

    I run pinecast.com. If there was a leaderboard for hn users serving XML, I'd almost certainly be in the top five.

    I don't disagree with your post. But: RSS downloads are at an all time low, and that's a bad thing.

    They're at an all time low because Spotify and Apple both fetch feeds from centralized servers. 1000 subscribers no longer means 24000ish daily feed fetches, it means 48. With keep alive or H2, these services simply don't reconnect. The number of IPs that hit me from Apple, for instance, is probably only double digits.

    Since Apple and Spotify both sit between me and the listeners, they eliminate the privacy that listeners would otherwise enjoy. It also forces podcasters to go to them to find out how many people are subscribed, which means lots of big databases instead of one database that I host for my customers.

    Centralization of feed checking carries huge risks, in my opinion, especially as both Apple and Spotify make moves to also become the hosting providers.

    • DamonHD 1 year ago

      Do you have a CDN between you and Apple / Spotify? Because if you do I think that Apple/Spotify are polling that CDN every few minutes and the CDN is having its bandwidth wasted invisibly, but presumably priced in.

      Also I agree that the re-centralisation is a bad thing, mainly.

      (I'd like to move to email to discuss this further, if possible: I have an arXiv paper to write!)

      • bastawhiz 1 year ago

        The data I'm giving you is based on logs from the CDN. Most feeds are checked by Apple and Spotify every hour, but usually it's less frequently rather than more: shows that haven't been published to in a year or more might see very infrequent feed checks.

mikae1 1 year ago

HellPot: https://github.com/yunginnanet/HellPot

> Clients (hopefully bots) that disregard robots.txt and connect to your instance of HellPot will suffer eternal consequences. HellPot will send an infinite stream of data that is just close enough to being a real website that they might just stick around until their soul is ripped apart and they cease to exist.

  • levkk 1 year ago

    This doesn't solve their bandwidth costs, which are the real problem with these bots.

    • jesprenj 1 year ago

      My 100 mbps upload bandwidth at home is free (apart from the monthly 35€ payment). Useless bots will get stuck downloading from me instead of hogging readthedocs.

      • bastawhiz 1 year ago

        I think your ISP will cut you off long before they stop pulling content from you

Venn1 1 year ago

I blocked Microsoft/OpenAI a few weeks ago for (semi) childish reasons. Seven months later, Bing still refuses to index my blog, despite scraping it daily. The AI scrapers and crawlers toggle on Cloudflare did the trick.

winddude 1 year ago

Not only that, even commoncrawl had issues (about a year ago) where AWS couldn't keep up with the demand for downloading the WARCs.

As someone who has written a lot of crawling infrastructure and managed large-scale crawling operations, I can say respectful crawling is important.

That being said it always seems like google has had a massively unfair advantage for crawling not only with budget but with brandname, and perceived value. It sometimes felt hard to reach out to websites and ask them to allow our crawlers, and grey tactics were often used. And I'm always for a more open internet.

I think regular releases of content in a compressed format would go a long way, but there would always be a race for having the freshest content. What might be better is offering the content in machine format, XML or JSON or even SOAP. Which is usually better for what the sites crawling want to achieve, cheaper for you to serve, and cheaper and less resource intensive compared to crawling. (Have them "cache" locally by enforcing rate limiting and signup)

  • apantel 1 year ago

    > That being said it always seems like google has had a massively unfair advantage for crawling not only with budget but with brandname, and perceived value.

    VCs and other startup culture evangelists are always challenging founders to figure out what their ‘unfair advantage’ is.

    That’s the name of the game.

pants2 1 year ago

While the crawling is disrespectful, it seems RTD could find a cheaper host for their files. At my work we have a 10G business fiber line and serve >1PB per month for around $1,500. Takes 90% of the load off our cloud services. Took me just a couple weeks to set up everything.

  • stusmall 1 year ago

    They normally aren't serving from webservices but from subsidized CDNs.

  • persedes 1 year ago

    If I understood correctly, they have a CDN that normally takes care of it, there were just some links that were not ported / covered by the CDN yet?

exhaze 1 year ago

Having built an AI crawler myself for first party data collection:

1. I intentionally made sure my crawler was slow (I prefer batch processing workflows in general, and this also has the effect of not needing a machine gun crawler rate)

2. For data updates, I made sure to first do a HEAD request and only access the page if it has actually been changed. This is good for me (lower cost), the site owner, and the internet as a whole (minimizes redundant data transfer volume)
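Point 2 can be sketched like this. The validator headers are standard HTTP, but the function and storage layout are illustrative; a real crawler might instead issue a conditional GET with If-None-Match and treat a 304 as "unchanged":

```python
def page_changed(head_headers, stored):
    """Decide from a HEAD response whether a full re-fetch is needed.

    head_headers: response headers from the HEAD request
    stored: validators saved from the previous crawl of this URL
    """
    etag = head_headers.get("ETag")
    modified = head_headers.get("Last-Modified")
    if etag is not None and stored.get("etag") is not None:
        return etag != stored["etag"]
    if modified is not None and stored.get("last_modified") is not None:
        return modified != stored["last_modified"]
    return True  # no validators to compare: assume it changed
```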

Regarding individual site policies, I feel there’s often a “tragedy of the commons” dilemma for any market segment subject to aggregator dominance:

- individual sites often aggressively hide things like pricing information and explicitly disallow crawlers from accessing them

- humans end up having to access them: this results in a given site either not being included at all, or accessed once but never reaccessed, causing aggregator data to go stale

- aggregators often outrank individual sites due to better SEO and likely human preference of aggregators, because it saves them research time

- this results in the original site being put at a competitive disadvantage in SEO, since their product ends up not being listed, or listed with outdated/incorrect information

- that sequence of events leads to negative business outcomes, especially for smaller businesses who often already have a higher chance of failure

Therefore, I believe it’s important to have some sort of standard policy that is implemented and enforced at various levels: CDNs, ISPs, etc.

The policy should be carefully balanced to consider all these factors as well as having a baked in mechanism for low friction amendment based on future emergent effects.

This would result in a much better internet, one that has the property of GINI regulation, ensuring well-distributed outcomes that are optimized for global socioeconomic prosperity as a whole.

Curious to hear others’ perspectives about this idea and how one would even kick off such an ambitious effort.

int3 1 year ago

Shouldn't all sites have some kind of bandwidth / cost limiting in place? Not to say that AI crawlers shouldn't be more careful, but there are always malicious actors on the internet; it seems foolish not to have some kind of defense in place.

  • DamonHD 1 year ago

    It's harder to do right than you think. The first dynamic bandwidth (and concurrent connection) limiter that I wrote was partly to protect a site against Google!

  • jsheard 1 year ago

    The big three cloud providers (AWS/GCP/Azure) have collectively decided that you don't want to set a spending limit actually, so they simply don't let you.

    • immibis 1 year ago

      The big three cloud providers are the most expensive by a factor of 10-100x, and shouldn't be used under any circumstances unless you really, really need specific features from them.

    • Saris 1 year ago

      Isn't running a webserver on those kind of a silly idea for that reason?

  • bo1024 1 year ago

    They say this:

    > We have IP-based rate limiting in place for many of our endpoints, however these crawlers are coming from a large number of IP addresses, so our rate limiting is not effective.

    Do you have something else in mind? Just shut down the whole site after a certain limit?

mateozaratefw 1 year ago

TikTok's crawler fucked us up by taking a product name (e-commerce) and recursively inserting it into the search bar via the results pages. Respect the game, but not respecting the robots.txt crawl delay is awful.

troupo 1 year ago

What amazes me is that none of this is surprising: all this behavior (not just what's described in the post) is on par with what these companies are doing, and have been doing for decades... And yet there will be many people, including here on HN, who will just cheer these companies on because they spit out an "opensource model" or a 10-dollars-a-month subscription.

  • influx 1 year ago

    Do you feel the same way about Google spidering for their commercial search engine?

    • BadHumans 1 year ago

      Google sent traffic to your website. You were at least getting something in return.

      • ToucanLoucan 1 year ago

        Emphasis on were, since they've made their search such utter shit in the quest for ad revenue that they're now going to have their own AI sum up your results (badly) instead to attempt to solve the problem they created.

      • hmry 1 year ago

        Yeah, and when Google stopped sending traffic, people sued them, as they should.

    • LukeShu 1 year ago

      I don't.

      Just 3 AI spiders put more load on our servers than all search engine spiders and all human traffic combined.

      Some numbers I have handy from before I blocked the bots:

      ClaudeBot drove more requests through our Redmine in a month than it saw in the combined 5 years prior to ClaudeBot.

      Bytespider accounted for 59% of the total traffic to our Git server.

      Amazonbot accounted for 21% of the total traffic to our Git server.

      Google has never even been close to breaking out of the single-digit-percentages of any metric.

      • DamonHD 1 year ago

        Generally Googlebot is well behaved and efficient these days, though I have discovered that it is currently horribly broken around 429 / 503 response codes... And pays no attention to Retry-After either... Also Google-Podcast which is meant to have been turned off!

      • o11c 1 year ago

        Someone needs to start adding all these AI's homepages to the browser "malware" lists.

    • Saris 1 year ago

      Not originally since they just sent people to the site.

      But these days where they just rip content from the site to give people as answers, completely depriving the site of traffic, yeah that seems basically just as bad as the AI bots.

  • talldayo 1 year ago

    The situation didn't change when it was search index crawlers being called-out. At the end of the day, this sort of "abuse" is native to the world of the internet; like you said, it's decades old at this point.

    HN will cheer on a lot of things that are counter-intuitive to their wellbeing; open-weight models doesn't feel like one of them. You can't protest AI (or search engines) because after long enough people can't do their job without them. The correct course of action is to name-and-shame, not write pithy engineering blogs begging people to stop. People won't stop.

    • troupo 1 year ago

      > HN will cheer on a lot of things that are counter-intuitive to their wellbeing; open-weight models doesn't feel like one of them.

      Except for the fact that they come from undisclosed sources from a company that does this: https://x.com/Tantacrul/status/1794863603964891567

      • talldayo 1 year ago

        And I'll be damned if Tanta isn't having his tweets used for training Elon's AI. The parochial circle regresses as it goes around, I'm done acknowledging the make-believe barriers we pretended the internet clung to.

        You post it, others consume it. Same as it ever was.

  • ToucanLoucan 1 year ago

    Exploitation of the common man being a key ingredient of a product has never once inspired any actual consumer revolt. Fundamentally, no matter what they say, as long as they get their fleeting hit of dopamine from buying/using a thing, people just don't care.

    As Squidward says: nobody gives a care for the fate of labor as long as they get their instant gratifications.

johneth 1 year ago

> "One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."

Invoice the abusers.

They're rolling in investor hype money, and they're obviously not spending it on competent developers if their bots behave like this, so there should be plenty left to cover costs.

bo1024 1 year ago

It would be nice to have more of a peer-to-peer infrastructure (torrent inspired) for serving big resources.

  • immibis 1 year ago

    Never going to happen due to misaligned incentives. In 2024 everyone wants to keep their data behind lock and key. The commons is gone. Just look at the Google/Reddit deal.

RecycledEle 1 year ago

I suspect the AI companies are as careless with their training as they are with their web scraping.

Joel_Mckay 1 year ago

Had a conversation with a firm that wanted a distributed scraper built, and they really did not care about site usage policies.

You would be fooling yourselves if you think such a firm cared about robots.txt or page tags.

We warned them they would eventually be sued, told them to contact the site owners for legal access to the data, and issued a hard pass on the project. They probably assumed that if the indexing process ran out of another jurisdiction, their domestic firm wouldn't be liable for theft of service or copyright infringement.

It was my understanding AI/ML does not change legal obligations in business, but the firm probably found someone to build that dubious project eventually...

Spider traps and rate-limiting are good options too. =3

  • reaperman 1 year ago

    Robots.txt doesn't create a legal obligation. It’s just a set of rules saying “if you don’t follow these rules to politely crawl our site, we’ll block you from crawling our site”.

    Obviously “anything goes” in civil suits however - if someone is being absurdly egregious with their crawling there’s usually some exposure to one tort or another.

    • Joel_Mckay 1 year ago

      The posted site access/usage policy is legally enforceable in most jurisdictions as far as I know...

      And Reddit has definitely become more proactive about scrapers. ;-)

      • echoangle 1 year ago

        Legally enforceable as in “you can block people who don’t follow the policy” or enforceable as in “you can sue them for money”?

        • Joel_Mckay 1 year ago

          If I recall correctly, it is considered theft-of-service if you bypass the posted site usage terms with an agent like a spider, and certainly a copyright violation for unauthorized content usage (especially in the context of a commercial venture.)

          One may be sued, but not because you parsed robots.txt wrong =3

          • reaperman 1 year ago

            > it is considered theft-of-service if you bypass the posted site usage terms

            My understanding is that this is not accurate.

            HiQ v LinkedIn established that this is only the case if you actually agreed to the terms of service. Such "agreement" only happens if the information is walled behind an account creation process, e.g. Facebook, Inc. v. Power Ventures, Inc. If it's just scraping publicly available webpages, the only legal issue with scraping would be unreasonably or obviously negligent scraping practices which lead to degradation or denial-of-service. And obviously the line for that would have to be determined in civil court.

            eBay v. Bidder's Edge (2000) is the last case that I could find which even considered violation of robots.txt, and only as a very minor part of the judgement; the findings were based far more on other things. Intel Corp. v. Hamidi also implicitly overruled the judgement in that case (though Hamidi was not related to robots.txt, which was really just a very minor point in the first place).

            • Joel_Mckay 1 year ago

              Hard to say. I seem to recall it was because some spider authors used session cookies to bypass the EULA (the page probe auto-clicks "I agree" to capture the session cookie), and faked user-agent strings to spoof Googlebot to gain access to site content.

              One thing is for certain: it's jurisdictional... and way too messy to be responsible for maintaining/hosting (the ambiguous copyright protection outside a research context looked way too risky). =3

      • reaperman 1 year ago

        Please review HiQ vs. LinkedIn - it hinged on the fact that HiQ hired crowdsourced workers (“turkers”) to create fake profiles through which to access LinkedIn’s platform (who had to agree to the ToS to create these accounts). The court found that hiQ expressly agreed to the user agreement when it created its corporate account on LinkedIn’s platform.

        This doesn't apply if you don't ever agree to anything - which is the case if the information is not locked behind account creation.

        • Joel_Mckay 1 year ago

          This gets complex fast, as a click-army is not necessarily violating the EULA.

          However, if they scraped the content using those account credentials, then it becomes a problem in a commercial context.

          If I recall, only journalists and academics could argue Fair use at that point.

          Anyway, I didn't touch the project mostly for copyright and trademark risk concerns.

          Have a great day =3

  • Atotalnoob 1 year ago

    AI/ML has created a race to suck all the data on the internet regardless of copyright or status and use it.

    OpenAI introduced gptbot in August 2023… they already took everything

    • Joel_Mckay 1 year ago

      Unlikely, site-generator hosts are still happily providing a limitless supply of remixed well-structured nonsense, random images with noise, and valid links to popular sites.

      In this case, they showed up to the data buffet long after it went rotten due to SEO.

      Have a nice day =3
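[The "limitless supply of remixed well-structured nonsense" described above is easy to produce: a deterministic page generator whose links point back into its own infinite URL space. A toy sketch, purely illustrative and not from any project mentioned in this thread (all names here are mine):]

```python
import hashlib
import random

def garbage_page(path: str, links: int = 5) -> str:
    """Deterministically generate an HTML page of nonsense whose links
    point back into the same infinite space. Seeding the RNG from the
    path means the same URL always yields the same page, so a crawler
    can't detect the trap by re-fetching."""
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    words = ["data", "cloud", "synergy", "pipeline", "quantum", "docs", "api"]
    body = " ".join(rng.choice(words) for _ in range(200))
    hrefs = "".join(
        f'<a href="/{rng.getrandbits(64):x}">more</a> ' for _ in range(links)
    )
    return f"<html><body><p>{body}</p>{hrefs}</body></html>"
```

[Every page links to 5 fresh pseudo-random URLs, so a naive breadth-first crawler expands forever while the server does almost no work.]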

bakugo 1 year ago

The words "AI" and "respectful" don't belong in the same sentence. The mere concept of generative AI is disrespectful.

  • apantel 1 year ago

    The concept of generative AI is not disrespectful. You could have generative AI trained on licensed data - would that be disrespectful?

asdasdsddd 1 year ago

Why can't you just rate-limit non-browser user agents very aggressively, if your primary audience is human?
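[The split-by-user-agent scheme suggested here can be sketched with a per-client token bucket. A minimal illustration, with rate numbers chosen arbitrarily (and the obvious caveat, raised elsewhere in the thread, that misbehaving crawlers routinely spoof browser user agents):]

```python
import time

# Crude heuristic: anything not claiming a common browser engine gets
# the aggressive limit. Bots can spoof this, so it is only a first line
# of defense, not a substitute for per-IP limits.
BROWSER_HINTS = ("Mozilla", "Chrome", "Safari", "Firefox", "Edg")

def is_browser_like(user_agent: str) -> bool:
    return any(hint in user_agent for hint in BROWSER_HINTS)

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def check_request(client_ip: str, user_agent: str) -> bool:
    """Return True if the request should be served, False if throttled."""
    key = f"{client_ip}:{'browser' if is_browser_like(user_agent) else 'bot'}"
    if key not in buckets:
        # Browser-like clients: 50 req/s, burst 100.
        # Everything else: 1 req/s, burst 5. Numbers are illustrative.
        if is_browser_like(user_agent):
            buckets[key] = TokenBucket(rate=50, burst=100)
        else:
            buckets[key] = TokenBucket(rate=1, burst=5)
    return buckets[key].allow()
```

[In production this logic usually lives in the reverse proxy (e.g. nginx's request-limiting module) rather than the application, but the shape is the same.]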

guhcampos 1 year ago

We all know where this is headed. In 5 years, every piece of useful content on the Web will be behind a paywall.

andrei_says_ 1 year ago

They will be once we legislate that respect. Not a second earlier.

surfingdino 1 year ago

Sites need to start suing crawler operators for bandwidth costs.

  • echoangle 1 year ago

    How is that supposed to work? Under which law will you force me to pay for using your public website, especially if I'm not in your country? Just put up a captcha and block crawlers; you're not going to get them to pay you.

joshu 1 year ago

everything old is new again. i remember when someone at google started aggressively crawling del.icio.us from a desktop machine and i ended up blocking all employees...

Pesthuf 1 year ago

Have capitalists ever stopped just because their actions (that make them money) hurt others? Because the consequences of the damage they cause might in the end hurt them, too?

Hoping they just stop seems futile…

xyst 1 year ago

So much waste. Even worse than the digital currency rush.

croemer 1 year ago

Just 2 buggy crawlers doesn't seem like that many. Sure, they each had a large impact, but given that there are likely hundreds if not thousands of such crawlers out there, it's a rather small number. It seems that most crawlers are actually respectful.

  • Centigonal 1 year ago

    How many foundation AI companies are there? 2 is a pretty big chunk of that pie.

    • corbet 1 year ago

      I think there are a thousand wannabe companies all trying to suck up as much data as they can; not a sustainable situation in any way.

      • PaulHoule 1 year ago

        There was a paper about webcrawlers circa 2000 that pointed out that the vast majority of academics who ran webcrawlers never published a paper based on their work.

        • morkalork 1 year ago

          Sounds like all the kids who want to make a video-game and start with building a game engine.

    • croemer 1 year ago

      What makes you think that only a dozen or so foundation AI companies are scraping?

  • PaulHoule 1 year ago

    I used to run a site with a huge number of pages that had high running costs but low revenue.

    The only web crawler that did anything for me was Google, as Google sent an appreciable amount of traffic. Referrers from Bing were almost undetectable: the joke among my black hat SEO friends at the time was that you could rank for money keywords like "buy wow gold" and get 10 hits. Then there were the Chinese crawlers like Baidu that would crawl at 10x the rate of Google but send zero referrers. And then there were crawlers looking for copyrighted images that cost me money to accommodate even if they never sent me cease and desist letters.

    As much as I hate the Google monopoly I couldn't afford having my site crawled like that without any benefit to me.

    It's an awful situation for the long term though because it prevents new entrants. Right now I am thinking about a new search engine for a vertical where a huge number of products are available from different vendors and when you do find results from Google they are sold out at least 70% of the time. I hate to think it's going to get harder to make something.