LinuxBender 3 hours ago

User-agent aside, there are usually small details bots leave out, unless they are using headless Chrome of course. Most bots can't do HTTP/2, yet all common browsers support it. Most bots will not send a Sec-Fetch-Mode header (cors, no-cors, navigate), whereas browsers do. Some bots do not send an Accept-Language header. Those are just a few things one can look for and deal with in simple web server ACLs. Some bots do not support HTTP keep-alive, though dropping connections that lack keep-alive can also knock out some poorly behaved middleboxes.
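
A rough sketch of those header checks as an application-level predicate (Python; Sec-Fetch-Mode and Accept-Language are the real header names from above, but the function shape and the example calls are just illustrative):

    def looks_like_browser(headers: dict) -> bool:
        # headers: lower-cased request header names -> values
        # Browsers always send Sec-Fetch-Mode with one of a small set of values.
        mode = headers.get("sec-fetch-mode", "").lower()
        if mode not in {"navigate", "cors", "no-cors", "same-origin", "websocket"}:
            return False
        # Many bots skip Accept-Language entirely.
        if "accept-language" not in headers:
            return False
        return True

    looks_like_browser({"sec-fetch-mode": "navigate", "accept-language": "en-US"})  # True
    looks_like_browser({"user-agent": "python-requests/2.31"})                      # False

The HTTP/2 and keep-alive checks live in the proxy or web server config rather than in application code, so they are not represented here.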

At the TCP layer, some bots do not set the MSS option or use very strange values. This can run into false positives, so I just don't publish IPv6 records for my web servers and then limit IPv4 to an MSS range of 1280 to 1460, which knocks out many bots.
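
If you want to see which MSS values clients actually send before committing to a rule like that, a passive check along these lines works (a sketch using scapy, a third-party library that needs root; the 1280-1460 window is the one mentioned above):

    from scapy.all import sniff, IP, TCP

    def inspect_syn(pkt):
        # The MSS option only appears on SYN packets, so inspect those.
        if pkt.haslayer(IP) and pkt.haslayer(TCP) and (pkt[TCP].flags & 0x02):
            mss = dict(pkt[TCP].options).get("MSS")
            if mss is None or not 1280 <= mss <= 1460:
                print(f"odd SYN from {pkt[IP].src}: MSS={mss}")

    sniff(filter="tcp[tcpflags] & tcp-syn != 0", prn=inspect_syn, store=0)

The actual drop still belongs in the firewall; this just shows whether the 1280-1460 window would catch anything you care about before you enable it.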

There is always the possibility of false positives, but they can be logged, reviewed, and written off as acceptable losses should the load on the servers get too high. Another mitigating control is to analyze previous logs and use maps to exclude people who post on a regular basis or have logins to the site, assuming none of them are part of the problem. If a registered user is part of the problem, give them an error page after {n} requests.
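
As a sketch of that exclusion map plus the per-user cap (Python; the log format, the "regular poster" threshold, and the value standing in for {n} are all placeholders):

    from collections import Counter, defaultdict

    N = 500  # stand-in for {n}

    def build_allowlist(log_lines, min_days=5):
        # Assumes pre-parsed log lines of: date user method path status
        days_posting = defaultdict(set)
        for line in log_lines:
            date, user, method, *_ = line.split()
            if user != "-" and method == "POST":
                days_posting[user].add(date)
        # Users who post on several distinct days are presumed legitimate.
        return {u for u, days in days_posting.items() if len(days) >= min_days}

    requests_seen = Counter()

    def serve_error_page(user, allowlist):
        # Users on the allowlist are never throttled; anyone else (including a
        # registered user dropped from the list for misbehaving) gets the error
        # page once they pass N requests.
        requests_seen[user] += 1
        return user not in allowlist and requests_seen[user] > N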

teeray 3 hours ago

We need some kind of fail2ban for AI scrapers. Fingerprint them, then share the fingerprint databases via torrent or something.
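
For the fingerprint part, even hashing the stable request properties would give you something shareable (illustrative only; which fields to include, and how the database gets distributed, are left open):

    import hashlib, json

    def request_fingerprint(headers: dict, tls_ja3: str = "") -> str:
        # Header presence/order and a TLS fingerprint (if the front end exposes
        # one) tend to be stable per scraper implementation.
        material = {
            "header_names": [h.lower() for h in headers],
            "user_agent": headers.get("User-Agent", ""),
            "ja3": tls_ja3,
        }
        return hashlib.sha256(json.dumps(material, sort_keys=True).encode()).hexdigest()

The hard part is the trust and distribution model for the shared database, not the hash itself.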

  • michaelcampbell 29 minutes ago

    For THIS application, would a boring rate-limiter not help? I mean it won't get rid of the DOS part of this right off, but these are not agents/bots MEANING to DOS a site, so if they get a 429 often enough they should back off on their own, no?
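
    A minimal sketch of such a limiter as WSGI middleware (the window, limit, and Retry-After value are arbitrary placeholders):

      import time
      from collections import defaultdict, deque

      WINDOW, LIMIT = 10, 20  # seconds, requests per client IP

      class RateLimit:
          def __init__(self, app):
              self.app = app
              self.hits = defaultdict(deque)

          def __call__(self, environ, start_response):
              ip = environ.get("REMOTE_ADDR", "unknown")
              now = time.time()
              q = self.hits[ip]
              while q and now - q[0] > WINDOW:
                  q.popleft()
              q.append(now)
              if len(q) > LIMIT:
                  # A well-behaved client sees 429 + Retry-After and backs off.
                  start_response("429 Too Many Requests",
                                 [("Retry-After", str(WINDOW)),
                                  ("Content-Type", "text/plain")])
                  return [b"too many requests\n"]
              return self.app(environ, start_response)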