I've been working on a web crawler and have been trying to make it as friendly as possible: strictly checking robots.txt, crawling slowly, clear identification in the User-Agent string, a single source IP address. But I've noticed some anti-bot tricks getting applied to the robots.txt file itself. The latest was a slow-loris approach where it takes forever for robots.txt to download. I accidentally treated this as a 404, which then meant I continued to crawl that site. I had to change the code so that a robots.txt timeout is treated like a Disallow /.
It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
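Something along these lines, as a rough sketch with just the Python standard library (not the actual crawler code; the 10-second budget is an arbitrary choice, and a per-read timeout alone won't catch a slow drip of bytes, so a wall-clock deadline is safer):

    import urllib.error
    import urllib.request
    import urllib.robotparser

    def load_robots(base_url, timeout=10):
        # base_url is assumed to look like "https://example.com" (no trailing slash)
        rp = urllib.robotparser.RobotFileParser()
        try:
            with urllib.request.urlopen(base_url + "/robots.txt", timeout=timeout) as resp:
                rp.parse(resp.read().decode("utf-8", errors="replace").splitlines())
        except urllib.error.HTTPError as err:
            if err.code == 404:
                rp.parse([])  # no robots.txt: everything is allowed
            else:
                rp.parse(["User-agent: *", "Disallow: /"])  # play it safe
        except OSError:  # timeouts, resets, TLS failures, slow-loris stalls
            rp.parse(["User-agent: *", "Disallow: /"])
        return rp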
> The latest was a slow loris approach where it takes forever for robots.txt to download.
I'd treat this in a client the same way as I do in a server application. If the peer is behaving maliciously or improperly, I silently drop the TCP connection without notifying the other party. They can waste their resources by continuing to send bytes for the next few minutes until their own TCP stack realizes what happened.
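At the application level the closest you can get is aborting with an RST (SO_LINGER with a zero timeout), which technically does tell the peer something, just as rudely as possible; a truly silent drop, where their retransmissions go unanswered, needs a rule in the packet filter instead. A rough Python sketch of the RST variant:

    import socket
    import struct

    def abort_connection(sock: socket.socket) -> None:
        # l_onoff=1, l_linger=0: close() discards unsent data and sends a RST
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        sock.close()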
I really appreciate you giving a shit. Not sarcastically -- it seems like you're actually doing everything right, and it makes a difference.
Gating robots.txt might be a mistake, but it also might be a quick way to deal with crawlers who mine robots.txt for pages that are more interesting. It's also a page that's never visited by humans. So if you make it a tarpit, you both refuse to give the bot more information and slow it down.
It's crap that it's affecting your work, but a website owner isn't likely to care about the distinction when they're pissed off at having to deal with bad actors that they should never have to care about.
> It's also a page that's never visited by humans.
Never is a strong word. I have definitely visited robots.txt of various websites for a variety of random reasons.
- remembering the format
- seeing what they might have tried to "hide"
- using it like a site's directory
- testing if the website is working if their main dashboard/index is offline
Yes. I have checked many checkboxes that say "Verify You Are a Human" and they have always confirmed that I am.
In fairness, however, my daughters ask me that question all the time and it is possible that the verification checkboxes are lying to me as part of some grand conspiracy to make me think I am a human when I am not.
I usually hit robots.txt when I want to make fetch requests to a domain from the console without running into CORS or CSP issues. Since it's just a static file, there's no client-side code interfering, which makes it nice for testing. If you're hunting for vulnerabilities it's also worth probing (especially with crawler UAs), since it can leak hidden endpoints or framework-specific paths that devs didn't expect anyone to notice.
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
I don't think you have any idea how serious the issue is. I was, loosely speaking, in charge of application-level performance for a web app at one job. I was asked to make the backend as fast as possible at dumping the last byte of HTML back to the user.
The problem I ran into was that performance was bimodal. We had this one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11000 replies that some guy leaves up in a browser tab all the time, etc.) but it was still bimodal. Eventually I just changed the application-level code to display known bots as one performance trace and everything else as another trace.
60% of all requests were known bots. This doesn't even count the random-ass bot that some guy started up at an ISP. Yes, this really happened. We were a paying customer of a company who decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.
Not only that, the bots effectively always got a cached response since they all seemed to love to hammer the same pages. Users never got a cached response, since LRU cache eviction meant the actual discussions with real users were always evicted. There were bots that would just rescrape every page they had ever seen every few minutes. There were bots that would just increase their throughput until the backend app would start to slow down.
There were bots that would run the javascript for whatever insane reason and start emulating users submitting forms, etc.
You probably are thinking "but you got to appear in a search index so it is worth it". Not really. Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times. Also we had an employee who was responsible for categorizing our organic search performance. While we had a huge amount of traffic from organic search, it was something like 40% to just one URL.
Retrospectively I'm now aware that a bunch of this was early stage AI companies scraping the internet for data.
> Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times.
Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):
One of our customers was paying a third party to hit our website with garbage traffic a couple times a week to make sure we were rejecting malformed requests. I was forever tripping over these in Splunk while trying to look for legitimate problems.
We also had a period where we generated bad URLs for a week or two, and the worst part was I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.
And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
I don't agree with you about Google being well behaved. They were crawling links marked nofollow, and they also are terrible if you're serving content on vanity URLs. Any throttling they do on one domain name just hits two more.
I guess my position is that it was comparatively well behaved? There were bots that would full-speed blitz the website, for absolutely no reason. You just scraped this page 27 seconds ago, do you really need to check it for an update again? Also it hasn't had a new post in the past 3 years, is it really going to start being lively again?
If I'm understanding you correctly, you had an indexable page that contained links with the nofollow attribute on the <a> tags.
It's possible some other mechanism got those URLs into the crawler like a person visiting them? Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use webmaster tools or whatever it's called now, to request removal.
The dumbest part is that we'd known about this for a long time, and one day someone discovered we'd implemented a feature toggle to remove those URLs but it just never got turned on, despite it having been announced that it had.
They were meant to be interactive URLs on search pages. Someone implemented them, I think, trying to make accessibility (a11y) work, but the bots were slamming us. We also weren't doing canonical URLs right on the destination page, so they got crawled again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
I thought the argument was that if you run on gcp you can masquerade as googlebot and not get a 429 which is obviously false. Instead it looks like the argument is more of a tinfoil hat variety.
btw you don't get dropped if you issue temporary 429s; that only happens when it's consistent and/or the site is broken. That is well documented. And wtf else are they supposed to do if you don't allow them to crawl it and it goes stale?
My friend has a small public gitea instance, only used by him and a few friends. He's getting thousands of requests an hour from bots. I'm sorry, but even if it does not impact his service, at the very least it feels like harassment.
every single IPv4 address in existence receives constant malicious traffic, from uncountably many malicious actors, on all common service ports (80, 443, 22, etc.) and, for HTTP specifically, to an enormous and growing number of common endpoints (mostly WordPress related, last I checked)
if you put your server up on the public internet then this is just table stakes stuff that you always need to deal with, doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else
you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else
also 1000 req/hour is around 1 request every 4 seconds, which is statistically 0 rps for any endpoint that would ever be publicly accessible
I was kind of amazed to learn that apparently if you connect Windows NT4/98/2000/ME to a public IPv4 address it gets infected by a period-correct worm in no time at all. I don't mean that someone uses an RCE to turn it into part of a botnet (that is expected); apparently there are enough infected hosts from 20+ years ago still out there that the Sasser worm is still spreading.
I still remember how we installed Windows PCs at home if no media with the latest service pack was available. Install Windows, download service pack, copy it away, disconnect from internet, throw away everything and install Windows again...
I've heard this point raised elsewhere, and I think it's underplaying the magnitude of the issue.
Background scanner noise on the internet is incredibly common, but the AI scraping is not at the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs onto the customers (which is pretty shit at the end of the day,) but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.
The botnets and DDOS groups that are doing mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.
AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
do we know they didn't download the DB? Maybe the new traffic is the LLM reading the site? (not the training)
I don't know that LLMs read sites. I only know when I use one it tells me it's checking site X, Y, Z, thinking about the results, checking sites A, B, C etc.... I assumed it was actually reading the site on my behalf and not just referring to its internal training knowledge.
Like, how many people are training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.
I would guess site operators can tell the difference between an exhaustive crawl and the targeted specific traffic I'd expect to see from an LLM checking sources on-demand. For one thing, the latter would have time-based patterns attributable to waking hours in the relevant parts of the world, whereas the exhaustive crawl traffic would probably be pretty constant all day and night.
Also to be clear I doubt those big guys are doing these crawls. I assume it's small startups who think they're gonna build a big dataset to sell or to train their own model.
this is a completely fair point, it may be the case that AI scraper bots have recently made the magnitude and/or details of unwanted bot traffic to public IP addresses much worse
but yeah the issue is that as long as you have something accessible to the public, it's ultimately your responsibility to deal with malicious/aggressive traffic
> At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward
What's worse is when you get bots blasting HTTP traffic at every open port, even well known services like SMTP. Seriously, it's a mail server. It identified itself as soon as the connection was opened, if they waited 100ms-300ms before spamming, they'd know that it wasn't HTTP because the other side wouldn't send anything at all if it was. There's literally no need to bombard a mail server on a well known port by continuing to send a load of junk that's just going to fill someone's log file.
I remember putting dummy GET/PUT/HEAD/POST verbs into SMTP relay software a quarter of a century ago. Attackers do not really save themselves time and money by being intelligent about this. So they aren't.
There are attackers out there that send SIP/2.0 OPTIONS requests to the GOPHER port, over TCP.
I have a service running on a high port number on just a straight IPv4 address and it does get a bit of bot traffic, but the bots are generally easy to filter out when looking at logs (well-behaved ones have a domain in their User-Agent and bingbot takes my robots.txt into account; I don't think I've seen the Google crawler. Other bots can generally be worked out as anything that didn't request my manifest.json a few seconds after loading the main page).
I have a small public gitea instance that got thousands of requests per hour from bots.
I encountered exactly one actual problem: the temporary folder for zip snapshots filled up the disk since bots followed all snapshot links and it seems gitea doesn't delete generated snapshots. I made that directory read-only, deleted its contents, and the problem was solved, at the cost of only breaking zip snapshots.
I experienced no other problems.
I did put some user-agent checks in place a while later, but that was just for fun to see if AI would eventually ingest false information.
Yes, and it makes reading your logs needlessly harder. Sometimes I find an odd password being probed, search for it on the web, and find an interesting story: a new backdoor was discovered in a commercial appliance.
In that regard, reading my logs has sometimes led me to interesting articles about cyber security. Also, log flooding may result in your journaling service truncating the log, so you miss something important.
I remember back before ssh was a thing folks would log login attempts -- it was easy to get some people's passwords because it was common for them to accidentally use them as the username (which are always safe to log, amirite?). All you had to do was watch for a failed login followed by a successful login from the same IP.
Sure, why not. Log every secret you come across (or that comes across you). Just don't log your own secrets. Like OP said, it led down some interesting trails.
Just about nobody logs passwords on purpose. But really stupid IoT devices accept credentials as like query strings, or part of the path or something, and it's common to log those. The attacker is sending you passwords meant for a much less secure system.
You probably shouldn't log usernames then, or really any form fields, as users might accidentally enter a password into one of them. Kind of defeats the point of web forms, but safety is important!
A password being put into a normal text field in a properly submitted form is a lot less likely than getting into some query or path. And a database is more likely to be handled properly than some random log file.
If the caller puts it in the query string and you log that? It doesn't have to be valid in your application to make an attacker pass it in.
So unless you're not logging your request path/query string you're doing something very very wrong by your own logic :).
I can't imagine diagnosing issues with web requests without being given the path + query string. You can diagnose without it, but you're sure not making things easier.
Does an attacking bot know your webserver is not a misconfigured router exposing its web interface to the net? I often am baffled what conclusions people come up with from half reading posts. I had bots attack me with SSH 2.0 login attempts on port 80 and 443. Some people underestimate how bad at computer science some skids are.
Also baffled that three separate people came to that conclusion. Do they not run web servers on the open web or something? Script kiddies are constantly probing urls, and urls come up in your logs. Sure it would be bad if that was how your app was architected. But it's not how it's architected, it's how the skids hope your app is architected. It's not like if someone sends me a request for /wp-login.php that my rails app suddenly becomes WordPress??
> It's not like if someone sends me a request for /wp-login.php that my rails app suddenly becomes WordPress??
You're absolutely right. That's my mistake — you are requesting a specific version of WordPress, but I had written a Rails app. I've rewritten the app as a WordPress plugin and deployed it. Let me know if there's anything else I can do for you.
> Do they not run web servers on the open web or something?
Until AI crawlers chased me off of the web, I ran a couple of fairly popular websites. I just so rarely see anybody including passwords in the URLs anymore that I didn't really consider that as what the commenter was talking about.
Just about every crawler that tries probing for wordpress vulnerabilities does this, or includes them in the naked headers as a part of their deluge of requests.
Running ssh on 80 or 443 is a way to get around boneheaded firewalls that allow http(s) but block ssh, so it's not completely insane to see probes for it.
I recall finding weird URLs in my access logs way back when where someone was trying to hit my machine with the CodeRed worm, a full decade after it was new. That was surreal.
    plaintext_password = request.POST["password"]
    ok = bcrypt.checkpw(plaintext_password.encode(), hashed_password)
    # (now throw away the POST body and plaintext_password -- never log either)
    if ok:
        ...
Bonus points: on user lookup, when no user is found, fetch a dummy hashedPassword, compare, and ignore the result. This will partially mitigate username enumeration via timing attacks.
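A rough sketch of that trick, assuming the Python bcrypt package and a user object with a hashed_password field (the names are just for illustration):

    import bcrypt

    # hash of a throwaway value, computed once, so lookups for unknown
    # usernames still pay the full bcrypt cost
    DUMMY_HASH = bcrypt.hashpw(b"not-a-real-password", bcrypt.gensalt())

    def check_login(user, plaintext_password: bytes) -> bool:
        if user is None:
            bcrypt.checkpw(plaintext_password, DUMMY_HASH)  # compare and ignore the result
            return False
        return bcrypt.checkpw(plaintext_password, user.hashed_password)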
I believe you may have misinterpreted the comment. They're not talking about logs that were made from a login form on their website. They're talking about generic logs (sometimes not even web server logs) being generated because of bots that are attempting to find vulnerabilities on random pages. Pages that don't even exist or may not even be applicable on this server.
Thousands of requests per hour? So, something like 1-3 per second?
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
I love the snark here. I work at a hosting company and the only customers who have issues with crawlers are those who have stupidly slow webpages. It’s hard to have any sympathy for them.
We were only getting 60% of our traffic from bots at my last place because we throttled a bunch of sketchy bots to around 50 simultaneous requests, which was on the order of 100/s. Our customers were paying for SEO, so the bot traffic was a substantial cost of doing business. But as someone tasked with decreasing cluster size, I was forever jealous of the large amount of cluster capacity that wasn't being seen by humans.
One of the most common issues we helped customers solve when I worked in web hosting was low disk alerts, usually because the log rotation had failed. Often the content of those logs was exactly this sort of nonsense and had spiked recently due to a scraper. The sheer size of the logs can absolutely be a problem on a smaller server, which is more and more common now that the inexpensive server is often a VM or a container.
That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.
Serving file content/diff requests from gitea/forgejo is quite expensive computationally. And these bots tend to tarpit themselves when they come across, e.g., a Linux repo mirror.
> Serving file content/diff requests from gitea/forgejo is quite expensive computationally
One time, sure. But unauthenticated requests would surely be cached, authenticated ones skip the cache (just like HN works :) ), as most internet-facing websites end up using this pattern.
There are _lots_ of objects in a large git repository. E.g., I happen to have a fork of VLC lying around. VLC has 70k+ commits (on that version). Each commit has about 10k files. The typical AI crawler wants, for every commit, to download every file (so 700M objects), every tarball (70k+ .tar.gz files), and the blame layer of every file (700M objects, where blame has to look back on average 35k commits). Plus some more.
Saying “just cache this” is not sustainable. And this is only one repository; the only reasonable way to deal with this is some sort of traffic mitigation, you cannot just deal with the traffic as the happy path.
You can't feasibly cache large repositories' diffs/content-at-version without reimplementing a significant part of git - this stuff is extremely high cardinality and you'd just constantly thrash the cache the moment someone does a BFS/DFS through available links (as these bots tend to do).
We were seeing over a million hits per hour from bots and I agree with GP. It’s fucking out of control. And it’s 100x worse at least if you sell vanity URLs, because the good bots cannot tell that they’re sending you 100 simultaneous requests by throttling on one domain and hitting five others instead.
Thousands per hour is 0.3-3 requests per second, which is... not a lot? I host a personal website and it got much more noise before LLMs were even a thing.
The way I get a fast web product is to pay a premium for data. So, no, it's not "lost time" by banning these entities, it's actual saved costs on my bandwidth and compute bills.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.
You are paying premium for data? Do you mean for traffic? Sounds like a bad deal to me. The tiniest Hetzner servers give you 20TB included per month. Either you really have lots of traffic, or you are paying for bad hosting deals.
When you're a business that serves specific customers, it's justifiable to block everyone who isn't your customer. Complaints about overblocking are relevant to public sites, not yours.
> The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product.
I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
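Even a toy in-memory token bucket per IP goes a long way on a single box; a sketch in Python (the numbers are arbitrary, and a real deployment would usually do this in the reverse proxy instead):

    import time
    from collections import defaultdict

    RATE = 5.0     # requests refilled per second, per IP
    BURST = 20.0   # bucket size

    buckets = defaultdict(lambda: (BURST, time.monotonic()))

    def allow(ip: str) -> bool:
        tokens, last = buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        buckets[ip] = (tokens - 1.0 if allowed else tokens, now)
        return allowed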
> I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
Also: who's your webhost? $1/m sounds like a steal.
While we may be smart, a lot of us are extremely pedantic about tech things. I think for many, doing nothing would wind them up the wall, while doing something makes the annoyance smaller.
I disagree with that characterization of the post - I was merely commenting that a user who could come away with this take has never managed a web-facing service, because you'd immediately see that the traffic is immense and constant, especially from crawlers. Sorry if I didn't elaborate that point clearly enough; point taken, and I will craft such responses more carefully so the point isn't misinterpreted or flagged.
I'm sure that would help, yes. Also, there's no need to phrase such a comment in terms of someone else lacking any experience of X - there are too many ways to get that wrong, and even if you're right, it can easily come across as a putdown. If you'd made your point in this case, for example, in terms of your own experience managing a web-facing service, you could have included all the same useful information, if not more!
(One other thing is that the "tell me without telling me" thing is an internet trope and the site guidelines ask people to avoid those - they tend to make for unsubstantive comments, plus they're repetitive and we're trying to avoid that here. But I just mention this for completeness - it's secondary to the other point.)
It's too easy for these things to seem clear and then turn out not to be right at all; moreover there's no need to get personal about these things - it has no benefit and there's an obvious cost.
Again, I would challenge your assertion that this was a personal attack. The comment I responded to seemed, to me, to be coming from a place that has never managed such things on a public-facing web interface; it does not seem possible to me to make such a comment without that prior experience. I will admit that I did not articulate my comment as well as sibling comments have, and it probably came off as unnecessarily snarky, and for that I apologize. I do not see it as a personal attack though, at least not on purpose, and don't see it as being flag-worthy. But that's fine, I don't mod here and don't pretend to know how it is to mod here, so in the future I guess I'll just avoid such impossible-to-discern scrutiny if I can.
It's possible the phrase "personal attack" means something a bit different to you than to me, because otherwise I don't think we're really disagreeing. Your good intentions are clear and I appreciate it! We can use a different phrase if you prefer.
I'd just add one other thing: there's one word in your post here which packs a huge amount of meaning and that's seemed (as in "seemed to be coming from a place [etc.]"). I can't tell you how often it happens that what seems one way to one user—even when the "seems" seems overwhelmingly likely, as in near-impossible that it could be any other way—turns out to simply be mistaken, or at least to seem quite opposite to the other person. It's thousands of times easier to make a mistake in this way than people realize; and unfortunately the cost can be quite high when that happens because the other person often feels indignant ("how dare you assume that I [etc.]").
Yes, I've seen this one in our logs. Quite obnoxious, but at least it identifies itself as a bot and, at least in our case (cgit host), does not generate much traffic. The bulk of our traffic comes from bots that pretend to be real browsers and that use a large number of IP addresses (mostly from Brazil and Asia in our case).
I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).
* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and thought surely most of the bots will not be willing to wait for 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.
* BTW, they also used keep-alive connections if they were offered. So another thing I did was disable keep-alive for the /cgit/ locations; without that, enough bots would routinely hog all the available connections.
* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (which I suggest legitimate users add, via the custom 403 error message). I also currently only do this if the referrer is not present, but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections for legitimate users, but the bots didn't go away. They still make requests, get a 403, and keep coming back. (Rough sketch of the gate below.)
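Conceptually the gate is just something like this (a framework-agnostic Python sketch, not the actual config; `query` and `referrer` come from whatever sits in front of cgit, with `referrer` an empty string when absent):

    from urllib.parse import parse_qs

    def should_block(query: str, referrer: str) -> bool:
        params = parse_qs(query)
        if "id" not in params:            # only gate the expensive cgit URLs
            return False
        if referrer:                      # these bots mostly omit the referrer
            return False
        return "notbot" not in params     # humans add ?notbot=1 as instructed by the 403 page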
My conclusion from this experience is that you really only have two options: either do something ad hoc, very specific to your site (like the notbot in query string) that whoever runs the bots won't bother adapting to or you have to employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (like rate limit, Anubis, etc) is not going to work -- they have enough resources to eat up the cost and/or adapt.
Pick an obscure UA substring like MSIE 3.0 or HP-UX. Preemptively 403 these User-Agents (you'll create your own list). Later in the week you can circle back and distill these 403s down to problematic ASNs. Whack moles as necessary.
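i.e. nothing more sophisticated than a substring check (the two strings here are just the examples above; the real list comes out of your own logs):

    BAD_UA_SUBSTRINGS = ("MSIE 3.0", "HP-UX")

    def should_403(user_agent: str) -> bool:
        return any(s in user_agent for s in BAD_UA_SUBSTRINGS)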
I've tracked bots that were stuck in a loop no legitimate user would ever get stuck in (basically by circularly following links long past the point of any results). I also decided to filter out what were bots for sure, and it was over a million unique IPs.
I (of course) use the djbwares descendant of Bernstein publicfile. I added a static GEMINI UCSPI-SSL tool to it a while back. One of the ideas that I took from the GEMINI specification and then applied to Bernstein's HTTP server was the prohibition on fragments in request URLs (which the Bernstein original allowed), which I extended to a prohibition on query parameters as well (which the Bernstein original also allowed), in both GEMINI and HTTP.
The reasoning for disallowing them in GEMINI applies to static HTTP service (which is what publicfile provides) pretty much as it does to static GEMINI service. Moreover, they did not actually work in Bernstein publicfile unless a site administrator went to extraordinary lengths to create multiple oddly named files (non-trivial to handle from a shell on a Unix or Linux-based system, because of the ? metacharacter) with every possible combination of query parameters, all naming the same file.
Before I introduced this, attempted (and doomed to fail) exploits against weak CGI and PHP scripts were a large fraction of all of the file not found errors that httpd had been logging. These things were getting as far as hitting the filesystem and doing namei lookups. After I introduced this, they are rejected earlier in the transaction, without hitting the filesystem, when the requested URL is decomposed into its constituent parts.
Bernstein publicfile is rather late to this party, as there are over 2 decades of books on the subject of static sites versus dynamic sites (although in fairness it does pre-date all of them). But I can report that the wisdom when it comes to queries holds up even today, in 2025, and if anything a stronger position can be taken on them now.
To those running static sites, I recommend taking this good idea from GEMINI and applying it to query parameters as well.
Unless you are brave enough to actually attempt to provide query parameter support with static site tooling. (-:
I'm always a little surprised to see how many people take robots.txt seriously on HN. It's nice to see so many folks with good intentions.
However, it's obviously not a real solution. It depends on people knowing about it, and adding the complexity of checking it to their crawler. Are there other more serious solutions? It seems like we've heard about "micropayments" and "a big merkle tree of real people" type solutions forever and they've never materialized.
> It depends on people knowing about it, and adding the complexity of checking it to their crawler.
I can't believe any bot writer doesn't know about robots.txt. They're just so self-obsessed and can't comprehend why the rules should apply to them, because obviously their project is special and it's just everyone else's bot that causes trouble.
(Malicious) bot writers have exactly zero concern for robots.txt. Most bots are malicious. Most bots don't set most of the TCP/IP flags. Their only concern is speed. I block about 99% of port scanning bots by simply dropping any TCP SYN packet that is missing MSS or uses a strange value. The most popular port scanning tool is masscan, which does not set MSS, and some of the malicious user agents also set odd MSS values if they set one at all.
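# drop inbound SYNs to $INTERNET_IP whose TCP MSS option is missing or outside 1280:1460 (catches masscan-style scanners)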
-A PREROUTING -i eth0 -p tcp -m tcp -d $INTERNET_IP --syn -m tcpmss ! --mss 1280:1460 -j DROP
Example rule from the netfilter raw table. This will not help against headless chrome.
The reason this is useful is that many bots first scan for port 443 and then try to enumerate it. The bots that look up domain names to scan will still try, and many of those pick up new names from certs being issued via Let's Encrypt (the certificate transparency logs are public). That is one of the reasons I use the DNS challenge method: get a wildcard cert and sit on it for a while.
Another thing that helps is setting a default host in one's load balancer or web server that serves up a simple static page from a ram disk saying something like "It Worked!", and disabling logging for that default site. In HAProxy one should look up the option "strict-sni". Very old API clients can get blocked if they do not support SNI, but along that line, most bots are really old unsupported code that the botter could not update if their life depended on it.
You do realize VPNs and older connectivity exist that need MSS values lower than 1280, right?
Of course. The nifty thing about open source is that I can configure a system to allow or disallow anything. Each server operator can monitor their legit users' traffic, find what they need to allow, and dump the rest. Corporate VPNs will be using known values. "Free" VPNs can vary wildly, but one need not support them if they choose not to. On some systems I only allow an MSS of 1460, and I also block TCP SYN packets with a TTL greater than 64, but that matches my user base.
I know crawlies are for sure reading robots.txt because they keep getting themselves banned by my disallowed /honeytrap page which is only advertised there.
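The trap is nothing fancy; since the path appears nowhere except robots.txt, the only way to find it is to read the file, roughly:

    User-agent: *
    Disallow: /honeytrap

Anything that then requests /honeytrap earns itself a ban.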
What is the commonality between websites severely affected by bots? I've run a web server from home for years on a .com TLD; it ranks high-ish in Google for relevant keywords, and I don't have any exotic protections against bots on either the router or the server (though I did make an attempt at counting bots, out of curiosity). I get very frequent port scans, and they usually grab the index page, but only rarely follow dynamically-loaded links. I don't really think about bots because there was no noticeable impact when I ran the server on Apache 2, and there is none now with multiple websites running on Axum.
I would guess directory listing? But I'm an idiot, so any elucidation would be appreciated.
For my personal site, I let the bots do whatever they want—it's a static site with like 12 pages, so they'd essentially need to saturate the (gigabit) network before causing me any problems.
On the other hand, I had to deploy Anubis for the SVN web interface for tug.org. SVN is way slower than Git (most pages take 5 seconds to load), and the server didn't even have basic caching enabled, but before last year, there weren't any issues. But starting early this year, the bots started scraping every revision, and since the repo is 20+ years old and has 300k files, there are a lot of pages to scrape. This was overloading the entire server, making every other service hosted there unusable. I tried adding caching and blocking some bad ASNs, but Anubis was (unfortunately) the only solution that seems to have worked.
So, I think that the main commonality is popular-ish sites with lots of pages that are computationally-expensive to generate.
One starts to wonder, at what point might it be actually feasible to do it the other way around, by whitelisting IP ranges. I could see this happening as a community effort, similar to adblocker list curation etc.
Unfortunately, well-behaved bots often have more stable IPs, while bad actors are happy to use residential proxies. If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches. Personally I don't think IP level network information will ever be effective without combining with other factors.
Source: stopping attacks that involve thousands of IPs at my work.
Blocking a residential proxy doesn't sound like a bad idea to me.
My single-layer thought process:
If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.
Let's suppose I'm running a residential proxy. Of course my home IP address changes every day, so you'll end up blocking my entire ISP (a major one) or city (a major one) one by one.
And what if I'm behind CGNAT? You will block my entire ISP or city all in one go, and get complaints from a lot of people.
If enough websites block the entire ISP / city in this way, *and* enough users get annoyed by being blocked and switch ISPs, then the ISPs will be motivated to stay in business and police their customers' traffic harder.
Alas, the "enough users get annoyed by being blocked and switch ISPs" step will never happen. Most users only care about the big web properties, and those have the resources to absorb such crawler traffic so they won't get in on the ISP-blocking scheme.
> the ISPs will be motivated to stay in business and police their customers' traffic harder.
You can be completely forgiven if you're speaking from a non-US perspective, but this made me laugh pretty hard -- in this country we usually have a maximum of one broadband ISP available from any one address.
A small fraction of a few of the most populous, mostly East-coast, cities, have fiber and a highly asymmetrical DOCSIS cable option. The rest of the country generally has the cable option (if suburban or higher density) and possibly a complete joke of ADSL (like 6-12Mbps down).
There is nearly zero competition, most customers can choose to either keep their current ISP or switch to something with far worse speed/bandwidth caps/latency, such as cellular internet, or satellite.
One of them won't, but enough of them getting blocked would. People do absolutely notice ISP-level blocks when they happen. We're currently seeing it play out in the UK.
But my main point was in the second paragraph, that "enough of them would" will never happen anyway when the only ones doing the blocking are small websites.
The end user will find out whether their ISP is blocking them or Netflix is blocking them. Usually by asking one of them or by talking to someone who already knows the situation. They will find out Netflix is blocking them, not their ISP.
What, exactly, do you want ISPs to do to police their users from earning $10 of cryptocurrency a month, or even worse, from playing free mobile games? Neither one breaks the law btw. Neither one is even detectable. (Not even by the target website! They're just guessing too)
There are also enough websites that nobody is quitting the internet just because they can't get Netflix. They might subscribe to a different steaming service, or take up torrenting. They'll still keep the internet because it has enough other uses, like Facebook. Switching to a different ISP won't help because it will be every ISP because, as I already said, there's nothing the ISP can do about it. Which, on the other hand, means Netflix would ban every ISP and have zero customers left. Probably not a good business decision.
>The end user will find out whether their ISP is blocking them or Netflix is blocking them. Usually by asking one of them or by talking to someone who already knows the situation. They will find out Netflix is blocking them, not their ISP.
You seem to think I said users will think the block is initiated by the ISP and not the website. I said no such thing so I'm not sure where you got this idea.
>What, exactly, do you want ISPs to do
Respond to abuse reports.
>Neither one is even detectable. (Not even by the target website! They're just guessing too)
TFA has IP addresses.
>Which, on the other hand, means Netflix would ban every ISP and have zero customers left.
It's almost like I already said, twice even, that the plan won't work because the big web properties won't be in on it.
Incorrect. They need to be forbidden from policing traffic this way. Companies like netflix will need to either ban every ISP (and therefore go bankrupt) or cope harder.
> If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches.
Are you really? How likely do you think is a legit customer/user to be on the same IP as a residential proxy? Sure residential IPS get reused, but you can handle that by making the block last 6-8 hours, or a day or two.
Very likely. You can voluntarily run one to make ~$10/month in cryptocurrency. Many others are botnets. They aren't signing up for new internet connections solely to run proxies on.
The Pokémon Go company tried that shortly after launch to block scraping. I remember they had three categories of IPs:
- Blacklisted IP (Google Cloud, AWS, etc), those were always blocked
- Untrusted IPs (residential IPs) were given some leeway, but quickly got to 429 if they started querying too much
- Whitelisted IPs (IPv4 addresses legitimately shared by many people), i.e. anything behind a CGNAT; for example, my current data plan tells me my IP is from 5 states over.
You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.
I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.
I have an ad hoc system that is similar, comprised of three lists of networks: known good, known bad, and data center networks. These are rate limited using a geo map in nginx for various expensive routes in my application.
The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
There are a lot of problems with using ASNs, even for well-known data center operators. First, they update so often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr but found that it also caused larger overlaps that introduced false positives from adjacent networks where apparently some legitimate users are. Lastly, I had seen suspicious traffic from data center operators like CATO Networks Ltd and ZScaler that are some kind of enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...
This really seems like they did everything they could and still got abused by borderline criminal activity from scrapers.
But I do really think it had an impact on scraping. It is just a matter of attrition and raising the cost so it hurts more to scrape; the problem can really never go away, because at some point the scrapers can just start paying regular users to collect the data.
It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.
I'm pretty sure I still owe T-Mobile money. When I moved to the EU, we kept our old phone plans for a while. Then, for whatever reason, the USD didn't make it to the USD account in time and we missed a payment. Then T-Mobile cut off the service, and you need to receive a text message to log in to the account. Obviously, that wasn't possible. So we lost the ability to even pay, even while using a VPN. We just decided to let it die, but I'm sure in T-Mobile's eyes, I still owe them.
Tricky to get a list of all cloud providers, all their networks, and then there are cases like CATO Networks Ltd and ZScaler, which are apparently enterprise security products that route clients traffic through their clouds "for security".
It's never either/or: you don't have to choose between white and black lists exclusively and most of the traffic is going to come from grey areas anyway.
Say you whitelist an address/range and some systems detect "bad things". Now what? You remove that address/range from the whitelist? Do you distribute the removal to your peers? Do you communicate the removal to the owner of the unwhitelisted address/range? How does the owner communicate dealing with the issue back? What if the owner of the range is a hosting provider where they don't proactively control the content hosted, yet have robust anti-abuse mechanisms in place? And so on.
Whitelist-only is a huge can of worms and whitelists works best with trusted partner you can maintain out-of-band communication with. Similarly blacklists work best with trusted partners, however to determine addresses/ranges that are more trouble than they are worth. And somewhere in the middle are grey zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address or even a range of addresses as strictly troublesome or strictly trustworthy by default.
Implement blacklists on known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.
Knowing my audience, I've blocked entire countries to stop the pain. Even that was a bit of whack-a-mole. Blocking China cooled off the traffic for a few days, then it came roaring back via Singapore. Blocked Singapore, had a reprieve for a while, and then it was India, with a vengeance.
Cloudflare has been a godsend for protecting my crusty old forum from this malicious, wasteful behavior.
The more we avoid terms, the more negative their connotations become, and the more we forget about history.
I would argue, without any evidence, that when terms are used and embraced, they lose their negative connotations. Because in the end, you want to fight the negativity they represent, not the term itself.
Allow/deny list is more descriptive. That's one good reason for using those terms. Do you agree?
In reply to your argument, the deny list (the actual list, apart from what term we use for it) is necessarily something negatively laden, since the items denied are denied due to the real risks/costs they otherwise impose. So using and embracing the less direct phrase 'black' rather than 'deny' in this case seems unlikely to reduce negative connotations from the phrase 'black'.
It really isn't. It's a novel term, which implies a functional difference from the common term. Like, I can run around insisting on calling soup "food drink" because it's technically more descriptive; that doesn't mean I'm communicating better.
To the extent we have a bug in our language, it’s probably in describing dark brown skin tones as black. Not a problem with the word black per se. (But again, not a problem really meriting a linguistic overhaul.)
What do the lists do? They allow or deny access, right? Seems allow/deny are fitting descriptive terms for them then. White/black are much more ambiguous prefix terms and also come with much more semantic baggage. All in all an easy, clarifying change.
> What do the lists do? They allow or deny access, right?
In part. A whitelisted party is always allowed access. If you are whitelisted to enter my home, you always have access. This is different from conditionally having access, or having access for a pre-set period of time.
Same for a blacklist. An IP on a blacklist clearly communicates that it should not be casually overridden in a way a ‘deny-access list’ does not.
> White/black are much more ambiguous prefix terms and and also come with much more semantic baggage
That baggage includes the broadly-understood meaning of the word. When someone says to whitelist an IP address, it’s unambiguous. If someone says to add an IP address to an allow access list, that’s longer and less clear. Inventing a personal language can be an effective way to think through a problem. But it isn’t a way to communicate.
Black and white are colours. (Practically.) I am sympathetic to where folks arguing for this come from. But we aren’t going to solve racism by literally removing black and white from our language.
> different from conditionally having access, or having access for a pre-set period of time.
Irrelevant, since the terms allowlist/denylist do not presuppose conditionality or pre-set time limits.
> If someone says to add an IP address to an allow access list, that’s longer
Allowlist/denylist (9 + 8 chars) is shorter than whitelist/blacklist (9 + 9 chars).
> Inventing a personal language
Sounds like you think the proposal was to invent a whole new language (or one per person)? I would be against that too. But it is really only about updating a technical industry term pair to a more descriptive and less semantically loaded pair. Win-win.
> we aren’t going to solve racism by literally removing black and white from our language.
Changing to allowlist/denylist would not remove the terms black/white from the language. There are good reasons for making the change that do not involve any claim that doing so would solve racism.
> the terms allowlist/denylist do not presuppose conditionallity or pre-set time limits
They don't pre-suppose anything. They're neologisms. So you have to provide the context when you use them versus being able to leverage what the other person already knows.
> Allowlist/denylist (9 + 8 chars) is shorter than whitelist/blacklist (9 + 9 chars)
The point is you can't just say "allowlist this block of IPs" and walk away, in the way that saying "whitelist these" works.
> really only about updating a technical industry term pair to a more descriptive and less semantically loaded pair
Eh, it looks more like creating jargon to signal group membership.
> There is good reason for making the change that do not involve any claim that doing so would solve racism
I guess I'm not seeing it. Black = bad and white = good are deep cultural priors across the world.
Trying to bend a global language like English to accomodate the fact that we've turned those words into racial designations strikes me as silly. (The term blacklist predates [1] the term black as a racial designator, at least in English, I believe by around 100 years [2]. If we want to go pedantic in the opposite direction, no human actually has black or white skin in natural light.)
(For what it’s worth, I’ve genuinely enjoyed this discussion.)
Oh I think they do presuppose a link to the main everyday meaning of the terms allow and deny. To their merit! But yes they do not presuppose conditionality or time-limits.
> versus being able to leverage what the other person already knows
I'd guess over a million people start learning software dev every year without any prior knowledge of these industry terms. In addition, while dev terms often have English roots, many, maybe even a majority, of new devs are not native English speakers, and for them the other meanings and etymology of whitelist/blacklist might be less familiar and maybe even confusing. In that regard allowlist/denylist have a descriptive advantage, since the main everyday meanings of allow/deny are mnemonic towards their precise technical meaning, and when learning lots of new terms every little mnemonic helps to not get overwhelmed.
> you can't just say allow list this block of IPs and walk away in the way saying whitelist these works.
You can once the term is adopted in a context, like a dev team's style guide. More generally there can be a transition period for any industry terminology change to permeate, but after that there'd be no difference in the number of people who already know the exact industry term meaning vs the number who don't. Allowlist/denylist can be used as drop in replacement nouns and verbs. Thereafter the benefit of saving one character per written use of 'denylist' would accumulate forever, as a bonus. I don't know about you but I'm quite used to technical terms regularly getting updated or replaced in software dev and other technical work so this additional proposed change feels like just one more at a tiny transition cost.
> it looks more like creating jargon to signal group membership
I don't think any argument I've given has that as a premise. Cite me if you think otherwise.
> The term blacklist predates
Yep, but I think gains in descriptiveness and avoiding loaded language have higher priority than etymological preservation, in general and in this case.
> Trying to bend a global language like English
You make the proposed industry term pair change sound earthshaking and iconoclastic. To me it is just a small improvement.
Calling soup drink doesn't clarify anything. There's a lot of soup that is not drink. But "allow" vs "white", "deny" vs "black": one is 100% more descriptive than the other.
Arguing that allow/deny or allow/block is less descriptive is basically an argument of "I want things to stay the same because I'm old" or "I like to use jargon because it makes me look smarter and makes sure newbies have a harder time" (and those are the BEST two reasons of all other possibilities)
for those reasons, it's expected that using "black" instead of "deny" will have more support as programmers age and become more reactionary on average, but it doesn't make it any less stupid and racially insensitive
> basically an argument of "I want things to stay the same because I'm old" or "I like to use jargon because it makes me look smarter and makes sure newbies have a harder time"
It's that everyone I need to communicate this to already understands what those terms mean.
Also, white and blacklisting isn’t technical jargon. It’s used across industries, by people day to day and in common media. Allow/deny listing would be jargon, because nobody outside a small circle uses it and thus unambiguously understands what it means.
It's technical jargon in different industries, but it's still jargon, i.e. words NOT self-explanatory by their normal definitions in mainstream use. Other examples of such terms: "variable", "class".
For the same reason, "allow-list" is not jargon, just like "component" or "extension".
To me there is one issue only: two syllables vs one (not a problem with block vs black for example but a problem with allow vs white) and that's about it.
Of course it is. If I tell someone to allow list a group of people for an event, that requires further explanation. It’s not self explanatory because it’s non-standard.
> just like "component" or "extension"
If you use them the way they are commonly used, yes. If you repurpose them into a neologism, no. (Most non-acronym jargon involves repurposing common words for a specific context. Glass cockpit. Repo. Server.)
If you tell your friend to put somebody on an allow-list and that requires further explanation, I think the problem is not the term but your friend, sorry...
Server, cockpit: those are jargon. Allow and deny just aren't. Whatever.
I understand your point, but my argument is in the more generic aspect.
Consider how whoever complains about blacklist/whitelist would eventually complain about allow/deny and say they are non-inclusive. Where would this stop?
I would say that as long as the term is unequivocal (and not meant to be offensive) in the context, then there's no need to self-censor
That's an empirical premise in a slippery slope style argument. Any evidence to back it up? Who is opposing the terms allow/deny and why? I don't see it.
> no need to self-censor
The terms allow/deny are more directly descriptive and less contested which I see as a clear win-win change, so I've shifted to use those terms. No biggie and I don't feel self-censored by doing so.
Sometimes I wonder how many lifetimes have been wasted by people all around the world fixing CI because a script expected a branch called master. All for absolutely pointless political correctness theatre.
Not to mention more descriptive. If you hear the term "allowlist" or "denylist" it is immediately obvious and self-explanatory, with no prior context needed.
Leaving aside any other reasons, they're just better names.
No yes no.
As for the post you replied to, allow/deny are indeed the more descriptive terms for lists that allow/deny access. Descriptive terms are good and useful.
> There is no need to disagree on such strongly worded statements.
What's the bigoted history of those terms?
from here[0]:
"The English dramatist Philip Massinger used the phrase "black list" in his 1639 tragedy The Unnatural Combat.[2]
"After the restoration of the English monarchy brought Charles II of England to the throne in 1660, a list of regicides named those to be punished for the execution of his father.[3] The state papers of Charles II say "If any innocent soul be found in this black list, let him not be offended at me, but consider whether some mistaken principle or interest may not have misled him to vote".[4] In a 1676 history of the events leading up to the Restoration, James Heath (a supporter of Charles II) alleged that Parliament had passed an Act requiring the sale of estates, "And into this black list the Earl of Derby was now put, and other unfortunate Royalists".[5]"
Are you an enemy of Charles II? Is that what the problem is?
The origin of the term 'black list' had absolutely nothing to do with the melanin content of anyone. When the term was coined, it was simply a list of the enemies of Charles II.
That's why I posted that. I'd also point out that in my lifetime, folks with darker skin called themselves black and proudly so. As Mr. Brown[0][1] will unambiguously tell you. Regardless, extending a claim of bigotry to every use of a term for the property of absorbing visible light is ridiculous on its face.
By your logic, if I wear black socks, I'm a bigot? Or am only a bigot if I actually refer to those socks as "black." Should I use "socks of color" so as not to be a bigot?
If I like that little black dress, I'm a bigot as well? Or only if I say "I like that little black dress?"
Look. I get it. Melanin content is worthless as a determinant of the value of a human. And anyone who thinks otherwise is sorely and sadly mistaken.
It's important to let folks know that there's only one race of sentient primates on this planet -- Homo Sapiens. What's more, we are all, no matter where we come from, incredibly closely related from a genetic standpoint.
The history of bigotry, murder and enslavement by and to our fellow humans is long, brutal and disgusting.
But nitpicking terms (like black list) that never had anything to do with that bigotry seems performative at best. As I mentioned above, do you also make such complaints about black socks or shoes? Black dresses? Black foregrounds/backgrounds?
If not, why not? That's not a rhetorical question.
Not the person you talked to but I'll join in if I may.
I've switched to using allowlist/denylist in computer contexts because they're more descriptive and less semantically loaded or contested. Easy win-win.
Using 'black' to refer to the color of objects is fine by me.
'Black power!' as a political slogan self-chosen by groups identifying as black is fine too, in contexts where it is used as a tool in work against existing inequalities (various caveats could be added).
As for 'white/black' as terms for entities that are colorless but inherently valenced (e.g. the items designated white are positive and the items designated black are negative, such as risks or costs), I support switching to other terms when not very costly and when newer terms are descriptive and clear. Such as switching to allowlist/denylist in the context of computers.
As for import, I don't think it is a super important change and I don't think the change would make a huge difference in terms of reducing existing racially disproportional negative outcomes in opportunity, wealth, wellbeing and health. It is only a small terminology change that there's some good reason to accept and no good reason to oppose, so I'm on board.
If your shallow (and dismissive) comments along these lines weren't so, well, shallow and dismissive, I might be inclined to put a little more effort into it.
But they're not, so I didn't.
By all means, congratulate yourself for putting this bigoted "culture warrior" in their (obviously) well deserved corner of shame.
I'm not exactly sure how it's a "childish culture war provocation" to decry bigotry while pointing out that demands to change language unrelated to that bigotry seem performative rather than useful or effective.
Perhaps you might ask some folks who actually experience such bigotry how they feel about that. Are there any such folks in your social circle? I'm guessing not, as they'd likely be much more concerned with the actual violence, discrimination and hatred that's being heaped upon them, rather than inane calls for banning technical jargon completely unrelated to that violence and hatred.
It's completely performative and does exactly zero to address the violence and discrimination. Want to help? Demand that police stop assaulting and murdering people of color. Speak out about the completely unjustified hatred and discrimination our fellow humans are subjected to in housing, employment, education, full participation in political life, the criminal "justice" system and a raft of other issues.
But that's too much work for you, right? It's much easier to pay lip service and jump on anyone who doesn't toe the specific lines you set, despite those lines being performative, ineffective and broadly hypocritical.
Want to make a real difference? That's great! Whinging about blacklists vs. denylists in a network routing context isn't going to do that.
Rather it just points at you being a busybody trying to make yourself feel better at the expense of those actively being discriminated against.
And that's why I didn't engage on any reasonable level with you -- because you don't deserve it. For shame!
Or did I miss something important? I am, after all, quite simple minded.
The question you posed above, the question that piqued my interest that I responded to, was
> What's the bigoted history of those terms?
I barely hinted at the bigotry inherent in the creation of a black list by Charles II in response to the bigotry inherent in the execution of Charles I, as I was curious as to where your interest lay.
Since then you've ignored the bigotry, ignored the black list in the time of Charles II, imagined and projected all manner of nonsense about my position, etc.
I suspect you're simply ignorant of the actual meaning of the word bigot in the time of Charles I & II, and it's hilarious seeing your overly performative accusations of others being performative.
> Want to help? Demand that police stop assaulting and murdering people of color.
I'm not sure how that has any bearing on the question of the bigotry aspect to the Charles II black list but if it makes you feel any better I was a witness against the police in a Black Deaths in Custody Royal Commission a good many years past.
For your interest:
1661 Cowley Cromwell Wks. II. 655 He was rather a well-meaning and deluding Bigot, than a crafty and malicious Impostor.
1741 Watts Improv. Mind i. Wks. (1813) 14 A dogmatist in religion is not a long way off from a bigot.
1844 Stanley Arnold II. viii. 13 [Dr. Arnold] was almost equally condemned, in London as a bigot, and in Oxford as a latitudinarian.
As we're a long way down a tangential rabbit hole here, am I to assume it was yourself who just walked through flagging a run of comments that don't violate the guidelines? Either way, curiosity and genuine exchanges go further than hyperbolic rhetoric.
More than half of my traffic is Bing, Claude and for whatever reason the Facebook bots.
None of these are my main traffic drivers, just the main resource hogs. And they're the main reason when my site turns slow (usually an AI, Microsoft or Facebook bot ignoring any common sense).
China and co are only a very small portion of my malicious traffic, thankfully. It's usually US companies who disrespect my robots.txt and DNS rate limits that cause me the most problems.
There are a lot of dumb questions, and I pose all of them to Claude. There's no infrastructure in place for this, but I would support some business model where my LLM-of-choice compensates website operators for resources consumed by my super dumb questions. Like how content creators get paid when I watch with a YouTube Premium subscription. I doubt it's practical, though.
For me it looks more like out-of-control bots than average requests. For example, a few days ago I blocked a few bots. Google was about 600 requests in 24 hours, Bing 1500, Facebook is mostly blocked right now, and Claude, with 3 different bot types, was about 100k requests in the same time.
There is no reason to query all my sub-sites; it's like a search engine with way too many theoretical pages.
Facebook also did aggressive, daily indexing of way too many pages, using large IP ranges, until I blocked it. I get like one user per week from them; no idea what they want.
And Bing, I learned, "simply" needs hard-enforced rate limits that it kinda learns to respect.
There's a recent phishing campaign with sites hosted by Cloudflare and spam sent through either "noobtech.in" (103.173.40.0/24) or through "worldhost.group" (many, many networks).
"noobtech.in" has no web site, can't accept abuse complaints (their email has spam filters), and they don't respond at all to email asking them for better communication methods. The phishing domains have "mail.(phishing domain)" which resolves back to 103.173.40.0/24. Their upstream is a Russian network that doesn't respond to anything. It's 100% clear that this network is only used for phishing and spam.
It's trivial to block "noobtech.in".
"worldhost.group", though, is a huge hosting conglomerate that owns many, many hosting companies and many, many networks spread across many ASNs. They do not respond to any attempts to communicate with them, but since their web site redirects to "hosting.com", I've sent abuse complaints to them. "hosting.com" has autoresponders saying they'll get back to me, but so far not a single ticket has been answered with anything but the initial autoresponder.
It's really, really difficult to imagine how one would block them, and also difficult to imagine what kind of collateral impact that'd have.
These huge providers, Tencent included, get away with way too much. You can't communicate with them, they don't give the slightest shit about harmful, abusive and/or illegal behavior from their networks, and we have no easy way to simply block them.
I think we, collectively, need to start coming up with things we can do that would make their lives difficult enough for them to take notice. Should we have a public listing of all netblocks that belong to such companies and, as an example, we could choose to autorespond to all email from "worldhost.group" and redirect all web browsing from Tencent so we can tell people that their ISP is malicious?
I don't know what the solution is, but I'd love to feel a bit less like I have no recourse when it comes to these huge mega-corporations.
If you block them and they're legitimate, they'll surely find a way to actually start a dialogue. If that feels too harsh you could also start serving captchas and tarpits, but I'm unsure if it's worth actually bothering with.
Externally I use Cloudflare proxy and internally I put Crowdsec and Modsecurity CRS middlewares in front of Traefik.
After some fine-tuning and eliminating false positives, it is running smoothly. It logs all the temporarily banned and reported IPs (to Crowdsec) and logs them to a Discord channel. On average it blocks a few dozen different IPs each day.
From what I see, there are far more American IPs trying to access non-public resources and attempting to exploit CVEs than there are Chinese ones.
I don't really mind anyone scraping publicly accessible content and the rest is either gated by SSO or located in intranet.
For me personally there is no need to block a specific country, I think that trying to block exploit or flooding attempts is a better approach.
You pass all traffic through Cloudflare.
You do not pass any traffic to Crowdsec, you detect locally and only report blocked IPs.
And with Modsecurity CRS you don't report anything to anyone but configuring and fine tuning is a bit harder.
I don't think they are really blocking anything unless you specifically enable it. But it gives some peace of mind knowing that I could probably enable it quickly if it becomes necessary.
Since I posted an article here about using zip bombs [0], I'm flooded with bots. I'm constantly monitoring and tweaking my abuse detector, but this particular bot mentioned in the article seemed to be pointing to an RSS reader. I whitelisted it at first. But now that I gave it a second look, it's one of the most rampant bots on my blog.
If I had a shady web crawling bot and I implemented a feature for it to avoid zip bombs, I would probably also test it by aggressively crawling a site that is known to protect itself with hand-made zip bombs.
One of the few manual deny-list entries that I have made was not for a Chinese company, but for the ASes of the U.S.A. subsidiary of a Chinese company. It just kept coming back again and again, quite rapidly, for a particular page that was 404. Not for any other related pages, mind. Not for the favicon, robots.txt, or even the enclosing pseudo-directory. Just that 1 page. Over and over.
The directory structure had changed, and the page is now 1 level lower in the tree, correctly hyperlinked long since, in various sitemaps long since, and long since discovered by genuine HTTP clients.
The URL? It now only exists in 1 place on the WWW according to Google. It was posted to Hacker News back in 2017.
(My educated guess is that I am suffering from the page-preloading fallout from repeated robotic scraping of old Hacker News stuff by said U.S.A. subsidiary.)
Out of spite, I'd ignore their request to filter by IP (who knows what their intent is by saying that - maybe they're connecting from VPNs or Tor exit nodes to cause disruption etc), but instead filter by matching for that content in the User-Agent and feeding them a zip bomb.
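Roughly, the idea in Python (Flask just for brevity; the UA pattern and payload size here are illustrative placeholders, not the bot's real fingerprint):

    import gzip
    import re

    from flask import Flask, Response, request

    app = Flask(__name__)

    # 50 MB of zeros compresses down to a few tens of kilobytes.
    BOMB = gzip.compress(b"\0" * (50 * 1024 * 1024))

    # Illustrative pattern; match whatever the misbehaving crawler sends.
    BAD_UA = re.compile(r"thinkbot", re.IGNORECASE)

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def handle(path):
        ua = request.headers.get("User-Agent", "")
        if BAD_UA.search(ua):
            # Claim the body is gzip; the client burns RAM/CPU inflating zeros.
            return Response(
                BOMB,
                headers={"Content-Encoding": "gzip", "Content-Type": "text/html"},
            )
        return "normal content here"

    if __name__ == "__main__":
        app.run()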
I run an instance of a downloader tool and had lots of Chinese IPs mass-downloading YouTube videos with the most generic UA. I started with "just" blocking their ASNs, but they always came back with another one, until I decided to stop bothering and banned China entirely.
I'm confused about why some Chinese ISPs have so many different ASNs - while most major internet providers here have exactly one.
I work for IPinfo. If you need IP Address CIDR blocks for any country or ASN, let me know. I have our data in front of me and can send it over via Github Gist. Thank you.
I’ve been having a heck of a time figuring out where some malicious traffic is coming from. Nobody has been able to give me a straight answer when I give them the ip: 127.1.5.12 Maybe you can help trace-a-route to them? I’d just love to know whois behind that IP. If nothing else, I could let them know to be standards compliant and implement rfc 3514.
Is there a list of Chinese ASNs that you can ban if you don’t do much business there - e.g. all of China, Macau, and select Chinese clouds in SE Asia, Polynesia and Africa? I think they’ve kept HK clean so far.
We solved a lot of our problems by blocking all Chinese ASNs. Admittedly, not the friendliest solution, but there were so many issues originating from Chinese clients that it was easier to just ban the entire country.
It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.
There's some weird ones you'd never think of that originate an inordinate amount of bad traffic. Like Seychelles. A tiny little island nation in the middle of the ocean inhabited by... bots apparently? Cyprus is another one.
Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent and the ASNs stretched well beyond PRC borders.
There is a Chinese player that has taken effective control of various internet-related entities in the Seychelles. Various ongoing court-cases currently.
So the Seychelles traffic is likely really disguised Chinese traffic.
I don't think these are "Chinese players" linked to [1], although it may be that ownership changed hands so many times that the IP addresses have since been leased or bought by Chinese entities.
They're referring to the fact that Chinese game companies (Tencent, Riot through Tencent, etc.) all have executables of varying levels of suspicion (i.e. anti-cheat modules) running in the background on player computers.
Then they're making the claim that those binaries have botnet functionality.
They can exploit local privilege escalation flaws without "RCE".
And you are right, kernel anti-cheats are rumored to be weaponized by hackers, making the previous point even worse.
And when the kid is playing his/her game at home, if daddy or mummy is a person of interest, they are already on the home LAN...
Well, you get the picture: nowhere to run, orders of magnitude worse than it was before.
Nowadays, the only level of protection that administrator/root access rights give you is to mitigate any user mistake which would break his/her system... sad...
If you IP-block all of China and then run a resolver, the logs will quickly fill with innocuous domains whose NS entries are blocked. Add those to a DNS block list, then add their ASN to your company IP block list. Amazing how the traffic you don’t want plummets.
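If you want to automate that loop, something like this rough sketch works (assumes dnspython; the file names are placeholders):

    import ipaddress

    import dns.exception
    import dns.resolver

    # CIDRs you already drop at the firewall (placeholder file name).
    with open("blocked_ranges.txt") as f:
        blocked_nets = [ipaddress.ip_network(line.strip())
                        for line in f if line.strip()]

    def ns_in_blocked_range(domain):
        """True if any nameserver for `domain` resolves into a blocked CIDR."""
        try:
            for ns in dns.resolver.resolve(domain, "NS"):
                for a in dns.resolver.resolve(str(ns.target), "A"):
                    ip = ipaddress.ip_address(a.address)
                    if any(ip in net for net in blocked_nets):
                        return True
        except dns.exception.DNSException:
            pass
        return False

    # Feed in the domains your resolver logged (placeholder file name) and
    # print candidates for the DNS block list so you can review them first.
    with open("resolver_log_domains.txt") as f:
        for domain in f:
            domain = domain.strip()
            if domain and ns_in_blocked_range(domain):
                print(domain)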
The Seychelles has a sweetheart tax deal with India such that a lot of corporations who have an India part and a non-India part will set up a Seychelles corp to funnel cash between the two entities. Through the magic of "Transfer Pricing"[1] they use this to reduce the amount of tax they need to pay.
It wouldn't surprise me if this is related somehow. Like maybe these are Indian corporations using a Seychelles offshore entity to do their scanning because then they can offset the costs against their tax or something. It may be that Cyprus has similar reasons. Istr that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]
So Seychelles may be India-related bots and Cyprus Russia-related bots.
Ignore the trolls. Also, if they are upset with you they should focus their vitriol on me. I block nearly all of BRICS (especially Brazil, as most are hard-wired to not follow even the simplest of rules), most data centers, some VPNs based on MSS, posting from cell phones, and much more. I am always happy to give people the satisfaction of down-voting me since I use uBlock to hide karma.
I’m not claiming everyone pronounces it that way. But he’s an ero, we need to find an ospital, ninety miles an our. You will find government documents and serious newspapers that refer to an hospital.
Likewise, when I was at school, many of my older teachers would say things like "an hotel" although I've not heard anyone say anything but "a hotel" for decades now. I think I've heard "an hospital" relatively recently though.
Weirdly, in certain expressions I say "before mine eyes" even though that fell out of common usage centuries ago, and hasn't really appeared in literature for around a century. So while I wouldn't have encountered it in speech, I've come across enough literary references that it somehow still passed into my diction. I only ever use it for "eyes" though, never anything else starting with a vowel. I also wouldn't use it for something mundane like "My eyes are sore", but I'm not too clear on when or why I use the obsolete form at other times - it just happens!
It all comes down to how the word is pronounced, but it's not consistent. 'H' can sound like it's missing or not. Same with other leading consonants that take 'an'. Some words can go both ways.
My own shitty personal website that is so uninteresting that I do not even wish to disclose here. Hence my lack of understanding of the down-votes for me doing what works for my OWN shitty website, well, server.
In fact, I bet it would choke on a small amount of traffic from here considering it has a shitty vCPU with 512 MB RAM.
Personal sites are definitely interesting, way more interesting than most of the rest of the web.
I was thinking I would put your site into archive.org, using ArchiveBot, with reasonable crawl delay, so that it is preserved if your hardware dies. Ask on the ArchiveTeam IRC if you want that to happen.
It is a public git repository for the most part, that is the essence of my website, not really much writings besides READMEs, comments in code and commits.
A public git repository is even more interesting, for both ArchiveTeam Codearchiver, and Software Heritage. The latter offers an interface for saving code automatically.
After initial save, do they perform automatic git pulls? What happens if there are potential conflicts? I wonder how it all works behind the surface. I know I ran into issues with "git pull --all" before, for example. Or what if it is public software that is not mine? I saved some git repositories (should I do .tar.gz too for the same project? Does it know anything about versions?).
Thanks, appreciate it. I would hope so. I do not care about down-votes per se, my main complaint is really the fact that I am somehow in the wrong for doing what I deem is right for my shitty server(s).
ucloud ("based in HK") has been an issue (much less lately though), and I had to ban the whole digital ocean AS (US). google cloud, aws and microsoft have also some issues...
hostpapa in the US seems to be becoming the new main issue (via what seems to be an 'IP colocation service'... yes, you read that right).
It's not weird: it's companies putting themselves in places where regulations favor their business models.
It won't be all Chinese companies or people doing the scraping. It's well known that a lot of countries don't mind such traffic as long as it doesn't target themselves or, for the West, some allies.
Laws aren't the same everywhere, so companies can get away with behavior in one place that would seem almost criminal in another.
And what better place to put your scrapers than somewhere with no copyright enforcement.
Russia was the same, but around 2012 they changed their laws and a lot of that traffic dropped off. Companies moved to small islands or small nation states (favoring them with their tax payouts; they don't mind who brings the money in) or the few remaining places, like China, that don't care about copyright.
It's pretty hard to really get rid of such traffic. You can block stuff, but mostly that just changes the response your server gives; the flood keeps knocking at the door.
I'd hope ISPs or someone get more creative someday, but maybe they don't have enough access, and it's hard to do this without fairly creepy visibility into the traffic or accidentally censoring the whole thing.
We solved a similar issue by blocking free-user traffic from data centres (and whitelisting crawlers for SEO). This eliminated most fraudulent usage over VPNs. Commercial users can still access, but free users just get a prompt to pay.
CloudFront is fairly good at marking whether someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate, and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligible level.
If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
And across the water, my wife has banned US IP addresses from her online shop once or twice. She runs a small business making products that don't travel well, and would cost a lot to ship to the US. It's a huge country with many people. Answering pointless queries, saying "No, I can't do that" in 50 different ways and eventually dealing with negative reviews from people you've never sold to and possibly even never talked to... Much easier to mass block. I call it network segmentation. She's also blocked all of Asia, Africa, Australia and half of Europe.
The blocks don't stay in place forever, just a few months.
Google Shopping might be to blame here, and I don't at all blame the response.
I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or another and treats me like a crazy person for asking.
As long as your customer base never travels and needs support, sure, I guess.
The only way of communicating with such companies are chargebacks through my bank (which always at least has a phone number reachable from abroad), so I’d make sure to account for these.
No, outside the US, both Visa and Mastercard regularly side with the retailer/supplier. If you process a chargeback simply because a UK company blocks your IP, you will be denied.
Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.
I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.
> Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.
Yes, the issuing and acquiring banks perform an arbitration process, and it's generally a very fair process.
We disputed every chargeback and, post-PSD2 SCA, we won almost all of them and had a 90%+ net recovery rate. Similar US businesses were lucky to hit 10% and were terrified of chargeback limits.
> I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.
Are you sure? More likely, the vendor didn't dispute the successful chargebacks.
I think you might be talking about "fraudulent transaction/cardholder does not recognize" disputes. Yes, when using 3DS (which is now much more common at least in Europe, due to often being required by regulation in the EU/EEA), these are much less likely to be won by the issuer.
But "merchant does not let me cancel" isn't a fraud dispute (and in fact would probably be lost by the issuing bank if raised as such). Those "non-fraudulent disagreement with the merchant disputes" work very similarly in the US and in Europe.
No, you're just wrong here. Merchant doesn't let me cancel will almost always be won by the vendor when they demonstrate that they do allow cancellations within the bounds of the law and contracts. I've won many of these in the EU, too (we actually never lost a dispute for non-compliance with card network rules, because we were _very_ compliant).
I can only assume you are from the US and are assuming your experience will generalise, but it simply does not. Like night and day. Most EU residents who try using chargebacks for illegitimate dispute resolution learn these lessons quickly, as there are far more card cancellations for "friendly fraud" than merchant account closures for excessive chargebacks in the EU - the polar opposite of the US.
And have you won one of these cases in a scenario where the merchant website has a blanket IP ban? That seems very different from cardholders incapable of clicking an “unsubscribe” button they have access to.
Only via the original method of commerce. An online retailer who geoblocks users does not have to open the geoblock for users who move into the geoblocked regions.
I have first-hand experience, as I ran a company that geoblocked US users for legal reasons and successfully defended chargebacks by users who made transactions in the EU and disputed them from the US.
Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.
> Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.
What's true is that in the US, the cardholder can often just say "I've never heard of that merchant", since 3DS is not really a thing, and generally merchants are relatively unlikely to have compelling evidence to the contrary.
But for all non-fraud disputes, they follow the same process.
As commented elsewhere, you're just wrong. It's a significant burden of proof for a cardholder to win a dispute for non-compliance with card network rules and it very rarely happens (outside of actual merchant fraud, which is much rarer in the EU).
Again, you're not aware of the reality outside the US.
> It's a significant burden of proof for a cardholder to win a dispute for non-compliance with card network rules
That's true, but "fraud" and "compliance" aren't the only dispute categories, not by far.
In this case, using Mastercard as an example (as their dispute rules are public [1]), the dispute category would be "Refund not processed".
The corresponding section explicitly lists this as a valid reason: "The merchant has not responded to the return or the cancellation of goods or services."
> Again, you're not aware of the reality outside the US.
Repeating your incorrect assumption doesn't make it true.
a) a Refund Not Processed chargeback is for non-compliance with card network rules, and
b) when the merchant informed the cardholder of its refund policy at the time of purchase, the cardholder must abide by that policy.
We won these every time, because we had a lawful and compliant refund policy and we stuck to it. These are a complete non-issue for vendors outside the US, unless they are genuinely fraudulent.
Honestly, I think you have no experience with card processors outside the US (or maybe at all) and you just can't admit you're wrong, but anyone with experience would tell you how wrong you are in a heartbeat. The idea you can "defeat" geoblocks with chargebacks is much more likely to result in you losing access to credit than a refund.
Are you even trying to see things from a different perspective, or are you just dead set on winning an argument via ad hominems based on incorrect assumptions about my background?
It's quite possible that both of our experiences are real – at least I'm not trying to cast doubt on yours – but my suspicion is that the generalization you're drawing from yours (i.e. chargeback rules, or at least their practical interpretation, being very different between the US and other countries) isn't accurate.
Both in and outside the US, merchants can and do win chargebacks, but a merchant being completely unresponsive to cancellation requests of future services not yet provided (i.e. not of "buyer's remorse" for a service that's not available to them, per terms and conditions) seems like an easy win for the issuer.
"Visiting the website" is the method. It's nonsense to say that visiting from a different location is a different method. I don't care if you won those disputes, you did a bad thing and screwed over your customers.
> "Visiting the website" is the method. It's nonsense to say that visiting from a different location is a different method.
This is a naive view of the internet that does not stand the test of legislative reality. It's perfectly reasonable (and in our case was only path to compliance) to limit access to certain geographic locations.
> I don't care if you won those disputes, you did a bad thing and screwed over your customers.
In our case, our customers were trying to commit friendly fraud by requesting a chargeback because they didn't like a geoblock, which is also what the GP was suggesting.
Using chargebacks this way is nearly unique to the US and thankfully EU banks will deny such frivolous claims.
The ancestor post was about being unable to get support for a product, so I thought you were talking about the same situation. Refusal to support is a legitimate grievance.
Are you saying they tried a chargeback just because they were annoyed at being unable to reach your website? Something doesn't add up here, or am I giving those customers too much credit?
Were you selling them an ongoing website-based service? Then the fair thing would usually be a prorated refund when they change country. A chargeback is bad but keeping all their money while only doing half your job is also bad.
If you read back in the thread, we're talking about the claim that adding geoblocking will result in chargebacks, which outside the US, it won't.
> Are you saying they tried a chargeback just because they were annoyed at being unable to reach your website?
In our case it was friendly fraud when users tried to use a service which we could not provide in the US (and many other countries due to compliance reasons) and had signed up in the EU, possibly via VPN.
What was inaccessible to them: The service itself, or any means to contact the merchant to cancel an ongoing subscription?
I can imagine a merchant to win a chargeback if a customer e.g. signs up for a service using a VPN that isn't actually usable over the same VPN and then wants money for their first month back.
But if cancellation of future charges is also not possible, I'd consider that an instance of a merchant not being responsive to attempts at cancellation, similar to them simply not picking up the phone or responding to emails.
Usually CC companies require email records (another way of communicating with a company) showing you attempted to resolve the problem but could not. I don’t think “I tried to visit the website that I bought X item from while in Africa and couldn’t get to it” is sufficient.
I'm not precisely sure what point you're trying to make.
In my experience running rather low-traffic (thousands of hits a day) sites, doing just that brought every single annoyance from thousands per day to zero.
Yes, people -can- easily get around it via various listed methods, but don't seem to actually do that unless you're a high value target.
It definitely works, since you’re externalizing your annoyance to people you literally won’t ever hear from because you blanket-banned them based on where they're connecting from. Most of them will just think your site is broken.
This isn't coming from nowhere though. China and Russia don't just randomly happen to have been assigned more bad actors online.
Due to frosty diplomatic relations, there is a deliberate policy to do fuck all to enforce complaints when they come from the west, and at least with Russia, this is used as a means of gray zone cyberwarfare.
China and Russia are being antisocial neighbors. Just like in real life, this does have ramifications for how you are treated.
It seems to be a choice they’re making with their eyes open. If folks running a storefront don’t want to associate with you, it’s not personal in that context. It’s business.
This is a perfectly good solution to many problems, if you are absolutely certain there is no conceivable way your service will be used from some regions.
> Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
Not a problem. Bad actors who are motivated enough to use VPNs or botnets are a different class of attack that has different types of solutions. If you eliminate 95% of your problems with a single IP filter, then you have no good argument to make against it.
Won't help: I get scans and script-kiddy hack attempts from Digital Ocean, Microsoft cloud (Azure, stretchoid.com), Google Cloud, AWS, and lately "hostpapa" via its 'IP colocation service'. Of course they get instantly fail2ban'd (it is not that hard to perform a basic email delivery to an existing account...).
Traffic should be privatized as much as possible between IPv6 addresses (because you still have 'scanners' sweeping the whole internet all the time... "the nice guys scanning the whole internet for your protection"... never to sell any scan data, of course).
Public IP services are done for: going to be hell whatever you do.
The right answer seems to be significantly big 'security and availability teams' with open and super simple internet standards. Yep, the JavaScript internet has to go away, and the app-private protocols do too. No more WHATWG cartel web engine, or the worst: closed network protocols for "apps".
And the most important: hardcore protocol simplicity, but doing a good enough job. It is common sense, but the planned obsolescence and kludgy bloat lovers won't let you...
Worked great for us, but I had to turn it off. Why? Because the IP databases from the two services I was using are not accurate enough and some people in the US were being blocked as if they had a foreign IP address. It happened regularly enough that I reluctantly had to turn it off, and now I have to deal with the non-stop hacking attempts on the website.
For the record, my website is a front end for a local-only business. Absolutely no reason for anyone outside the US to participate.
Okay, but this causes me about 90% of my major annoyances. Seriously. It’s almost always these stupid country restrictions.
I was in the UK. I wanted to buy a movie ticket there. Fuck me, because I have an Austrian IP address, because modern mobile backends pass your traffic through your home mobile operator. So I tried to use a VPN. Fuck me, VPN endpoints are blocked also.
I wanted to buy a Belgian train ticket still from home. Cloudflare fuck me, because I’m too suspicious as a foreigner. It broke their whole API access, which was used by their site.
I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too. And of course my bank card… and I just wanted to order a pizza.
The most annoying is when your fucking app is restricted to your stupid country, and I should use it because your app is a public transport app. Lovely.
And of course, there was that time when I moved to another country… pointless country restrictions everywhere… they really helped.
I remember the times when the saying was that the checkout process should be as frictionless as possible. That sentiment is long gone.
I don’t use a VPN generally, only in specific cases. For example, when I want to reach Australian news. Because of course, as a non-Australian, I couldn’t possibly care about local news. Or when American pages would rather ban Europe than tell me who they sell my data to.
Not OP, but as far as I know that's how it works, yeah.
When I was in China, using a Chinese SIM had half the internet inaccessible (because China). As I was flying out I swapped my SIM back to my North American one... and even within China I had fully unrestricted (though expensive) access to the entire internet.
I looked into it at the time (now that I had access to non-Chinese internet sites!) and forgot the technical details, but it seems that this is how the mobile network works by design. Your provider is responsible for your traffic.
I think a lot of services end up sending you to a sort of generic "not in your country yet!" landing page in an awkward way that can make it hard to "just" get to your account page to do this kind of stuff.
Netflix doesn't have this issue but I've seen services that seem to make it tough. Though sometimes that's just a phone call away.
Though OTOH whining about this and knowing about VPNs and then complaining about the theoretical non-VPN-knower-but-having-subscriptions-to-cancel-and-is-allergic-to-phone-calls-or-calling-their-bank persona... like sure they exist but are we talking about any significant number of people here?
Obligatory side note of "Europe is not a country".
In several European countries, there is no HBO since Sky has some kind of exclusive contract for their content there, and that's where I was accordingly unable to unsubscribe from a US HBO plan.
> Not letting you unsubscribe and blocking your IP are very different things.
When you posted this, what did you envision in your head for how they were prevented from unsubscribing, based on location, but not via IP blocking? I'm really curious.
> Not letting you unsubscribe and blocking your IP are very different things.
How so? They did not let me unsubscribe via blocking my IP.
Instead of being able to access at least my account (if not the streaming service itself, which I get – copyright and all), I'd just see a full screen notice along the lines of "we are not available in your market, stay tuned".
Oddly, my bank has no problem with non-US IPs, but my City's municipal payments site doesn't. I always think it's broken for a moment before realizing I have my VPN turned on.
The percentage of US trips abroad which are to China must be minuscule, and I bet nobody in the US regularly uses a VPN to get a Chinese IP address. So blocking Chinese IP addresses is probably going to have a small impact on US customers. Blocking all abroad IP addresses, on the other hand, would impact people who just travel abroad or use VPNs. Not sure what your point is or why you're comparing these two things.
Not sure I'd call dumping externalities on a minority of your customer base without recourse "capitalism working as intended".
Capitalism is a means to an end, and allowable business practices are a two-way street between corporations and consumers, mediated by regulatory bodies and consumer protection agencies, at least in most functioning democracies.
Maybe, but it doesn't change the fact that no one is going to forbid me from banning IPs. Therefore I will ban IPs and IP ranges, because it is the cheapest solution.
And usually hackers/malicious actors from that country are not afraid to attack anyone that is not russian, because their local law permits attacking targets in other countries.
(It sometimes comes to funny situations where malware doesn't enable itself on Windows machines if it detects that russian language keyboard is installed.)
Lately I've been thinking that the only viable long-term solution are allowlists instead of blocklists.
The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection are a losing strategy as bot behavior becomes more human-like.
Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.
I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.
This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.
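For the curious, the gatekeeper part can be sketched in not much code. This assumes nftables with a pre-created set and an HMAC-based token; the secret, set name, port and redirect target are all placeholders, and a real deployment would want TLS, logging and rate limiting on top:

    import hashlib
    import hmac
    import subprocess
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    SECRET = b"replace-me-distributed-out-of-band"   # placeholder shared secret
    TOKEN_TTL = 300                                  # seconds a token stays valid
    SERVICE_URL = "https://example.org/"             # placeholder redirect target

    def token_ok(token):
        # Token format assumed here: "<unix_ts>.<hex hmac-sha256 of the ts>".
        try:
            ts, sig = token.split(".")
            if time.time() - int(ts) > TOKEN_TTL:
                return False
            expect = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
            return hmac.compare_digest(sig, expect)
        except ValueError:
            return False

    class Gate(BaseHTTPRequestHandler):
        def do_GET(self):
            token = parse_qs(urlparse(self.path).query).get("token", [""])[0]
            if not token_ok(token):
                self.send_response(403)
                self.end_headers()
                return
            ip = self.client_address[0]
            # Allow the caller's IP for an hour; assumes a set created with
            #   nft add table inet gate
            #   nft add set inet gate allowed '{ type ipv4_addr; flags timeout; }'
            subprocess.run(
                ["nft", "add", "element", "inet", "gate", "allowed",
                 f"{{ {ip} timeout 1h }}"],
                check=False,
            )
            self.send_response(302)
            self.send_header("Location", SERVICE_URL)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), Gate).serve_forever()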
I guess this is what "Identity aware proxy" from GCP can do for you? Outsource all of this to google - where you can connect your own identity servers, and then your service will only be accessed after the identity has been verified.
We have been using that instead of VPN and it has been incredibly nice and performant.
Yeah, I suppose it's something like that. Except that my solution wouldn't rely on Google, would be open source and self-hostable. Are you aware of a similar project that does this? Would save me some time and effort. :)
There also might be similar solutions for other cloud providers or some Kubernetes-adjacent abomination, but I specifically want something generic and standalone.
Lmao I came here to post this. My personal server was making constant hdd grinding noises before I banned the entire nation of China. I only use this server for jellyfin and datahoarding so this was all just logs constantly rolling over from failed ssh auth attempts (PSA: always use public-key, don't allow root, and don't use really obvious usernames like "webadmin" or <literally just the domain>).
Are you familiar with port knocking? My servers will only open port 22, or some other port, after two specific ports have been knocked on in order. It completely eliminates the log files getting clogged.
The bots have been port scanning me for decades. They just don't know which two ports to hit to open 22 for their IP address. Simply iterating won't get them there, and fail2ban doesn't afford them much opportunity to probe.
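For anyone wanting to play with the idea, here's a toy userspace version (real setups usually use knockd or pure firewall rules; the knock ports, timeout and nftables set name below are placeholders):

    import socket
    import subprocess
    import threading
    import time

    KNOCK_SEQUENCE = [7001, 7002]   # placeholder knock ports, hit in this order
    WINDOW = 10                     # seconds allowed to complete the sequence
    progress = {}                   # source ip -> (next expected index, deadline)
    lock = threading.Lock()

    def open_ssh_for(ip):
        # Assumes a pre-created nftables set that your ssh accept rule matches:
        #   nft add set inet filter ssh_allowed '{ type ipv4_addr; flags timeout; }'
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "ssh_allowed",
             f"{{ {ip} timeout 1h }}"],
            check=False,
        )

    def listen_on(port, index):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", port))
        srv.listen(16)
        while True:
            conn, (ip, _) = srv.accept()
            conn.close()  # the knock itself carries no data
            with lock:
                step, deadline = progress.get(ip, (0, 0.0))
                if index == 0:
                    progress[ip] = (1, time.time() + WINDOW)
                elif step == index and time.time() <= deadline:
                    if index == len(KNOCK_SEQUENCE) - 1:
                        open_ssh_for(ip)
                        progress.pop(ip, None)
                    else:
                        progress[ip] = (index + 1, deadline)
                else:
                    progress.pop(ip, None)  # wrong order or too slow: reset

    if __name__ == "__main__":
        for i, p in enumerate(KNOCK_SEQUENCE):
            threading.Thread(target=listen_on, args=(p, i), daemon=True).start()
        while True:
            time.sleep(60)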
Did you really notice a significant drop off in connection attempts? I tried this some years ago and after a few hours on a random very high port number I was already seeing connections.
I use a non standard port and have not had an unknown IP hit it in over 25 years. It's not a security feature for me, I use that to avoid noise.
My public SFTP servers are still on port 22, but I block a lot of SSH bots by giving them a long "VersionAddendum" in /etc/ssh/sshd_config, as most of them choke on it. Mine is 720 characters long. Older SSH clients also choke on this, so test it first if going this route. Some botters will go out of their way to block me instead so their bots don't hang. One will still see the bots in their logs, but there will be far fewer messages and far fewer attempts to log in, as they will be broken, sticky and confused. Be sure to add offensive words to the VersionAddendum for the sites that log SSH banners and display them on their web pages, like shodan.io.
In my experience you can cut out the vast majority of SSH connection attempts by just blocking a couple of IPs... particularly if you've already disabled password auth, because some of the smarter bots notice that and stop trying.
Most of the traffic comes from China and Singapore, so I banned both. I might have to re-check and ban other regions who would never even visit my stupid website anyway. The ones who want to are free to, through VPN. I have not banned them yet.
Naive question: why isn't there a publicly accessible central repository of bad IPs and domains, stewarded by the industry, operated by a nonprofit, like W3C? Yes it wouldn't be enough by itself ("bad" is a very subjective term) but it could be a popular well-maintained baseline.
there are many of these and they are always outdated
Another issue is that things like cloud hosting will happily overlap their ranges with legit business ranges, so if you go that route you will inadvertently also block legitimate things. Not that a regular person cares too much about that, but an abuse list should be accurate.
"I'm seriously thinking that the CCP encourage this with maybe the hope of externalizing the cost of the Great Firewall to the rest of the world. If China scrapes content, that's fine as far as the CCP goes; If it's blocked, that's fine by the CCP too (I say, as I adjust my tin foil hat)."
Then turn the tables on them and make the Great Firewall do your job! Just choose a random snippet about illegal Chinese occupation of Tibet or human rights abuses of Uyghur people each time you generate a page and insert it as a breaker between paragraphs. This should get you blocked in no time :)
I just tried this: I took some strings about Falun Gong and Tiananmen from the Chinese Wikipedia and put them into my SSH server banner. The connection attempts from the Tencent AS ceased completely, but now they come from Russia, Lithuania and Iran instead.
Whoa, that's fascinating. So their botnet runs in multiple regions and will auto-switch if one has problems. Makes sense. Seems a bit strange to use China as the primary, though. Unless of course the attacker is based in China? Of the countries you mentioned Lithuania seems a much better choice. They have excellent pipes to EU and North America, and there's no firewall to deal with
FAFO from both sides. Not defending this bot at all. That said, the shenanigans some rogue or clueless webmasters are up to, blocking legitimate, non-intrusive, non-load-causing M2M traffic, are driving some projects into the arms of 'scrape services' that use far less considerate or ethical means to get to the data you pay them for.
IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.
Exactly. If someone can harm your website by accident, they can absolutely harm it on purpose.
If you feel like you need to do anything at all, I would suggest treating it like any other denial-of-service vulnerability: Fix your server or your application. I can handle 100k clients on a single box, which equates to north of 8 billion daily impressions, and so I am happy to ignore bots and identify them offline in a way that doesn't reveal my methodologies any further than I absolutely have to.
> IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.
That's traffic I want to block, and that's behaviour that I want to punish / discourage. If a set of users get caught up in that, even when they've just been given recycled IP addresses, then there's more chance to bring the shitty 'scraping as a service' behaviour to light, thus to hopefully disinfect it.
(opinion coming from someone definitely NOT hosting public information that must be accessible by the common populace - that's an issue requiring more nuance, but luckily has public funding behind it to develop nuanced solutions - and can just block China and Russia if it's serving a common populace outside of China and Russia).
Trust me, there's nothing 'nuanced' about the contractor that won the website management contract for the next 6-12 months by being the cheapest bidder for it.
What? Are you trying to say it's legitimate to want to scrape websites that are actively blocking you because you think you are "not intrusive"? And that this justifies paying for bad actors to do it for you?
No. I'm talking about literally legitimate information that has to be public by law and/or regulation (typically gov stuff), in formats specifically meant for M2M consumption, and still blocked by clueless or malicious outsourced lowest-bidder site managers.
And no, I do not use those paid services, even though it would make it much easier.
These IP addresses being released at some point and making their way into something else is probably the reason I never got to fully run my mail server from my basement. These companies are just massively giving IP addresses a bad reputation, messing them up for any other use, and then abandoning them. I wonder what this would look like when plotted: AI (and other toxic crawling) companies slowly consuming the IPv4 address space? Ideally we'd force them into some corner of the IPv6 space, I guess. I mean, robots.txt seems not to be of any help here.
I've mentioned my project[0] before, and it's just as sledgehammer-subtle as this bot asks.
I have a firewall that logs every incoming connection to every port. If I get a connection to a port that has nothing behind it, then I consider the IP address that sent the connection to be malicious, and I block the IP address from connecting to any actual service ports.
This works for me, but I run very few things to serve very few people, so there's minimal collateral damage when 'overblocking' happens - the most common thing is that I lock myself out of my VPN (lolfacepalm).
I occasionally look at the database of IP addresses and do some pivot tabling to find the most common networks and have identified a number of cough security companies that do incessant scanning of the IPv4 internet among other networks that give me the wrong vibes.
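The mechanism itself is simple enough to sketch: tail the firewall log, and any source that touches a port with nothing behind it goes into a block set. This assumes the netfilter LOG target writing to /var/log/kern.log and a pre-created nftables set; paths, ports and names are all site-specific:

    import re
    import subprocess
    import time

    SERVICE_PORTS = {22, 80, 443}   # ports with something real behind them
    LOG_LINE = re.compile(r"SRC=(\S+).*?DPT=(\d+)")

    def block(ip):
        # Assumes a pre-created nftables set your input chain drops from:
        #   nft add set inet filter blocklist '{ type ipv4_addr; }'
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "blocklist", f"{{ {ip} }}"],
            check=False,
        )

    with open("/var/log/kern.log") as log:   # wherever your LOG target writes
        log.seek(0, 2)                       # start tailing from the end
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            m = LOG_LINE.search(line)
            if not m:
                continue
            src, dport = m.group(1), int(m.group(2))
            if dport not in SERVICE_PORTS:
                block(src)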
To be fair, he was referring to a post on Alex Schroeder's blog titled with the same name as the term from the Dune books. And that post correctly credits Dune/Herbert. But the post is not about Dune, it's about Spam bots so it's more related to what the original author's post is about.
Speaking of the Butlerian Jihad, Frank Herbert's son (Brian) and another author named Kevin J Anderson co-wrote a few books in the Dune universe, and one of them was about the Butlerian Jihad. I read it. It was good, not as good as Frank Herbert's books, but I still enjoyed it. One of the authors is not as good as the other, because you can kind of tell the writing quality changing per chapter.
That's really hard to believe. Brian Herbert's stuff seems like sort of fan fiction in the world of Dune. Nothing wrong with fan fiction: The Last Ringbearer etc. are pretty enjoyable. But BH just follows on. His work has a bit of the feeling of people who lived in the ruins of the Roman forum https://x.com/museiincomune/status/1799039086906474572
Those books completely misrepresent Frank Herbert's original ideas for Butlerian Jihad. It wasn't supposed to be a literal war against genocidal robots.
Interesting to think that the answer to banning thinking computers in Dune was basically to indoctrinate kids from birth (mentats) and/or doing large quantities of drugs (guild navigators).
I feel like people seem to forget that an HTTP request is, after all, a request. When you serve a webpage to a client, you are consenting to that interaction with a voluntary response.
You can blunt instrument 403 geoblock entire countries if you want, or any user agent, or any netblock or ASN. It’s entirely up to you and it’s your own server and nobody will be legitimately mad at you.
You can rate limit IPs to x responses per day or per hour or per week, whatever you like.
This whole AI scraper panic is so incredibly overblown.
I’m currently working on a sniffer that tracks all inbound TCP connections and UDP/ICMP traffic and can trigger firewall rule addition/removal based on traffic attributes (such as firewalling or rate limiting all traffic from certain ASNs or countries) without actually having to be a reverse proxy in the HTTP flow. That way your in-kernel tables don’t need to be huge and they can just dynamically be adjusted from userspace in response to actual observed traffic.
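A much-simplified sketch of that shape (sniff in userspace, push rules into the kernel) using scapy, counting only TCP SYNs per source rather than full per-ASN/country attributes; the interface, threshold and nftables set name are placeholders:

    import subprocess
    import time
    from collections import defaultdict

    from scapy.all import IP, TCP, sniff

    WINDOW = 60        # seconds
    THRESHOLD = 200    # new connections per window before we push a rule
    seen = defaultdict(list)   # source ip -> timestamps of recent SYNs

    def block(ip):
        # Assumes a pre-created nftables set your input chain drops from.
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "ratelimit_block",
             f"{{ {ip} timeout 1h }}"],
            check=False,
        )

    def on_packet(pkt):
        if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
            return
        flags = int(pkt[TCP].flags)
        if not (flags & 0x02) or (flags & 0x10):   # count SYNs, skip SYN-ACKs
            return
        src = pkt[IP].src
        now = time.time()
        seen[src] = [t for t in seen[src] if now - t < WINDOW] + [now]
        if len(seen[src]) > THRESHOLD:
            block(src)
            seen.pop(src, None)

    sniff(iface="eth0", filter="tcp", prn=on_packet, store=False)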
Unfortunately, HN itself is occasionally used for publicising crawling services that rely on underhand techniques that don't seem terribly different to the ones here.
I don't know if it's because they operate in the service of capital rather than China, as here, but use of those methods in the former case seems to get more of a pass here.
Any respectable web scale crawler(/scraper) should have reverse DNS so that it can automatically be blocked.
Though it would seem all bets are off and anyone will scrape anything. Now we're left with middlemen like cloudflare that cost people millions of hours of time ticking boxes to prove they're human beings.
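For what it's worth, the standard check here is forward-confirmed reverse DNS: reverse-resolve the client IP, check the domain, then forward-resolve to confirm; you can then allow or block accordingly. A minimal sketch (the suffixes are just examples; each crawler documents its own):

    import socket

    # Example suffixes only; each crawler documents the domains its reverse
    # DNS should land in (e.g. Googlebot -> googlebot.com / google.com).
    TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def crawler_verified(ip):
        try:
            host, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
        except socket.herror:
            return False
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirm
        except socket.gaierror:
            return False
        return ip in forward_ips

    if __name__ == "__main__":
        # An address from Google's published crawler range, just as an example.
        print(crawler_verified("66.249.66.1"))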
> A further check showed that all the network blocks are owned by one organization—Tencent. I'm seriously thinking that the CCP encourage this with maybe the hope of externalizing the cost of the Great Firewall to the rest of the world.
A simple check against the IP addresses 170.106.176.0, 150.109.96.0, 129.226.160.0, 49.51.166.0 and 43.135.0.0 showed that they are allocated to Tencent Cloud, a Google Cloud-like rental service.
I'm using their product personally; it's really cheap, a little more than $12-$20 a year for a VPS, and it's from one of the top Internet companies.
Sure, that can't completely rule out the possibility that Tencent is behind all of this, but I don't really think the Communist Party needs to attack your website through Tencent; it's simply not logical.
More likely it's just some company that rented servers on Tencent to crawl the Internet. The rest is probably just your xenophobia-fueled paranoia.
You can feed it an IP address to get an AS ("Autonomous System"), then ask it for all prefixes associated with that AS.
I fed it that first IP address from that list (43.131.0.0) and it showed me the same Tencent-owned AS132203, and it gives back all the prefixes they have here:
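If you'd rather script that lookup, RIPEstat's public data API exposes the same information (endpoint paths as I recall them, so double-check before depending on it):

```python
import json
import urllib.request

def fetch(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

ip = "43.131.0.0"
info = fetch(f"https://stat.ripe.net/data/network-info/data.json?resource={ip}")
asn = info["data"]["asns"][0]  # e.g. "132203"

announced = fetch(
    f"https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS{asn}"
)
for entry in announced["data"]["prefixes"]:
    print(entry["prefix"])
```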
I add reactively. I figure there are "legitimate" IPs that companies use, so I only look at IP addresses that are 'vandalizing' my servers with inappropriate scans and block those.
If I saw the two you have identified, then they would have been added. I do try to strike a balance between "might be a game CDN" or a "legit server" and an outright VPS that is being used to abuse other servers.
But thanks, I will keep an eye on those two ranges.
FWIW, I looked through my list of ~8000 IP addresses; there aren't as many hits for these ranges as I would have thought. It's possible that they're more focused on using known DNS names than simply connecting to 80/443 on random IPs.
Edit: I also checked my Apache logs, I couldn't find any recent logs for "thinkbot".
Yep, good tip! For people who do this: be sure to make it case-insensitive and only match a few distinct parts, nothing too specific. Especially if you only expect browsers, this can mitigate a lot.
You can also filter by allowing, but that risks allowing the wrong thing, since headers are easy to set, so it's better to do it via blocking (sadly); roughly like the sketch below.
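Something like this in nginx (the substrings are placeholders; keep them broad and case-insensitive):

```nginx
map $http_user_agent $block_ua {
    default        0;
    ~*thinkbot     1;   # placeholder entries; build and maintain your own list
    ~*petalbot     1;
}

server {
    listen 80;
    root /var/www/html;

    if ($block_ua) {
        return 403;
    }
}
```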
I think there is an opportunity to train a neural network on browser user agents (they are catalogued, but they vary and change a lot). Then you can block everything that doesn't match.
It would work better than a regex.
A lot of these companies rely on 'but we are clearly recognizable', via for example these user agents, as an excuse to put the burden on sysadmins to maintain blocklists instead of the other way round (keeping a list of sites that may be scraped).
Maybe someone mathy can unburden them?
You could also look at who asks for nonexistent resources and block anyone who asks for more than X of them (X being large enough that a config issue doesn't knock out regular clients). The block might be just a minute, so you don't take on much risk when a false positive occurs; it will likely be enough to make the scraper turn away.
There are many things to do depending on context, app complexity, load, etc. The problem is that there's no really easy way to do these things.
What exactly do you want to train on a falsifiable piece of info? We do something like this at https://visitorquery.com in order to detect HTTP proxies and VPNs but the UA is very unreliable. I guess you could detect based on multiple pieces with UA being one of them where one UA must have x, y, z or where x cannot be found on one UA. Most of the info is generated tho.
> [Y]ou are responsible for [how] your neighbours [use the Internet].
Nope.
I'm very much not responsible for snooping on my neighbor's private communications. If anyone is responsible for doing any sort of abuse monitoring, it is the ISP chosen by my neighbor.
If your CGNAT IP gets blocked then you are responsible for not complaining to your ISP that they're still doing CGNAT and that someone is being abusive within their network.
This is not a normative social prescription, but a descriptive natural phenomenon.
If there's a neighbour running a bitcoin farm in your residential building, it's going to cause issues for you. If people from your country commit crimes in other countries and violate visas, then you are going to face a quota because of them. If you bank at ACME Bank and it turns out they were arms traffickers, your funds were pooled and helped launder their money; you are responsible by association.
Reputation is not only individual; there is group reputation too, whether you like it or not.
What ass-backwards jurisdiction do you live in where any of the things you mention in this paragraph are true, let alone the notion that uninvolved bystanders would be responsible for the behavior of others?
Wouldn't it be better, if there's an easy way, to just feed such bots shit data instead of blocking them? I know it's easier to block and it saves compute and bandwidth, but perhaps feeding them shit data at scale would be a much better longer-term solution.
I recommend you use gzip_static and serve a zip-bomb instead. Frees up the connection sooner and probably causes bad crawlers to exhaust their resources.
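A sketch of that setup (paths and sizes are made up; gzip_static just hands out a pre-built .gz, so nginx does no compression work per request):

```nginx
# Build the payload once, e.g.:
#   dd if=/dev/zero bs=1M count=10240 | gzip -9 > /var/www/trap/index.html.gz
location /trap/ {
    root /var/www;
    gzip_static always;   # serve the .gz even to clients that didn't ask for gzip
    gunzip off;           # and never inflate it ourselves
}
```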
I don't think so. The payload size of the bytes on the wire is small. This premise is all dependent on the .zip being crawled synchronously by the same thread/job making the request.
Because it's the single most falsifiable piece of information you would find on ANY "how to scrape for dummies" article out there. They all start with changing your UA.
Sure, but the article is about a bot that expressly identifies itself in the user agent and its user agent name contains a sentence suggesting you block its ip if you don’t like it. Since it uses at least 74 ips, blocking its user agent seems like a fine idea.
I think the UA is easily spoofed, whereas the AS and IP are less easily spoofed. You already have everything you need to spoof a UA, while you will need resources to spoof your IP, whether it's wall-clock time to set it up, CPU time to insert another network hop, and/or peers or other third parties to route your traffic, and so on. The User Agent is a variable that you can easily change; no real effort, expense, or third parties required.
I know opinions are divided on what I am about to mention, but what about CAPTCHAs to filter bots? Yes, I am well aware we're a decade past a lot of CAPTCHAs being broken by _algorithms_, but I believe it is still a relatively useful general solution, technically -- the question is, would we actually want to filter out non-humans? I am myself on the fence about this: I'm a big fan of what HTTP allows us to do, and I mean specifically computer-to-computer (automation/bots/etc.) HTTP clients. But with today's geopolitical landscape, where the Internet has become a tug of war (sometimes literally), maybe the Butlerian Jihad was onto something? China and Russia are blatantly and near-openly shoving their fingers into every hole they can find, and if this is normalized, Europe and the U.S. will follow as a countermeasure (at least one could imagine it being the case). One could also allow bots -- clients unable to solve a CAPTCHA -- access to very simplified, distilled and _reduced_ content, to give them the minimal goodwill to "index" and "crawl" for ostensibly "good" purposes.
I don’t understand why people want to block bots, especially from a major player like Tencent, while at the same time doing everything they can to be indexed by Google
I think banning IPs is a treadmill you never really get off of. Between cloud providers, VPNs, CGNAT, and botnets, you spend more time whack-a-moling than actually stopping abuse. What’s worked better for me is tarpitting or just confusing the hell out of scrapers so they waste their own resources.
What I'd really love to see (but probably never will) is companies joining forces to share data or support open projects like Common Crawl. That would raise the floor for everyone. But, you know… capitalism, so instead we all reinvent the wheel in our own silos.
If you can automate the treadmill and set a timeout at which point the 'bad' IPs will go back to being 'not necessarily bad', then you're minimising the effort required.
An open project that classifies and records this would need a fair bit of ongoing protection, ironically.
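For the automated-treadmill part above, fail2ban is one off-the-shelf way to get bans that age out on their own; its stock nginx-botsearch filter already matches requests for things that don't exist. Numbers below are arbitrary:

```ini
# /etc/fail2ban/jail.local
[nginx-botsearch]
enabled  = true
port     = http,https
logpath  = /var/log/nginx/access.log
maxretry = 50     # this many junk requests...
findtime = 10m    # ...within this window...
bantime  = 1h     # ...earns a ban that quietly expires on its own
```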
No, it's really the same thing with just different (and more structured) prefix lengths. In IPv4 you usually block a single /32 address first, then a /24 block, etc. In IPv6 you start with a single /128 address, a single LAN is /64, an entire site is usually /56 (residential) or /48 (company), etc.
Note that for the sake of blocking internet clients, there's no point blocking a /128. Just start at /64. Blocking a /128 is basically useless because of SLAAC.
A /64 is the smallest network on which you can run SLAAC, so almost all VLANs should use this. /56 and /48 for end users is what the RIRs are recommending; in reality the prefixes are longer, because ISPs and hosting providers want you to pay as if IPv6 space were some scarce resource.
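In nftables terms, banning the covering /64 rather than the single address looks roughly like this (table, chain, and set names, plus the prefix, are all illustrative):

```sh
nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; policy accept; }'
nft add set inet filter badnets '{ type ipv6_addr; flags interval; }'
nft add rule inet filter input ip6 saddr @badnets drop
# ban the whole /64 that an offending /128 lives in
nft add element inet filter badnets '{ 2001:db8:abcd:1234::/64 }'
```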
> Here's how it identifies itself: “Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.)”.
I mean you could just ban the user agent?
The real issue is with bots pretending not to be bots.
This seems like a plot to redirect residential Chinese traffic through VPNs, which are supposedly mostly operated by only a few entities with a stomach for dishonest maneuvering and surveillance.
If a US IP is abusing the internet, you can go through the courts. If a foreign country is... good luck.
So, are hackers and internet shittery coming from China? Block China's ASNs. Too bad ISPs won't do that, so you have to do it yourself. Keep it blocked until China enforces computer fraud and abuse laws.
It's a simple translation error. They really meant "Feed me worthless synthetic shit at the highest rate you feel comfortable with. It's also OK to tarpit me."
I would never have considered this, but someone on HN pointed out that web user agents work like this. Servers send ads and there is no way for them to enforce that browsers render the ads and hide the content or whatever. The user agent is supposed to act for the user. "Your business model is not my problem", etc.
Well, my user agents work for me, not for you - the server guy who is complaining about this and that. "Your business model is not my problem". Block me if you don't want me.
Well done on pointing out exactly what everyone here is saying in the most arrogant way possible. Also, well done on linking to your own comment where people explain this to you.
The problem is that there is no way to "block me if you don't want me". That's the entire issue. The methods these scrapers use mean it's nigh on impossible to block them.
We block China and Russia. DDOS attacks and other hack attempts went down by 95%.
We have no Chinese users/customers, so in theory this does not affect business at all. Also, Russia is sanctioned and our Russian userbase does not actually live in Russia, so blocking Russia did not affect users at all.
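The mechanics of that are the easy part; the hard part is picking a list you trust and keeping it fresh. A sketch using ipset and a per-country zone file (ipdeny.com is one commonly used source, not an endorsement):

```sh
ipset create geoblock hash:net
curl -s https://www.ipdeny.com/ipblocks/data/countries/cn.zone \
  | while read -r net; do ipset add geoblock "$net"; done
iptables -I INPUT -m set --match-set geoblock src -j DROP
# repeat for ru.zone, and refresh periodically, since allocations move around
```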
How did you choose where to get the IP addresses to block? I guess I'm mostly asking where this problem (i.e. "get all IPs for country X") is on the scale from "obviously solved" to "hard and you need to play catch up constantly".
I did a quick search and found a few databases but none of them looks like the obvious winner.
If you want to test your IP blocks, we have servers in both China and Russia; we can try to take a screenshot from there to see what we get (free, no signup): https://testlocal.ly/
The regulatory burden of conducting business with countries like Russia or China is a critical factor that offhand comments like yours consistently overlook.
It is funny how people immediately jump to conclusions when I was merely pointing out a circular argument. People immediately think I am jumping in to "aid one side". That shows much more about them than about me, actually.
My conclusion is: people downvote, but why, when I am merely stating that the reasoning is circular? Are they unable to grasp this simple fact? Or are they reading more into it than there is? Which is more likely? I choose to believe the second.
You realize we can't see the scores on other people's posts, right? We have no idea whether or how many downvotes you received. You're talking to people who have no interest in doing business with these countries, so I don't understand how this could be circular reasoning.
Your offhand comment also doesn't make sense in the context of this subthread. The effort companies have to invest to do business with Russia and China is prohibitively high, and that's a completely valid concern. It's not that everyone universally hates or loves these countries. It's simply impractical for most businesses to navigate those markets.
> All of the "blockchain is only for drug dealing and scams" people will sooner or later realize that it is the exact type of scenarios that makes it imperative to keep developing trustless systems.
This is like saying “All the “sugar-sweetened beverages are bad for you” people will sooner or later realize it is imperative to drink liquids”. It is perfectly congruent to believe trustless systems are important and that the way the blockchain works is more harmful than positive.
Additionally, the claim is that cryptocurrencies are used like that. Blockchains by themselves have a different set of issues and criticisms.
I've met and worked with many people who never shilled a coin in their whole life and were treated as criminals for merely proposing any type of application on Ethereum.
I got tired of having people yelling online about how "we are burning the planet" and who refused to understand that proof of stake made energy consumption negligible.
To this day, I have my Mastodon instance on some extreme blocklist because "admin is a crypto shill" and their main evidence was some discussion I was having to use ENS as an alternative to webfinger so that people could own their identity without relying on domain providers.
The goalposts keep moving. The critics will keep finding reasons and workarounds. Lots of useful idiots will keep doubling down on the idea that some holy government will show up and enact perfect regulation, even though it's the institutions themselves who are the most corrupt and taking away their freedoms.
The open, anonymous web is on the verge of extinction. We no longer can keep ignoring externalities. We will need to start designing our systems in a way where everyone will need to either pay or have some form of social proof for accessing remote services. And while this does not require any type of block chains or cryptocurrency, we certainly will need to start showing some respect to all the people who were working on them and have learned a thing or two about these problems.
> and who refused to understand that proof of stake made energy consumption negligible.
Proof of stake brought with it its own set of flaws and failed to solve many of the ones which already existed.
> To this day, I have my Mastodon instance on some extreme blocklist because (…)
Maybe. Or maybe you misinterpreted the reason? I don’t know, I only have your side of the story, so won’t comment either way.
> The goalposts keep moving. The critics will keep finding reasons and workarounds.
As will proponents. Perhaps if initial criticisms had been taken seriously and addressed in a timely manner, there wouldn’t have been reason to thoroughly dismiss the whole field. Or perhaps it would’ve played out exactly the same. None of us know.
> even though it's the institutions themselves who are the most corrupt and taking away their freedoms.
Curious that what is probably the most corrupt administration in the history of the USA, the one actively taking away their citizens’ freedoms as we speak, is the one embracing cryptocurrency to the max. And remember all the times the “immutable” blockchains were reverted because it was convenient to those with the biggest stakes in them? They’re far from impervious to corruption.
> And while this does not require any type of block chains or cryptocurrency, we certainly will need to start showing some respect to all the people who were working on them and have learned a thing or two about these problems.
Er, no. For one, the vast majority of blockchain applications were indeed grifts. It’s unfortunate for the minority who had good intentions, but it is what it is. For another, they didn’t invent the concept of trustless systems and cryptography. The biggest lesson we learned from blockchains is how bad of a solution they are. I don’t feel the need to thank anyone for grabbing an idea, doing it badly, wasting tons of resources while ignoring the needs of the world, using it to scam others, then doubling down on it when presented with the facts of its failings.
> Curious that what is probably the most corrupt administration in the history of the USA, the one actively taking away their citizens’ freedoms as we speak, is the one embracing cryptocurrency to the max.
Your memory is quite selective. El Salvador was pushing for Bitcoin way before that, so we already have our share of banana republics (which is what the US is becoming) promoting cryptocurrencies.
Second, the US is "embracing" Bitcoin by backing it up and enabling the creation of off-chain financial instruments. It is a complete corruption and the complete opposite of "trustless systems".
Third, the corruption of the government and their interest in cryptocurrency are orthogonal: the UK is passing bizarre laws to control social media, the EU is pushing for backdoors in messaging systems every other year. None of these institutions are acting with the interests of their citizens at heart, and the more explicit this becomes, the more we will need systems that let us operate trustlessly.
> For another, they didn’t invent the concept of trustless systems and cryptography.
But they are the ones who are actually working and developing practical applications. They are the ones doing actual engineering and dealing with real challenges and solving the problems that people are now facing, such as "how the hell do we deny access to bad actors on the open global internet who have endless resources and have nothing to lose by breaking social norms"?
That read like a bizarre tangent, because it didn’t at all address the argument. To make it clear I’ll repeat the crux of my point, the conclusion that the other arguments lead up to, which you skipped entirely in your reply:
> They’re far from impervious to corruption.
That’s it. That’s the point. You brought up corruption, and I pointed out blockchains don’t actually prevent that. Which you seem to agree with, so I don’t get your response at all.
> But they are the ones who are actually working and developing practical applications.
No, they are not. If no one wants to use them because of all the things they do wrong, they are not practical.
> They are the ones doing actual engineering and dealing with real challenges and solving the problems that people are now facing
No, they are not. They aren’t solving real problems and that is exactly the problem. They are being used almost exclusively for grifts, scams, and hoarding.
> such as "how the hell do we deny access to bad actors on the open global internet who have endless resources and have nothing to lose by breaking social norms"?
That is not a problem blockchains solve.
> You brought up corruption, and I pointed out blockchains don’t actually prevent that.
No. Let's not talk past each other. My point is not about "preventing corruption". My point is that citizens cannot rely on the current web as a system that works in their favor. My point is that corporations and governments alike are using the current web to take away our freedoms, and that we will need systems that do not require trust and/or functional institutions to enforce the rules.
> They are being used almost exclusively for grifts, scams, and hoarding.
"If by whiskey" arguments are really annoying. I am talking about the people doing research in trustless systems. Zero-knowledge proofs. Anonymous transactions. Fraud-proof advertisement impressions.
Scammers, grifters have always existed. Money laundering always existed. And they still happen far more often in the "current" web. There will always be bad actors in any large scale system. My argument is not about "preventing corruption", but to have a system where good actors can act independently even if corruption is prevalent.
> That is not a problem blockchains solve.
Go ahead and try to build a system that keeps access to online resources available to everyone while ensuring that it is cheap for good actors and expensive for bad ones. If you don't want to have any type of blockchain, you will either have to create a whitelist-first network or you will have to rely on an all-powerful entity with policing powers.
Is your trigger word "Ethereum"? He's not even talking about trading crypto or anything you could remotely consider scammy; he's talking about a blockchain-based naming system. You're freaking out over nothing, go home man...
I've been working on a web crawler and have been trying to make it as friendly as possible. Strictly checking robots.txt, crawling slowly, clear identification in the User Agent string, single IP source address. But I've noticed some anti-bot tricks getting applied to the robot.txt file itself. The latest was a slow loris approach where it takes forever for robots.txt to download. I accidentally treated this as a 404, which then meant I continued to crawl that site. I had to change the code so a robots.txt timeout is treated like a Disallow /.
It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
That's like deterring burglars by hiding your doorbell
> The latest was a slow loris approach where it takes forever for robots.txt to download.
I'd treat this in a client the same way as I do in a server application. If the peer is behaving maliciously or improperly, I silently drop the TCP connection without notifying the other party. They can waste their resources by continuing to send bytes for the next few minutes until their own TCP stack realizes what happens.
How do you silently drop a TCP connection? Closing the socket fd usually results in a FIN packet being sent whether I want it to or not.
Additionally, it's not going to be using that many resources before your kernel sends it a RST next time a data packet is sent
TCP_REPAIR: https://tinselcity.github.io/TCP_Repair/
I doubt that’s on purpose. The bad guys that don’t follow robots don’t bother downloading it.
Never attribute to malice what can be attributed to incompetence.
Bad guys might download the robots.txt to find the stuff the site owner doesn't want them to crawl.
It's likely just a shitty attempt to rate limit bots.
I really appreciate you giving a shit. Not sarcastically -- it seems like you're actually doing everything right, and it makes a difference.
Gating robots.txt might be a mistake, but it also might be a quick way to deal with crawlers who mine robots.txt for pages that are more interesting. It's also a page that's never visited by humans. So if you make it a tarpit, you both refuse to give the bot more information and slow it down.
It's crap that it's affecting your work, but a website owner isn't likely to care about the distinction when they're pissed off at having to deal with bad actors that they should never have to care about.
> It's also a page that's never visited by humans.
Never is a strong word. I have definitely visited robots.txt of various websites for a variety of random reasons.
Are you sure you are human?
Yes. I have checked many checkboxes that say "Verify You Are a Human" and they have always confirmed that I am.
In fairness, however, my daughters ask me that question all the time and it is possible that the verification checkboxes are lying to me as part of some grand conspiracy to make me think I am a human when I am not.
https://www.youtube.com/watch?v=4VrLQXR7mKU
--- though I think passing them is more a sign that you're a robot than anything else.
I usually hit robots.txt when I want to make fetch requests to a domain from the console without running into CORS or CSP issues. Since it's just a static file, there's no client-side code interfering, which makes it nice for testing. If you're hunting for vulnerabilities it's also worth probing (especially with crawler UAs), since it can leak hidden endpoints or framework-specific paths that devs didn't expect anyone to notice.
> The latest was a slow loris approach where it takes forever for robots.txt to download
Applying penalties that exclusively hurt people who are trying to be respectful seems counterproductive.
>a slow loris approach
Does this refer to the word "loris" having recently, and only after several years, been added to Wordle™?
No. A slowloris is an existing attack predating Wordle.
No, why would it? The attack was named after the animal, slow loris, many years ago.
Would you be interested in writing an article about it? Sounds really interesting.
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
I don't think you have any idea how serious the issue is. I was loosely speaking in charge of application-level performance at one job for a web app. I was asked to make the backend as fast as possible at dumping the last byte of HTML back to the user.
The problem I ran into was performance was bimodal. We had this one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11000 replies that some guy leaves up on a browser tab all the time, etc.) but it was still bimodal. Eventually I just changed the application level code to display known bots as one performance trace and everything else as another trace.
60% of all requests are known bots. This doesn't even count the random ass bot that some guy started up at an ISP. Yes, this really happened. We were a paying customer of a company that decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.
Not only that, the bots effectively always got a cached response since they all seemed to love to hammer the same pages. Users never got a cached response, since LRU cache eviction meant the actual discussions with real users were always evicted. There were bots that would just rescrape every page they had ever seen every few minutes. There were bots that would just increase their throughput until the backend app would start to slow down.
There were bots that would run the javascript for whatever insane reason and start emulating users submitting forms, etc.
You probably are thinking "but you got to appear in a search index so it is worth it". Not really. Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times. Also we had an employee who was responsible for categorizing our organic search performance. While we had a huge amount of traffic from organic search, it was something like 40% to just one URL.
Retrospectively I'm now aware that a bunch of this was early stage AI companies scraping the internet for data.
> Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times.
Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):
https://www.niss.org/sites/default/files/Tassone_interface6....
One of our customers was paying a third party to hit our website with garbage traffic a couple times a week to make sure we were rejecting malformed requests. I was forever tripping over these in Splunk while trying to look for legitimate problems.
We also had a period where we generated bad URLs for a week or two, and the worst part was I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.
And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
I don't agree with you about Google being well behaved. They were following nofollow links, and they are also terrible if you're serving content on vanity URLs. Any throttling they do on one domain name just hits two more.
I guess my position is that it was comparatively well behaved? There were bots that would blitz the website at full speed, for absolutely no reason. You just scraped this page 27 seconds ago; do you really need to check it for an update again? Also, it hasn't had a new post in the past 3 years; is it really going to start being lively again?
> they were on links marked nofollow
if i'm understanding you correctly you had an indexable page that contained links with nofollow attribute on the <a> tags.
It's possible some other mechanism got those URLs into the crawler like a person visiting them? Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use webmaster tools or whatever it's called now, to request removal.
The dumbest part is that we'd known about this for a long time, and one day someone discovered we'd implemented a feature toggle to remove those URLs but it just never got turned on, despite it being announced that it had.
They were meant to be interactive URLs on search pages. Someone implemented them, I think, trying to make a11y work, but the bots were slamming us. We also weren't doing canonical URLs right on the destination page, so they got searched again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
> And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
Googlebot uses different IP space from gcp
They use the same bank accounts and stock ticker. This is basically a non sequitur.
The point is they’re getting paid to run cloud servers to keep their bots happy and not dropping your website to page six.
I thought the argument was that if you run on gcp you can masquerade as googlebot and not get a 429 which is obviously false. Instead it looks like the argument is more of a tinfoil hat variety.
BTW, you don't get dropped if you issue temporary 429s, only when it's consistent and/or the site is broken. That is well documented. And wtf else are they supposed to do if you don't allow them to crawl it and it goes stale?
My friend has a small public gitea instance, only used by him and a few friends. He's getting thousands of requests an hour from bots. I'm sorry, but even if it does not impact his service, at the very least it feels like harassment.
every single IPv4 address in existence receives constant malicious traffic, from uncountably many malicious actors, on all common service ports (80, 443, 22, etc.) and, for HTTP specifically, to an enormous and growing number of common endpoints (mostly WordPress related, last I checked)
if you put your server up on the public internet then this is just table stakes stuff that you always need to deal with, doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else
you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else
also 1000 req/hour is around 1 request every 4 seconds, which is statistically 0 rps for any endpoint that would ever be publicly accessible
I was kind of amazed to learn that apparently, if you connect Windows NT4/98/2000/ME to a public IPv4 address, it gets infected by a period-correct worm in no time at all. I don't mean that someone uses an RCE to turn it into part of a botnet (that is expected); apparently there are enough infected hosts from 20+ years ago still out there that the Sasser worm is still spreading.
I still remember how we installed Windows PCs at home if no media with the latest service pack was available. Install Windows, download service pack, copy it away, disconnect from internet, throw away everything and install Windows again...
I've heard this point raised elsewhere, and I think it's underplaying the magnitude of the issue.
Background scanner noise on the internet is incredibly common, but the AI scraping is not at the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs onto the customers (which is pretty shit at the end of the day,) but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.
The botnets and DDOS groups that are doing mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.
AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
So weird to scrape Wikipedia when you can just download a DB dump from them.
Really makes you think about the calibre of minds being applied to buzzy problem spaces these days, doesn't it?
do we know they didn't download the DB? Maybe the new traffic is the LLM reading the site? (not the training)
I don't know that LLMs read sites. I only know when I use one it tells me it's checking site X, Y, Z, thinking about the results, checking sites A, B, C etc.... I assumed it was actually reading the site on my behalf and not just referring to its internal training knowledge.
Like, how are people training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.
I would guess site operators can tell the difference between an exhaustive crawl and the targeted specific traffic I'd expect to see from an LLM checking sources on-demand. For one thing, the latter would have time-based patterns attributable to waking hours in the relevant parts of the world, whereas the exhaustive crawl traffic would probably be pretty constant all day and night.
Also to be clear I doubt those big guys are doing these crawls. I assume it's small startups who think they're gonna build a big dataset to sell or to train their own model.
When you have a pile of funding, and you get told to do things quickly.
But the correct way (getting a sql dump) is faster?
Had to get the web scraper working for other websites.
this is a completely fair point, it may be the case that AI scraper bots have recently made the magnitude and/or details of unwanted bot traffic to public IP addresses much worse
but yeah the issue is that as long as you have something accessible to the public, it's ultimately your responsibility to deal with malicious/aggressive traffic
> At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward
What's worse is when you get bots blasting HTTP traffic at every open port, even well-known services like SMTP. Seriously, it's a mail server. It identified itself as soon as the connection was opened; if they waited 100-300 ms before spamming, they'd know it wasn't HTTP, because the other side wouldn't send anything at all if it were. There's literally no need to bombard a mail server on a well-known port by continuing to send a load of junk that's just going to fill someone's log file.
I remember putting dummy GET/PUT/HEAD/POST verbs into SMTP relay software a quarter of a century ago. Attackers do not really save themselves time or money by being intelligent about this. So they aren't.
There are attackers out there that send SIP/2.0 OPTIONS requests to the GOPHER port, over TCP.
It's even funnier when you realize it is a request for a known exploit in WordPress. Does someone really run that on port 22?
I HAVE heard of someone that runs SSH on port 443 and HTTPS on 22.
It blocks a lot of bots, but I feel like just running on a high port number (10,000+) would likely do better.
I have a service running on a high port number on just a straight IPv4 address, and it does get a bit of bot traffic, but the bots are generally easy to filter out when looking at logs. Well-behaved ones have a domain in their User-Agent, and bingbot takes my robots.txt into account; I don't think I've seen the Google crawler. Other bots can generally be worked out as anything that didn't request my manifest.json a few seconds after loading the main page.
I have a small public gitea instance that got thousands of requests per hour from bots.
I encountered exactly one actual problem: the temporary folder for zip snapshots filled up the disk since bots followed all snapshot links and it seems gitea doesn't delete generated snapshots. I made that directory read-only, deleted its contents, and the problem was solved, at the cost of only breaking zip snapshots.
I experienced no other problems.
I did put some user-agent checks in place a while later, but that was just for fun to see if AI would eventually ingest false information.
Yes, and it makes reading your logs needlessly harder. Sometimes I find an odd password being probed, search for it on the web, and find an interesting story, e.g. that a new backdoor was discovered in a commercial appliance.
In that regard, reading my logs has sometimes led me to interesting articles about cybersecurity. Also, log flooding may result in your journaling service truncating the log, so you miss something important.
You log passwords?
I remember back before ssh was a thing folks would log login attempts -- it was easy to get some people's passwords because it was common for them to accidentally use them as the username (which are always safe to log, amirite?). All you had to do was watch for a failed login followed by a successful login from the same IP.
Sure, why not. Log every secret you come across (or that comes across you). Just don't log your own secrets. Like OP said, it led down some interesting trails.
Just about nobody logs passwords on purpose. But really stupid IoT devices accept credentials as like query strings, or part of the path or something, and it's common to log those. The attacker is sending you passwords meant for a much less secure system.
You probably shouldn't log usernames then, or really any form fields, as users might accidentally enter a password into one of them. Kind of defeats the point of web forms, but safety is important!
Are you using a very weird definition of "logging" to make a joke? Web forms don't need any logging to work.
You save them in a database. Probably in clear text. Six of one, half-dozen of the other.
A password being put into a normal text field in a properly submitted form is a lot less likely than getting into some query or path. And a database is more likely to be handled properly than some random log file.
Six of one, .008 of a dozen of the other.
So no access logs at all then? That sounds effective.
> Sometimes I find an odd password being probed, search for it on the web and find an interesting story [...].
Yeah, this is beyond irresponsible. You know the moment you're pwned, __you__ become the new interesting story?
For everyone else, use a password manager to pick a random password for everything.
What is beyond irresponsible? Monitoring logs and researching odd things found there?
How are passwords ending up in your logs? Something is very, very wrong there.
If the caller puts it in the query string and you log that? It doesn't have to be valid in your application to make an attacker pass it in.
So unless you're not logging your request path/query string, you're doing something very, very wrong by your own logic :). I can't imagine diagnosing issues with web requests and not being given the path + query string. You can diagnose without it, but you're sure not making things easier.
Does an attacking bot know your webserver is not a misconfigured router exposing its web interface to the net? I am often baffled by what conclusions people come up with from half-reading posts. I had bots attack me with SSH-2.0 login attempts on ports 80 and 443. Some people underestimate how bad at computer science some skids are.
Also baffled that three separate people came to that conclusion. Do they not run web servers on the open web or something? Script kiddies are constantly probing urls, and urls come up in your logs. Sure it would be bad if that was how your app was architected. But it's not how it's architected, it's how the skids hope your app is architected. It's not like if someone sends me a request for /wp-login.php that my rails app suddenly becomes WordPress??
> It's not like if someone sends me a request for /wp-login.php that my rails app suddenly becomes WordPress??
You're absolutely right. That's my mistake — you are requesting a specific version of WordPress, but I had written a Rails app. I've rewritten the app as a WordPress plugin and deployed it. Let me know if there's anything else I can do for you.
> Do they not run web servers on the open web or something?
Until AI crawlers chased me off of the web, I ran a couple of fairly popular websites. I just so rarely see anybody including passwords in the URLs anymore that I didn't really consider that as what the commenter was talking about.
Just about every crawler that tries probing for wordpress vulnerabilities does this, or includes them in the naked headers as a part of their deluge of requests.
Running ssh on 80 or 443 is a way to get around boneheaded firewalls that allow http(s) but block ssh, so it's not completely insane to see probes for it.
I recall finding weird URLs in my access logs way back when where someone was trying to hit my machine with the CodeRed worm, a full decade after it was new. That was surreal.
The way to handle a password:
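A minimal sketch of that pattern, assuming bcrypt (the names here are illustrative, not from the parent comment):

```python
import bcrypt

# Pre-computed hash of a throwaway value, used when the user doesn't exist,
# so "no such user" takes roughly as long as "wrong password".
DUMMY_HASH = bcrypt.hashpw(b"dummy", bcrypt.gensalt())

def store_password(plaintext: str) -> bytes:
    # Only the salted hash ever gets persisted; the plaintext is never logged.
    return bcrypt.hashpw(plaintext.encode(), bcrypt.gensalt())

def verify_login(user, password: str) -> bool:
    # Compare against a dummy hash when the user is unknown, then discard
    # the result, so the lookup path doesn't leak which usernames exist.
    hashed = user.hashed_password if user else DUMMY_HASH
    ok = bcrypt.checkpw(password.encode(), hashed)
    return ok and user is not None
```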
Bonus points: on user lookup, when no user is found, fetch a dummy hashedPassword, compare, and ignore the result. This will partially mitigate username enumeration via timing attacks.
I believe you may have misinterpreted the comment. They're not talking about logs that were made from a login form on their website. They're talking about generic logs (sometimes not even web server logs) being generated because of bots that are attempting to find vulnerabilities on random pages. Pages that don't even exist or may not even be applicable on this server.
Thousands of requests per hour? So, something like 1-3 per second?
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
Depending on what they're actually pulling down this can get pretty expensive. Bandwidth isn't free.
I love the snark here. I work at a hosting company and the only customers who have issues with crawlers are those who have stupidly slow webpages. It’s hard to have any sympathy for them.
Isn't it part of your job to help them fix that?
How? They are hosting company, not a webshop.
We were only getting 60% of our traffic from bots at my last place because we throttled a bunch of sketchy bots to around 50 simultaneous requests, which was on the order of 100/s. Our customers were paying for SEO, so the bot traffic was a substantial cost of doing business. But as someone tasked with decreasing cluster size, I was forever jealous of the large amount of cluster that wasn't being seen by humans.
One of the most common issues we helped customers solve when I worked in web hosting was low disk alerts, usually because the log rotation had failed. Often the content of those logs was exactly this sort of nonsense and had spiked recently due to a scraper. The sheer size of the logs can absolutely be a problem on a smaller server, which is more and more common now that the inexpensive server is often a VM or a container.
I usually get 10 a second, hitting the same content pages 10 times an hour; is that not what you guys are getting from Googlebot?
> thousounds of requests an hour from bots
That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.
Serving file content/diff requests from gitea/forgejo is quite expensive computationally. And these bots tend to tarpit themselves when they come across eg. a Linux repo mirror.
https://social.hackerspace.pl/@q3k/114358881508370524
I think at this point every self-hosted forge should block diffs from anonymous users.
Also: Anubis and go-away, but also: some people are on old browsers or underpowered computers.
> Serving file content/diff requests from gitea/forgejo is quite expensive computationally
One time, sure. But unauthenticated requests would surely be cached, authenticated ones skip the cache (just like HN works :) ), as most internet-facing websites end up using this pattern.
There are _lots_ of objects in a large git repository. E.g., I happen to have a fork of VLC lying around. VLC has 70k+ commits (on that version). Each commit has about 10k files. The typical AI crawler wants, for every commit, to download every file (so 700M objects), every tarball (70k+ .tar.gz files), and the blame layer of every file (700M objects, where blame has to look back on average 35k commits). Plus some more.
Saying “just cache this” is not sustainable. And this is only one repository; the only reasonable way to deal with this is some sort of traffic mitigation, you cannot just deal with the traffic as the happy path.
You can't feasibly cache large repositories' diffs/content-at-version without reimplementing a significant part of git; this stuff is extremely high cardinality, and you'd just constantly thrash the cache the moment someone does a BFS/DFS through the available links (as these bots tend to do).
We were seeing over a million hits per hour from bots and I agree with GP. It’s fucking out of control. And it’s 100x worse at least if you sell vanity URLs, because the good bots cannot tell that they’re sending you 100 simultaneous requests by throttling on one domain and hitting five others instead.
and this is how the entire web was turned into wordpress slop and cryptoscams
Thousands per hour is 0.3-3 requests per second, which is... not a lot? I host a personal website and it got much more noise before LLMs were even a thing.
The way I get a fast web product is to pay a premium for data. So, no, it's not "lost time" by banning these entities, it's actual saved costs on my bandwidth and compute bills.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.
You are paying premium for data? Do you mean for traffic? Sounds like a bad deal to me. The tiniest Hetzner servers give you 20TB included per month. Either you really have lots of traffic, or you are paying for bad hosting deals.
When you're a business that serves specific customers, it's justifiable to block everyone who isn't your customer. Complaints about overblocking are relevant to public sites, not yours.
>an ideological game of capture the flag
I prefer the whack a mole analogy.
I've seen forums where people spend an inordinate amount of time identifying 'bad bots' for blocking, there'll always be more.
> The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product.
I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
> I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
Also: who's your webhost? $1/m sounds like a steal.
You can sometimes find special (loss-leader) deals in this range on LowEndTalk. Typically you'll have to pay upfront for a block of one or two years.
It starts hitting endpoints that do lots of DB thrashing, and it's usually ones that are NOT common or recent, so caching won't save you.
Serving up a page that takes a few dozen db queries is a lot different than serving a static page.
I'd wager a bet that there's a package.json lying somewhere that holds a lot of dependencies
While we may be smart, a lot of us are extremely pedantic about tech things. I think for many, doing nothing would wind them up the wall, while doing something makes the annoyance smaller.
[flagged]
Please don't cross into personal attack. You can make your substantive points without that.
I disagree with that characterization of the post - merely commenting that a user that could come away with this take has never managed a web-facing service, because you'd immediately see the traffic is immense and constant, especially from crawlers. Sorry if I didn't elaborate that point clearly enough, point taken, I will more carefully craft such responses so such a point isn't misinterpreted or flagged.
I'm sure that would help, yes. Also, there's no need to phrase such a comment in terms of someone else lacking any experience of X - there are too many ways to get that wrong, and even if you're right, it can easily come across as a putdown. If you'd made your point in this case, for example, in terms of your own experience managing a web-facing service, you could have included all the same useful information, if not more!
(One other thing is that the "tell me without telling me" thing is an internet trope and the site guidelines ask people to avoid those - they tend to make for unsubstantive comments, plus they're repetitive and we're trying to avoid that here. But I just mention this for completeness - it's secondary to the other point.)
It think it's fair play to claim that someone doesn't have relevant experience when it seems very clear that they do not.
It's too easy for these things to seem clear and then turn out not to be right at all; moreover there's no need to get personal about these things - it has no benefit and there's an obvious cost.
Again, I would challenge your assertion that this was a personal attack. The comment I responded to seemed, to me, to be coming from a place that has never managed such things on a public-facing web interface; it does not seem possible to me to make such a comment without such prior knowledge. I will admit that I did not articulate my comment as such, as sibling comments sufficiently have done, and it probably came off as unnecessarily snarky, and for that I apologize. I do not see it as a personal attack though, at least not on purpose, and I don't see it as being flag-worthy. But that's fine, I don't mod here, and don't pretend to know what it's like to mod here. So in the future I guess I'll just avoid such impossible-to-discern scrutiny if I can.
It's possible the phrase "personal attack" means something a bit different to you than to me, because otherwise I don't think we're really disagreeing. Your good intentions are clear and I appreciate it! We can use a different phrase if you prefer.
I'd just add one other thing: there's one word in your post here which packs a huge amount of meaning and that's seemed (as in "seemed to be coming from a place [etc.]"). I can't tell you how often it happens that what seems one way to one user—even when the "seems" seems overwhelmingly likely, as in near-impossible that it could be any other way—turns out to simply be mistaken, or at least to seem quite opposite to the other person. It's thousands of times easier to make a mistake in this way than people realize; and unfortunately the cost can be quite high when that happens because the other person often feels indignant ("how dare you assume that I [etc.]").
In the present case, I don't know anything about the experience level of the user who posted https://news.ycombinator.com/item?id=45011628, but https://news.ycombinator.com/item?id=45011442 was definitely posted by someone who has managed heavy-duty web facing services, and that comment says more or less the same thing as the other one.
Yes, I've seen this one in our logs. Quite obnoxious, but at least it identifies itself as a bot and, at least in our case (cgit host), does not generate much traffic. The bulk of our traffic comes from bots that pretend to be real browsers and that use a large number of IP addresses (mostly from Brazil and Asia in our case).
I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).
* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and thought surely most of the bots will not be willing to wait for 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.
* BTW, they also used keep-alive connections if they were offered, so another thing I did was disable keep-alive for the /cgit/ locations. Without that, enough bots would routinely hog up all the available connections.
* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (which I suggest legitimate users add; the custom 403 error message tells them how). I also currently only do this if the referrer is not present, but I may have to change that if the bots adapt; roughly the shape of the sketch below. Overall, this helped with the load and freed up connections for legitimate users, but the bots didn't go away. They still request, get a 403, but keep coming back.
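In nginx terms (a sketch, not the exact config; the backend address is made up):

```nginx
location /cgit/ {
    set $block 0;
    if ($args ~ "id=")    { set $block 1; }  # the URLs the bots hammer
    if ($args ~ "notbot") { set $block 0; }  # escape hatch described in the 403 page
    if ($block) {
        return 403 "Append &notbot to the query string if you are a human.";
    }
    proxy_pass http://127.0.0.1:8080;
}
```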
My conclusion from this experience is that you really only have two options: either do something ad hoc, very specific to your site (like the notbot in query string) that whoever runs the bots won't bother adapting to or you have to employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (like rate limit, Anubis, etc) is not going to work -- they have enough resources to eat up the cost and/or adapt.
Pick an obscure UA substring like MSIE 3.0 or HP-UX. Preemptively 403 these User Agents, (you'll create your own list). Later in the week you can circle back and distill these 403s down to problematic ASNs. Whack moles as necessary.
I've tracked bots that were stuck in a loop no legitimate user would ever get stuck in (basically by circularly following links long past the point of any results). I also decided to filter out what were bots for sure, and it was over a million unique IPs.
I (of course) use the djbwares descendant of Bernstein publicfile. I added a static GEMINI UCSPI-SSL tool to it a while back. One of the ideas that I took from the GEMINI specification and then applied to Bernstein's HTTP server was the prohibition on fragments in request URLs (which the Bernstein original allowed), which I extended to a prohibition on query parameters as well (which the Bernstein original also allowed) in both GEMINI and HTTP.
* https://geminiprotocol.net/docs/protocol-specification.gmi#r...
The reasoning for disallowing them in GEMINI applies pretty much as well to static HTTP service (which is what publicfile provides) as it does to static GEMINI service. They moreover did not actually work in Bernstein publicfile unless a site administrator went to extraordinary lengths to create multiple oddly named files (non-trivial to handle from a shell on a Unix or Linux-based system, because of the ? metacharacter) with every possible combination of query parameters, all naming the same file.
* https://jdebp.uk/Softwares/djbwares/guide/publicfile-securit...
* https://jdebp.uk/Softwares/djbwares/guide/commands/httpd.xml
* https://jdebp.uk/Softwares/djbwares/guide/commands/geminid.x...
Before I introduced this, attempted (and doomed to fail) exploits against weak CGI and PHP scripts were a large fraction of all of the file not found errors that httpd had been logging. These things were getting as far as hitting the filesystem and doing namei lookups. After I introduced this, they are rejected earlier in the transaction, without hitting the filesystem, when the requested URL is decomposed into its constituent parts.
Bernstein publicfile is rather late to this party, as there are over 2 decades of books on the subject of static sites versus dynamic sites (although in fairness it does pre-date all of them). But I can report that the wisdom when it comes to queries holds up even today, in 2025, and if anything a stronger position can be taken on them now.
To those running static sites, I recommend taking this good idea from GEMINI and applying it to query parameters as well.
Unless you are brave enough to actually attempt to provide query parameter support with static site tooling. (-:
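As a rough illustration of the policy being recommended: publicfile and the djbwares tools are C, so this Python sketch only shows the check itself, under the assumption that the rejection happens before the path is ever mapped onto the filesystem. The example URLs are made up.

```python
# Sketch: refuse any request URL carrying a query string or fragment up front,
# before doing any filesystem lookup, on a static-only server.
from urllib.parse import urlsplit

def should_reject(raw_request_url: str) -> bool:
    parts = urlsplit(raw_request_url)
    return bool(parts.query or parts.fragment)

assert should_reject("/index.php?page=../../etc/passwd")    # doomed exploit probe
assert should_reject("/guide/page.html#section")             # fragment in a request URL
assert not should_reject("/Softwares/djbwares/guide.html")   # plain static file
```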
I'm always a little surprised to see how many people take robots.txt seriously on HN. It's nice to see so many folks with good intentions.
However, it's obviously not a real solution. It depends on people knowing about it, and adding the complexity of checking it to their crawler. Are there other more serious solutions? It seems like we've heard about "micropayments" and "a big merkle tree of real people" type solutions forever and they've never materialized.
> It depends on people knowing about it, and adding the complexity of checking it to their crawler.
I can't believe any bot writer doesn't know about robots.txt. They're just so self-obsessed and can't comprehend why the rules should apply to them, because obviously their project is special and it's just everyone else's bot that causes trouble.
(Malicious) bot writers have exactly zero concern for robots.txt. Most bots are malicious. Most bots don't set the usual TCP options. Their only concern is speed. I block about 99% of port scanning bots by simply dropping any TCP SYN packet that is missing the MSS option or uses a strange value. The most popular port scanning tool is masscan, which does not set MSS, and some of the malicious user agents also set odd MSS values if they set it at all.
An example rule for this goes in the netfilter raw table. This will not help against headless Chrome. The reason this is useful is that many bots first scan for port 443 and then try to enumerate it. The bots that look up domain names to scan will still try, and many of those come from new certs being created in Let's Encrypt. That is one of the reasons I use the DNS challenge method: get a wildcard and sit on it for a while.
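The real enforcement belongs in a netfilter raw-table rule; purely as an illustration of the check itself, here is a hedged sketch in Python with scapy that flags SYN packets with a missing or unusual MSS. The set of "normal" values is an assumption, not the poster's list, and should be tuned to what your real users actually send.

```python
# Sketch only: not a netfilter rule, just the same MSS check made visible.
from scapy.all import sniff, IP, TCP

NORMAL_MSS = {1460, 1452, 1440, 1400, 1380, 1280}   # assumption; tune per user base

def flag_suspicious_syn(pkt):
    if IP in pkt and TCP in pkt:
        opts = dict(o for o in pkt[TCP].options if o[0] == "MSS")
        mss = opts.get("MSS")
        if mss is None or mss not in NORMAL_MSS:
            print(f"suspicious SYN from {pkt[IP].src}: MSS={mss}")

# BPF filter restricts capture to bare SYNs; sniffing requires root.
sniff(filter="tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0",
      prn=flag_suspicious_syn, store=False)
```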
Another thing that helps is setting a default host in one's load balancer or web server that serves up a simple static default page from a RAM disk saying something like "It Worked!", and disabling logging for that default site. In HAProxy one should look up the option "strict-sni". Very old API clients can get blocked if they do not support SNI, but along those lines, most bots are really old unsupported code that the botter could not update if their life depended on it.
You do realize VPNs and older connectivity exist that need MSS values lower than 1280, right?
> You do realize VPNs and older connectivity exist that need MSS values lower than 1280, right?
Of course. The nifty thing about open source is that I can configure a system to allow or disallow anything. Each server operator can monitor their legitimate users' traffic, find what they need to allow, and dump the rest. Corporate VPNs will be using known values. "Free" VPNs can vary wildly, but one need not support them if they choose not to. On some systems I only allow an MSS of 1460, and I also block TCP SYN packets with a TTL greater than 64, but that matches my user base.
I know crawlies are for sure reading robots.txt because they keep getting themselves banned by my disallowed /honeytrap page which is only advertised there.
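A sketch of how that kind of trap can be wired up. The only part taken from the comment is that robots.txt disallows a /honeytrap path advertised nowhere else; the path name, log location, and log format below are assumptions for illustration.

```python
# Any client requesting the trap path must have read robots.txt and ignored
# the Disallow rule. This scans a common-log-format access log and collects
# those IPs so they can be fed into a firewall set.
import re

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (\S+)')

def honeytrap_offenders(log_path="/var/log/nginx/access.log", trap="/honeytrap"):
    offenders = set()
    with open(log_path) as log:
        for line in log:
            m = LOG_LINE.match(line)
            if m and m.group(2).startswith(trap):
                offenders.add(m.group(1))   # client IP
    return offenders
```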
robots.txt isn't the law
I doubt it would help any even if it was.
But in general being an asshole is not a crime.
But if you don't follow robots.txt, Santa won't read your letter.
What is the commonality between websites severely affected by bots? I've run a web server from home for years on a .com TLD; it ranks high-ish in Google for relevant keywords, and I don't have any exotic protections against bots on either the router or the server (though I did make an attempt at counting bots, out of curiosity). I get very frequent port scans, and they usually grab the index page, but only rarely follow dynamically-loaded links. I don't even really think about bots because there was no noticeable impact when I ran the server on Apache 2, and there still isn't now that multiple websites run on Axum.
I would guess directory listings? But I'm an idiot, so any elucidation would be appreciated.
For my personal site, I let the bots do whatever they want—it's a static site with like 12 pages, so they'd essentially need to saturate the (gigabit) network before causing me any problems.
On the other hand, I had to deploy Anubis for the SVN web interface for tug.org. SVN is way slower than Git (most pages take 5 seconds to load), and the server didn't even have basic caching enabled, but before last year, there weren't any issues. But starting early this year, the bots started scraping every revision, and since the repo is 20+ years old and has 300k files, there are a lot of pages to scrape. This was overloading the entire server, making every other service hosted there unusable. I tried adding caching and blocking some bad ASNs, but Anubis was (unfortunately) the only solution that seems to have worked.
So, I think that the main commonality is popular-ish sites with lots of pages that are computationally-expensive to generate.
One starts to wonder at what point it might actually be feasible to do it the other way around: whitelisting IP ranges. I could see this happening as a community effort, similar to adblocker list curation etc.
Unfortunately, well-behaved bots often have more stable IPs, while bad actors are happy to use residential proxies. If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches. Personally I don't think IP level network information will ever be effective without combining with other factors.
Source: stopping attacks that involve thousands of IPs at my work.
Blocking a residential proxy doesn't sound like a bad idea to me.
My single-layer thought process:
If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.
Let's suppose I'm running a residential proxy. Of course my home IP address changes every day, so you'll end up blocking my entire ISP (a major one) or city (a major one) one by one.
And what if I'm behind CGNAT? You will block my entire ISP or city all in one go, and get complaints from a lot of people.
If enough websites block the entire ISP / city in this way, *and* enough users get annoyed by being blocked and switch ISPs, then the ISPs will be motivated to stay in business and police their customers' traffic harder.
Alas, the "enough users get annoyed by being blocked and switch ISPs" step will never happen. Most users only care about the big web properties, and those have the resources to absorb such crawler traffic so they won't get in on the ISP-blocking scheme.
> the ISPs will be motivated to stay in business and police their customers' traffic harder.
You can be completely forgiven if you're speaking from a non-US perspective, but this made me laugh pretty hard -- in this country we usually have a maximum of one broadband ISP available from any one address.
A small fraction of the most populous, mostly East Coast, cities have fiber and a highly asymmetrical DOCSIS cable option. The rest of the country generally has the cable option (if suburban or higher density) and possibly a complete joke of ADSL (like 6-12Mbps down).
There is nearly zero competition, most customers can choose to either keep their current ISP or switch to something with far worse speed/bandwidth caps/latency, such as cellular internet, or satellite.
The hapless end user won't blame the ISP first.
One of them won't, but enough of them getting blocked would. People do absolutely notice ISP-level blocks when they happen. We're currently seeing it play out in the UK.
But my main point was in the second paragraph, that "enough of them would" will never happen anyway when the only ones doing the blocking are small websites.
The end user will find out whether their ISP is blocking them or Netflix is blocking them. Usually by asking one of them or by talking to someone who already knows the situation. They will find out Netflix is blocking them, not their ISP.
What, exactly, do you want ISPs to do to stop their users from earning $10 of cryptocurrency a month, or even worse, from playing free mobile games? Neither one breaks the law, btw. Neither one is even detectable. (Not even by the target website! They're just guessing too.)
There are also enough websites that nobody is quitting the internet just because they can't get Netflix. They might subscribe to a different streaming service, or take up torrenting. They'll still keep the internet because it has enough other uses, like Facebook. Switching to a different ISP won't help, because it will be every ISP, since, as I already said, there's nothing the ISP can do about it. Which, on the other hand, means Netflix would ban every ISP and have zero customers left. Probably not a good business decision.
>The end user will find out whether their ISP is blocking them or Netflix is blocking them. Usually by asking one of them or by talking to someone who already knows the situation. They will find out Netflix is blocking them, not their ISP.
You seem to think I said users will think the block is initiated by the ISP and not the website. I said no such thing so I'm not sure where you got this idea.
>What, exactly, do you want ISPs to do
Respond to abuse reports.
>Neither one is even detectable. (Not even by the target website! They're just guessing too)
TFA has IP addresses.
>Which, on the other hand, means Netflix would ban every ISP and have zero customers left.
It's almost like I already said, twice even, that the plan won't work because the big web properties won't be in on it.
Indeed. This is why it was important that "net neutrality" not be the law. ISPs need the power to police their user traffic.
It doesn't have anything to do with net neutrality. It's simply a matter of responding to abuse complaints seriously.
Incorrect. They need to be forbidden from policing traffic this way. Companies like netflix will need to either ban every ISP (and therefore go bankrupt) or cope harder.
> If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches.
Are you really? How likely do you think a legitimate customer/user is to be on the same IP as a residential proxy? Sure, residential IPs get reused, but you can handle that by making the block last 6-8 hours, or a day or two.
In these days of CGNAT, a residential IP is shared by multiple customers.
Very likely. You can voluntarily run one to make ~$10/month in cryptocurrency. Many others are botnets. They aren't signing up for new internet connections solely to run proxies on.
The Pokémon Go company tried that shortly after launch to block scraping. I remember they had three categories of IPs:
- Blacklisted IP (Google Cloud, AWS, etc), those were always blocked
- Untrusted IPs (residential IPs) were given some leeway, but quickly got to 429 if they started querying too much
- Whitelisted IPs (IPv4 addresses legitimately shared by many people, i.e. anything behind a CGNAT); for example, my current data plan tells me my IP is from 5 states over.
You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.
I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.
I have an ad hoc system that is similar, comprising three lists of networks: known good, known bad, and data center networks. These are rate limited using a geo map in nginx for various expensive routes in my application.
The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
There are a lot of problems with using ASNs, even for well-known data center operators. First, they update so often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr but found that it also caused larger overlaps that introduced false positives from adjacent networks where apparently some legitimate users are. Lastly, I had seen suspicious traffic from data center operators like CATO Networks Ltd and ZScaler that are some kind of enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...
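For the merging step specifically, a conservative alternative (sketched here with the Python standard library, not the poster's actual mapcidr pipeline) is ipaddress.collapse_addresses, which only removes redundancy and never widens a list into a supernet that would swallow adjacent ranges. The example ranges are documentation prefixes, not real ASN data.

```python
# collapse_addresses() drops subsumed networks and merges only exactly
# combinable neighbours (two /25s into one /24); it cannot round the list up
# into a broader block that pulls in unrelated adjacent networks.
import ipaddress

asn_ranges = ["203.0.113.0/25", "203.0.113.128/25", "198.51.100.0/24"]
nets = [ipaddress.ip_network(r) for r in asn_ranges]
for net in ipaddress.collapse_addresses(nets):
    print(net)   # 198.51.100.0/24 and 203.0.113.0/24, nothing wider
```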
And none of that accounts for the residential ISPs that bots use to appear like legitimate users: https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....
This really seems like they did everything they could and still got abused by borderline criminal activity from scrapers. But I do really think it had an impact on scraping; it is just a matter of attrition and raising the cost so scraping hurts more. The problem can never really go away, because at some point the scrapers can just start paying regular users to collect the data.
Many US companies do it already.
It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.
I'm pretty sure I still owe T-Mobile money. When I moved to the EU, we kept our old phone plans for a while. Then, for whatever reason, the USD didn't make it to the USD account in time and we missed a payment. Then T-Mobile cut off the service, and you need to receive a text message to log in to the account. Obviously, that wasn't possible. So we lost the ability to even pay, even while using a VPN. We just decided to let it die, but I'm sure in T-Mobile's eyes, I still owe them.
This! Dealing with European services from China is also terrible. As is the other way around. Welcome to the intranet!
In addition, my tencent and alicloud instances are also hammered to death by their own bots. Just to add a bit of perspective.
At that point it almost sounds like we're doing "peering" agreements at the IP level.
Would it make sense to have a class of ISPs that didn't peer with these "bad" network participants?
If this didn’t happen for spam, it’s not going to happen for crawlers.
Why not just ban all IP blocks assigned to cloud providers? It won't halt botnets, but the IP ranges owned by AWS, GCP, etc. are well known.
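At least for AWS this is straightforward, since they publish their ranges at a well-known URL; other providers have their own feeds you would add the same way. A quick sketch (error handling and caching omitted; nothing here is anyone's production setup):

```python
# Print the IPv4 prefixes AWS publishes, which could then be loaded into a
# firewall set or a web server deny list.
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    data = json.load(resp)

prefixes = sorted({entry["ip_prefix"] for entry in data["prefixes"]})
print(f"{len(prefixes)} IPv4 prefixes currently published by AWS")
```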
But my work's VPN is in AWS, and HN and Reddit are sometimes helpful...
Not sure what my point is here tbh. The internet sucks and I don't have a solution
Tricky to get a list of all cloud providers and all their networks, and then there are cases like CATO Networks Ltd and ZScaler, which are apparently enterprise security products that route clients' traffic through their clouds "for security".
Because crawlers would then just use a different IP which isn’t owned by cloud vendors.
It's never either/or: you don't have to choose between white and black lists exclusively and most of the traffic is going to come from grey areas anyway.
Say you whitelist an address/range and some systems detect "bad things". Now what? Do you remove that address/range from the whitelist? Do you distribute the removal to your peers? Do you communicate the removal to the owner of the unwhitelisted address/range? How does the owner communicate back that they've dealt with the issue? What if the owner of the range is a hosting provider that doesn't proactively control the content hosted, yet has robust anti-abuse mechanisms in place? And so on.
Whitelist-only is a huge can of worms, and whitelists work best with trusted partners you can maintain out-of-band communication with. Similarly, blacklists work best with trusted partners, but for determining addresses/ranges that are more trouble than they are worth. And somewhere in the middle are grey-zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address or even a range of addresses as strictly troublesome or strictly trustworthy by default.
Implement blacklists on known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.
That's what I'm trying to do here, PRs welcome: https://github.com/AnTheMaker/GoodBots
Noble effort. I might make some pull requests, though I kinda feel it's futile. I have my own list of "known good" networks.
I admin a few local business sites. I whitelist all of the country's ISPs, and the strangeness in the logs and the attack counts have gone down.
Google indexes from in-country, as do a few other search engines.
Would recommend.
Is there a public curated list of "good ips" to whitelist ?
> Is there a public curated list of "good ips" to whitelist ?
https://github.com/AnTheMaker/GoodBots
So, it's relatively easy because there are a limited number of ISPs in my country. I imagine it's a much harder option for the US.
I looked at all the IP ranges delegated by APNIC, along with every local ISP that I could find, unioned this with
https://lite.ip2location.com/australia-ip-address-ranges
And so far I've not had any complaints, and I think that I have most of them.
At some time in the future, I'll start including https://github.com/ebrasha/cidr-ip-ranges-by-country
Came here to say something similar. The sheer amount of IP addresses one has to block to keep malware and bots at bay is becoming unmanageable.
Knowing my audience, I've blocked entire countries to stop the pain. Even that was a bit of whack-a-mole. Blocking China cooled off the traffic for a few days, then it came roaring back via Singapore. Blocked Singapore, had a reprieve for a while, and then it was India, with a vengeance.
Cloudflare has been a godsend for protecting my crusty old forum from this malicious, wasteful behavior.
Can you explain more about blocking malware as opposed to bots?
No opposition. Just block the IP address.
[flagged]
The more we avoid terms, the more negative their connotations become, and the more we forget about history.
I would argue, without any evidence, that when terms are used and embraced, they lose their negative connotations. Because in the end, you want to fight the negativity they represent, not the term itself.
Allow/deny list is more descriptive. That's one good reason for using those terms. Do you agree?
In reply to your argument, the deny list (the actual list, apart from what term we use for it) is necessarily something negatively laden, since the items denied are denied due to the real risks/costs they otherwise impose. So using and embracing the less direct phrase 'black' rather than 'deny' in this case seems unlikely to reduce negative connotations from the phrase 'black'.
> Allow/deny list is more descriptive
It really isn’t. It’s a novel term, which implies a functional difference from the common term. Like, I can run around insisting on calling soup "food drink" because it’s technically more descriptive; that doesn’t mean I’m communicating better.
To the extent we have a bug in our language, it’s probably in describing dark brown skin tones as black. Not a problem with the word black per se. (But again, not a problem really meriting a linguistic overhaul.)
> It really isn’t.
What do the lists do? They allow or deny access, right? Seems allow/deny are fitting descriptive terms for them then. White/black are much more ambiguous prefix terms and also come with much more semantic baggage. All in all an easy, clarifying change.
> What do the lists do? They allow or deny access, right?
In part. A whitelisted party is always allowed access. If you are whitelisted to enter my home, you always have access. This is different from conditionally having access, or having access for a pre-set period of time.
Same for a blacklist. An IP on a blacklist clearly communicates that it should not be casually overridden in a way a ‘deny-access list’ does not.
> White/black are much more ambiguous prefix terms and and also come with much more semantic baggage
That baggage includes the broadly-understood meaning of the word. When someone says to whitelist an IP address, it’s unambiguous. If someone says to add an IP address to an allow access list, that’s longer and less clear. Inventing a personal language can be an effective way to think through a problem. But it isn’t a way to communicate.
Black and white are colours. (Practically.) I am sympathetic to where folks arguing for this come from. But we aren’t going to solve racism by literally removing black and white from our language.
> different from conditionally having access, or having access for a pre-set period of time.
Irrelevant, since the terms allowlist/denylist do not presuppose conditionality or pre-set time limits.
> If someone says to add an IP address to an allow access list, that’s longer
Allowlist/denylist (9 + 8 chars) is shorter than whitelist/blacklist (9 + 9 chars).
> Inventing a personal language
Sounds like you think the proposal was to invent a whole new language (or one per person)? I would be against that too. But it is really only about updating a technical industry term pair to a more descriptive and less semantically loaded pair. Win-win.
> we aren’t going to solve racism by literally removing black and white from our language.
Changing to allowlist/denylist would not remove the terms black/white from language. There are good reasons for making the change that do not involve any claim that doing so would solve racism.
> the terms allowlist/denylist do not presuppose conditionallity or pre-set time limits
They don't pre-suppose anything. They're neologisms. So you have to provide the context when you use them versus being able to leverage what the other person already knows.
> Allowlist/denylist (9 + 8 chars) is shorter than whitelist/blacklist (9 + 9 chars)
The point is you can't just say "allow list this block of IPs" and walk away, the way saying "whitelist these" works.
> really only about updating a technical industry term pair to a more descriptive and less semantically loaded pair
Eh, it looks more like creating jargon to signal group membership.
> There is good reason for making the change that do not involve any claim that doing so would solve racism
I guess I'm not seeing it. Black = bad and white = good are deep cultural priors across the world.
Trying to bend a global language like English to accommodate the fact that we've turned those words into racial designations strikes me as silly. (The term blacklist predates [1] the term black as a racial designator, at least in English, I believe by around 100 years [2]. If we want to go pedantic in the opposite direction, no human actually has black or white skin in natural light.)
(For what it’s worth, I’ve genuinely enjoyed this discussion.)
[1] https://en.wikipedia.org/wiki/Blacklisting#Origins_of_the_te...
[2] https://nabado.co.ke/2025/01/05/the-origins-and-evolution-of...
> They don't pre-suppose anything
Oh I think they do presuppose a link to the main everyday meaning of the terms allow and deny. To their merit! But yes they do not presuppose conditionality or time-limits.
> versus being able to leverage what the other person already knows
I'd guess over a million people start learning software dev every year without any prior knowledge of these industry terms. In addition, while dev terms often have English roots, many, maybe even a majority, of new devs are not native English speakers, and for them the other meanings and etymology of whitelist/blacklist might be less familiar and maybe even confusing. In that regard allowlist/denylist have a descriptive advantage, since the main everyday meanings of allow/deny are mnemonic for their precise technical meaning, and when learning lots of new terms every little mnemonic helps to keep from getting overwhelmed.
> you can't just say allow list this block of IPs and walk away in the way saying whitelist these works.
You can once the term is adopted in a context, like a dev team's style guide. More generally there can be a transition period for any industry terminology change to permeate, but after that there'd be no difference in the number of people who already know the exact industry term meaning vs the number who don't. Allowlist/denylist can be used as drop in replacement nouns and verbs. Thereafter the benefit of saving one character per written use of 'denylist' would accumulate forever, as a bonus. I don't know about you but I'm quite used to technical terms regularly getting updated or replaced in software dev and other technical work so this additional proposed change feels like just one more at a tiny transition cost.
> it looks more like creating jargon to signal group membership
I don't think any argument I've given has that as a premise. Cite me if you think otherwise.
> The term blacklist predates
Yep, but I think gains in descriptiveness and avoiding loaded language have higher priority than etymological preservation, in general and in this case.
> Trying to bend a global language like English
You make the proposed industry term pair change sound earthshaking and iconoclastic. To me it is just a small improvement.
Thanks for the discussion!
Calling soup drink doesn't clarify anything. There's a lot of soup that is not drink. But with "allow" vs "white" and "deny" vs "black", one is 100% more descriptive than the other.
Arguing that allow/deny or allow/block is less descriptive is basically an argument of "I want things to stay the same because I'm old" or "I like to use jargon because it makes me look smarter and makes sure newbies have a harder time" (and those are the BEST two reasons of all other possibilities)
for those reasons, it's expected that using "black" instead of "deny" will have more support as programmers age and become more reactionary on average, but it doesn't make it any less stupid and racially insensitive
> basically an argument of "I want things to stay the same because I'm old" or "I like to use jargon because it makes me look smarter and makes sure newbies have a harder time"
It’s that everyone I need to communicate this to already understands what those terms mean.
Also, white and blacklisting isn’t technical jargon. It’s used across industries, by people day to day and in common media. Allow/deny listing would be jargon, because nobody outside a small circle uses it and thus unambiguously understands what it means.
It's technical jargon in different industries, but it's still jargon, i.e. words NOT self-explanatory by their normal definitions in mainstream use. Other examples of such terms: "variable", "class".
For the same reason, "allow-list" is not jargon, just like "component" or "extension".
To me there is one issue only: two syllables vs one (not a problem with block vs black for example but a problem with allow vs white) and that's about it.
> "allow-list" list is not jargon
Of course it is. If I tell someone to allow list a group of people for an event, that requires further explanation. It’s not self explanatory because it’s non-standard.
> just like "component" or "extension"
If you use them the way they are commonly used, yes. If you repurpose them into a neologism, no. (Most non-acronym jargon involves repurposing common words for a specific context. Glass cockpit. Repo. Server.)
If you tell your friend to put somebody on an allow-list and that requires further explanation, I think the problem is not the term but your friend, sorry...
Server, cockpit those are jargon. Allow and deny just aren't. Whatever.
I understand your point, but my argument is in the more generic aspect.
Consider how whoever complains about blacklist/whitelist would eventually complain about allow/deny and say they are non-inclusive. Where would this stop?
I would say that as long as the term is unequivocal (and not meant to be offensive) in the context, then there's no need to self-censor.
> would eventually
That's an empirical premise in a slippery slope style argument. Any evidence to back it up? Who is opposing the terms allow/deny and why? I don't see it.
> no need to self-censor
The terms allow/deny are more directly descriptive and less contested which I see as a clear win-win change, so I've shifted to use those terms. No biggie and I don't feel self-censored by doing so.
Do some people just mentally insert the word “people” after every occurrence of the words “black” or “white” they happen across in their daily lives?
And then decide whoever used them had malicious intent?
Is a zebra people with people stripes, or people with people stripes? :)
I doubt it, on both accounts. Neither is needed to prefer allow/deny list though. Malicious intent was not ascribed by the comment you're replying to.
Who cares?
Same people who care about “master” and “main” for hit branches.
Let's ignore those people too. A master branch is just fine, and should offend nobody who has a real life to live.
Sometimes I wonder how many lifetimes have been wasted by people all around the world fixing CI because a script expected a branch called master. All for absolutely pointless political correctness theatre.
Would those people care about the word "hit"?
I think it’s a typo, as I think the context is “git branches” unless you think that “hit branches” makes sense in context. I don’t think it does.
"Master branch" comes from the Latin for expert, authority. "Master record" also comes from that meaning.
Blacklist and whitelist come from black=bad and white=good which if you are black or have empathy is a red flag
[flagged]
[flagged]
Not to mention more descriptive. If you hear the term "allowlist" or "denylist" it is immediately obvious and self-explanatory, with no prior context needed.
Leaving aside any other reasons, they're just better names.
And while you're at it, start speaking Esperanto, use the metric system, and switch to a Dvorak keyboard.
No yes no. As for the post you replied to, allow/deny are indeed the more descriptive terms for lists that allow/deny access. Descriptive terms are good and useful.
Is Dvorak optimized for Esperanto?
Blacklist and whitelist are not antiquated. This is indeed woke, and not useful.
[flagged]
[flagged]
>There is no need to disagree on such strongly worded statements.
What's the bigoted history of those terms?
from here[0]:
"The English dramatist Philip Massinger used the phrase "black list" in his 1639 tragedy The Unnatural Combat.[2]
"After the restoration of the English monarchy brought Charles II of England to the throne in 1660, a list of regicides named those to be punished for the execution of his father.[3] The state papers of Charles II say "If any innocent soul be found in this black list, let him not be offended at me, but consider whether some mistaken principle or interest may not have misled him to vote".[4] In a 1676 history of the events leading up to the Restoration, James Heath (a supporter of Charles II) alleged that Parliament had passed an Act requiring the sale of estates, "And into this black list the Earl of Derby was now put, and other unfortunate Royalists".[5]"
Are you an enemy of Charles II? Is that what the problem is?
[0] https://en.wikipedia.org/wiki/Blacklisting#Origins_of_the_te...
[flagged]
The origin of the term 'black list' had absolutely nothing to do with the melanin content of anyone. In fact, when that term was coined, it had nothing to do with the melanin content of anyone. It was a list of the enemies of Charles II.
That's why I posted that. I'd also point out that in my lifetime, folks with darker skin have called themselves black, and proudly so, as Mr. Brown[0][1] will unambiguously tell you. Regardless, claiming that every use of a term for the property of absorbing visible light is bigoted is ridiculous on its face.
By your logic, if I wear black socks, I'm a bigot? Or am only a bigot if I actually refer to those socks as "black." Should I use "socks of color" so as not to be a bigot?
If I like that little black dress, I'm a bigot as well? Or only if I say "I like that little black dress?"
Look. I get it. Melanin content is worthless as a determinant of the value of a human. And anyone who thinks otherwise is sorely and sadly mistaken.
It's important to let folks know that there's only one race of sentient primates on this planet -- Homo Sapiens. What's more, we are all, no matter where we come from, incredibly closely related from a genetic standpoint.
The history of bigotry, murder and enslavement by and to our fellow humans is long, brutal and disgusting.
But nitpicking terms (like black list) that never had anything to do with that bigotry seems performative at best. As I mentioned above, do you also make such complaints about black socks or shoes? Black dresses? Black foregrounds/backgrounds?
If not, why not? That's not a rhetorical question.
[0] https://www.youtube.com/watch?v=oM1_tJ6a2Kw
[1] https://www.azlyrics.com/lyrics/jamesbrown/sayitloudimblacka...
[flagged]
I'll have the black pudding.
My cat has a black tail.
The top of my desk is black.
I have several pairs of black shoes.
Every single computer in my possession has a black case.
My phone and its case are both black.
Black Power![0][1][2]
I will put you on my personal blacklist.
Which I'm sure you won't mind since I'm a huge bigot, right?
[0] https://www.britannica.com/topic/Black-Power-Movement
[1] https://en.wikipedia.org/wiki/Black_power_movement
[2] https://www.oed.com/dictionary/black-power_n?tl=true
Not the person you talked to but I'll join in if I may.
I've switched to using allowlist/denylist in computer contexts because more descriptive and less semantically loaded or contested. Easy win-win.
Using 'black' to refer to the color of objects is fine by me.
'Black power!' as a political slogan self-chosen by groups identifying as black is fine too, in contexts where it is used as a tool in work against existing inequalities (various caveats could be added).
As for 'white/black' as terms for entities that are colorless but inherently valenced (e.g. the items designated white are positive and the items designated black are negative, such as risks or costs), I support switching to other terms when not very costly and when newer terms are descriptive and clear. Such as switching to allowlist/denylist in the context of computers.
As for import, I don't think it is a super important change and I don't think the change would make a huge difference in terms of reducing existing racially disproportional negative outcomes in opportunity, wealth, wellbeing and health. It is only a small terminology change that there's some good reason to accept and no good reason to oppose, so I'm on board.
[flagged]
If your shallow (and dismissive) comments along these lines weren't so, well, shallow and dismissive, I might be inclined to put a little more effort into it.
But they're not, so I didn't.
By all means, congratulate yourself for putting this bigoted "culture warrior" in their (obviously) well deserved corner of shame.
I'm not exactly sure how decrying bigotry while pointing out that demanding language unrelated to such bigotry be changed seems performative rather than useful or effective is a "childish culture war provocation."
Perhaps you might ask some folks who actually experience such bigotry how they feel about that. Are there any such folks in your social circle? I'm guessing not, as they'd likely be much more concerned with the actual violence, discrimination and hatred that's being heaped upon them, rather than inane calls for banning technical jargon completely unrelated to that violence and hatred.
It's completely performative and does exactly zero to address the violence and discrimination. Want to help? Demand that police stop assaulting and murdering people of color. Speak out about the completely unjustified hatred and discrimination our fellow humans are subjected to in housing, employment, education, full participation in political life, the criminal "justice" system and a raft of other issues.
But that's too much work for you, right? It's much easier to pay lip service and jump on anyone who doesn't toe the specific lines you set, despite those lines being performative, ineffective and broadly hypocritical.
Want to make a real difference? That's great! Whinging about blacklists vs. denylists in a network routing context isn't going to do that.
Rather it just points at you being a busybody trying to make yourself feel better at the expense of those actively being discriminated against.
And that's why I didn't engage on any reasonable level with you -- because you don't deserve it. For shame!
Or did I miss something important? I am, after all, quite simple minded.
Perhaps you could explain it to me?
> Or did I miss something important?
Pretty much.
The question you posed above, the question that piqued my interest that I responded to, was
> What's the bigoted history of those terms?
I barely hinted at the bigotry inherent in the creation of a black list by Charles II in response to the bigotry inherent in the execution of Charles I as I was curious as to where your interest lay.
Since then you've ignored the bigotry, ignored the black list in the time of Charles II, imagined and projected all manner of nonsense about my position, etc.
I suspect you're simply ignorant of the actual meaning of the word bigot in the time of Charles I & II, and it's hilarious seeing your overly performative accusations of others being performative.
> Want to help? Demand that police stop assaulting and murdering people of color.
I'm not sure how that has any bearing on the question of the bigotry aspect to the Charles II black list but if it makes you feel any better I was a witness against the police in a Black Deaths in Custody Royal Commission a good many years past.
For your interest:
As we're a long way down a tangential rabbit hole here, am I to assume it was yourself who just walked through flagging a run of comments that don't violate guidelines? Either way, curiosity and genuine exchanges go further than hyperbolic rhetoric.
More than half of my traffic is Bing, Claude, and for whatever reason the Facebook bots.
None of these are my main traffic drivers, just the main resource hogs. And they're the main reason when my site turns slow (usually an AI bot, Microsoft, or Facebook ignoring any common sense).
China and co. are only a very small portion of my malicious traffic, gladly. It's usually US companies who disrespect my robots.txt and DNS rate limits that cause me the most problems.
There are a lot of dumb questions, and I pose all of them to Claude. There's no infrastructure in place for this, but I would support some business model where my LLM of choice compensates website operators for the resources consumed by my super dumb questions. Like how content creators get paid when I watch with a YouTube Premium subscription. I doubt this is workable in practice.
For me it looks more like out-of-control bots than average requests. For example, a few days ago I blocked a few bots. Google was about 600 requests in 24 hours, Bing 1500, Facebook is mostly blocked right now, and Claude, with 3 different bot types, was about 100k requests in the same time.
There is no reason to query all my sub-sites; it's like a search engine with way too many theoretical pages.
Facebook also did aggressive, daily indexing of way too many pages, using large IP ranges, until I blocked it. I get like one user per week from them; no idea what they want.
And Bing, I learned, "simply" needs hard-enforced rate limits, which it kinda learns to agree to.
This is a problem.
There's a recent phishing campaign with sites hosted by Cloudflare and spam sent through either "noobtech.in" (103.173.40.0/24) or through "worldhost.group" (many, many networks).
"noobtech.in" has no web site, can't accept abuse complaints (their email has spam filters), and they don't respond at all to email asking them for better communication methods. The phishing domains have "mail.(phishing domain)" which resolves back to 103.173.40.0/24. Their upstream is a Russian network that doesn't respond to anything. It's 100% clear that this network is only used for phishing and spam.
It's trivial to block "noobtech.in".
"worldhost.group", though, is a huge hosting conglomerate that owns many, many hosting companies and many, many networks spread across many ASNs. They do not respond to any attempts to communicate with them, but since their web site redirects to "hosting.com", I've sent abuse complaints to them. "hosting.com" has autoresponders saying they'll get back to me, but so far not a single ticket has been answered with anything but the initial autoresponder.
It's really, really difficult to imagine how one would block them, and also difficult to imagine what kind of collateral impact that'd have.
These huge providers, Tencent included, get away with way too much. You can't communicate with them, they don't give the slightest shit about harmful, abusive and/or illegal behavior from their networks, and we have no easy way to simply block them.
I think we, collectively, need to start coming up with things we can do that would make their lives difficult enough for them to take notice. Should we have a public listing of all netblocks that belong to such companies and, as an example, we could choose to autorespond to all email from "worldhost.group" and redirect all web browsing from Tencent so we can tell people that their ISP is malicious?
I don't know what the solution is, but I'd love to feel a bit less like I have no recourse when it comes to these huge mega-corporations.
Could you drop a message to dom@ with more details and I'll get this stopped from the WHG side - and find out what happened. Thanks!
If you block them and they're legitimate, they'll surely find a way to actually start a dialogue. If that feels too harsh you could also start serving captchas and tarpits, but I'm unsure if it's worth actually bothering with.
Externally I use Cloudflare proxy and internally I put Crowdsec and Modsecurity CRS middlewares in front of Traefik.
After some fine-tuning and eliminating false positives, it is running smoothly. It logs all the temporarily banned and reported IPs (to Crowdsec) and sends them to a Discord channel. On average it blocks a few dozen different IPs each day.
From what I see, there are far more American IPs trying to access non-public resources and attempting to exploit CVEs than there are Chinese ones.
I don't really mind anyone scraping publicly accessible content and the rest is either gated by SSO or located in intranet.
For me personally there is no need to block a specific country, I think that trying to block exploit or flooding attempts is a better approach.
Crowdsec: the idea is tempting, but giving away all of the server's traffic to a for-profit is a huge liability.
You pass all traffic through Cloudflare. You do not pass any traffic to Crowdsec, you detect locally and only report blocked IPs. And with Modsecurity CRS you don't report anything to anyone but configuring and fine tuning is a bit harder.
The more egregious attempts are likely being blocked by Cloudflare WAF / similar.
I don't think they are really blocking anything unless you specifically enable it. But it gives some peace of mind knowing that I could probably enable it quickly if it becomes necessary.
Since I posted an article here about using zip bombs [0], I'm flooded with bots. I'm constantly monitoring and tweaking my abuse detector, but this particular bot mentioned in the article seemed to be pointing to an RSS reader, so I whitelisted it at first. But now that I've given it a second look, it's one of the most rampant bots on my blog.
[0]: https://news.ycombinator.com/item?id=43826798
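For anyone wondering what a hand-made "zip bomb" usually means in this context: a small gzip payload served with Content-Encoding: gzip that expands to something enormous on the client. A minimal sketch follows; the sizes and filename are arbitrary and nothing here is taken from the linked post.

```python
# Build a small-on-disk gzip file that decompresses to ~10 GiB of zeros.
# Served with "Content-Encoding: gzip", it makes the *client* burn memory
# and CPU trying to inflate the response.
import gzip

CHUNK = b"\0" * (1024 * 1024)                 # 1 MiB of zeros
with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
    for _ in range(10 * 1024):                # 10 GiB uncompressed total
        f.write(CHUNK)
```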
If I had a shady web crawling bot and I implemented a feature for it to avoid zip bombs, I would probably also test it by aggressively crawling a site that is known to protect itself with hand-made zip bombs.
also protect yourself from sucking up fake generated content. i know some folks here like to feed them all sorts of 'data'. fun stuff :D
One of the few manual deny-list entries that I have made was not for a Chinese company, but for the ASes of the U.S.A. subsidiary of a Chinese company. It just kept coming back again and again, quite rapidly, for a particular page that was 404. Not for any other related pages, mind. Not for the favicon, robots.txt, or even the enclosing pseudo-directory. Just that 1 page. Over and over.
The directory structure had changed, and the page is now 1 level lower in the tree, correctly hyperlinked long since, in various sitemaps long since, and long since discovered by genuine HTTP clients.
The URL? It now only exists in 1 place on the WWW according to Google. It was posted to Hacker News back in 2017.
(My educated guess is that I am suffering from the page-preloading fallout from repeated robotic scraping of old Hacker News stuff by said U.S.A. subsidiary.)
Rule number one: You do not talk about fight club.
Dark forest theory taking root.
Out of spite, I'd ignore their request to filter by IP (who knows what their intent is by saying that - maybe they're connecting from VPNs or Tor exit nodes to cause disruption, etc), and instead filter by matching on that content in the User-Agent and feed them a zip bomb.
I have the client send a custom header with every request, and block all other requests.
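A minimal sketch of that shared-secret-header approach as a Python WSGI middleware. The header name and secret below are placeholders; the real ones obviously shouldn't be published anywhere.

```python
# Reject any request that doesn't carry the expected custom header value.
def require_client_header(app, header="X-My-Client", secret="replace-with-long-random-value"):
    environ_key = "HTTP_" + header.upper().replace("-", "_")
    def middleware(environ, start_response):
        if environ.get(environ_key) != secret:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```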
I run an instance of a downloader tool and had lots of Chinese IPs mass-downloading YouTube videos with the most generic UA. I started with "just" blocking their ASNs, but they always came back with another one, until I decided to stop bothering and banned China entirely. I'm confused about why some Chinese ISPs have so many different ASNs, while most major internet providers here have exactly one.
I work for IPinfo. If you need IP Address CIDR blocks for any country or ASN, let me know. I have our data in front of me and can send it over via Github Gist. Thank you.
I’ve been having a heck of a time figuring out where some malicious traffic is coming from. Nobody has been able to give me a straight answer when I give them the ip: 127.1.5.12 Maybe you can help trace-a-route to them? I’d just love to know whois behind that IP. If nothing else, I could let them know to be standards compliant and implement rfc 3514.
the malicious traffic is coming from inside the house
Is there a list of Chinese ASNs that you can ban if you don't do much business there - e.g. all of China, Macau, and select Chinese clouds in SE Asia, Polynesia, and Africa? I think they've kept HK clean so far.
We solved a lot of our problems by blocking all Chinese ASNs. Admittedly, not the friendliest solution, but there were so many issues originating from Chinese clients that it was easier to just ban the entire country.
It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.
There's some weird ones you'd never think of that originate an inordinate amount of bad traffic. Like Seychelles. A tiny little island nation in the middle of the ocean inhabited by... bots apparently? Cyprus is another one.
Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent and the ASNs stretched well beyond PRC borders.
There is a Chinese player that has taken effective control of various internet-related entities in the Seychelles. Various ongoing court-cases currently.
So the Seychelles traffic is likely really disguised Chinese traffic.
I don't think these are "Chinese players"; it seems linked to [1], although it may be that hands changed so many times that the IP addresses have been leased or bought by Chinese entities.
[1] https://mybroadband.co.za/news/internet/350973-man-connected...
Interesting: https://techafricanews.com/2025/07/24/smart-africa-calls-for...
I forgot about that: all the nice game binaries from them running directly on nearly all systems...
Huh? Who is them in this case?
They're referring to the fact that Chinese game companies (Tencent, Riot through Tencent, etc.) all have executables of varying levels of suspicion (i.e. anti-cheat modules) running in the background on player computers.
Then they're making the claim that those binaries have botnet functionality.
They can exploit local privilege escalation flaws without "RCE".
And you are right, kernel anti-cheats are rumored to be weaponized by hackers, making the previous point even worse.
And when the kid is playing his/her game at home, if daddy or mummy is a person of interest, they are already on the home LAN...
Well, you get the picture: nowhere to run, orders of magnitude worse than it was before.
Nowadays, the only protection that administrator/root access rights give you is mitigating any user mistake that would break his/her system... sad...
This is all from Cloud Innovation (VPNs, proxies, spam, bots), the CN-linked Seychelles IP holder.
omg... that's why my self-hosted servers are getting nasty traffic from SC all the time.
The explanation is that easy??
> So the seychelles traffic is likely really disguised chinese traffic.
Soon: chineseplayer.io
If you IP-block all of China and then run a resolver, the logs will quickly fill with innocuous domains whose NS entries are blocked. Add those to a DNS block list, then add their ASNs to your company IP block list. It's amazing how the traffic you don't want plummets.
The Seychelles has a sweetheart tax deal with India such that a lot of corporations who have an India part and a non-India part will set up a Seychelles corp to funnel cash between the two entities. Through the magic of "Transfer Pricing"[1] they use this to reduce the amount of tax they need to pay.
It wouldn't surprise me if this is related somehow. Like maybe these are Indian corporations using a Seychelles offshore entity to do their scanning because then they can offset the costs against their tax or something. It may be that Cyprus has similar reasons. Istr that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]
So Seychelles may be India-related bots and Cyprus Russia-related bots.
[1] https://taxjustice.net/faq/what-is-transfer-pricing/#:~:text...
[2] Yup. My memory originated in the "Panama Papers" leaks https://www.icij.org/investigations/cyprus-confidential/cypr...
[flagged]
Ignore the trolls. Also, if they are upset with you, they should focus their vitriol on me. I block nearly all of BRICS, especially Brazil, as most are hard-wired to not follow even the simplest of rules, plus most data centers, some VPNs based on MSS, posting from cell phones, and much more. I am always happy to give people the satisfaction of down-voting me, since I use uBlock to hide karma.
In ublock -> my filters
Those are nice filters, I checked out your profile too!
Thank you. :)
People get weird when you do what you want with your own things.
Want to block an entire country from your site? Sure, it’s your site. Is it fair? Doesn’t matter.
> be a hero and die a martyr
I believe it's "an hero".
Oh thank you kind sir.
Uh, no, it's definitely not. Hero begins with a consonant, so it should be preceded by "a", not "an".
https://knowyourmeme.com/memes/an-hero
Welcome to British English. The h in hero isn’t pronounced, same as hospital, so you use an before it.
This is wrong.
Unless maybe you're from the east end of london.
I’m not claiming everyone pronounces it that way. But he’s an ero, we need to find an ospital, ninety miles an our. You will find government documents and serious newspapers that refer to an hospital.
Generic American English pronounces the 'h' in hospital, hero, heroine, but not hour.
Same is true for RP English.
Therefore, for both accents/dialects, the correct phrases are "a hotel", "a hero", "a heroine", and "an hour".
Cockney, West Country, and a few other English accents "h drop" and would use "an 'our", "an 'otel", etc.
> RP English
One might think RP English certainly doesn't determine correctness.
Now do historic. Suddenly all Brits turn into Cockneys.
Sure, and all Americans sound like they're from Ocracoke or Tangier.
Likewise, when I was at school, many of my older teachers would say things like "an hotel" although I've not heard anyone say anything but "a hotel" for decades now. I think I've heard "an hospital" relatively recently though.
Weirdly, in certain expressions I say "before mine eyes" even though that fell out of common usage centuries ago, and hasn't really appeared in literature for around a century. So while I wouldn't have encountered it in speech, I've come across enough literary references that it somehow still passed into my diction. I only ever use it for "eyes" though, never anything else starting with a vowel. I also wouldn't use it for something mundane like "My eyes are sore", but I'm not too clear on when or why I use the obsolete form at other times - it just happens!
That's not right. It's:
a hospital
an hour
a horse
It all comes down to how the word is pronounced, but it's not consistent. 'H' can sound like it's missing or not. Same with other leading consonants that need an 'an'. Some words can go both ways.
I was thinking of 'hotel'. Wrong building. Ooops.
Which site is it?
My own shitty personal website that is so uninteresting that I do not even wish to disclose here. Hence my lack of understanding of the down-votes for me doing what works for my OWN shitty website, well, server.
In fact, I bet it would choke on a small amount of traffic from here considering it has a shitty vCPU with 512 MB RAM.
Personal sites are definitely interesting, way more interesting than most of the rest of the web.
I was thinking I would put your site into archive.org, using ArchiveBot, with reasonable crawl delay, so that it is preserved if your hardware dies. Ask on the ArchiveTeam IRC if you want that to happen.
https://chat.hackint.org/?join=%23archiveteam-bs
It is a public git repository for the most part, that is the essence of my website, not really much writings besides READMEs, comments in code and commits.
A public git repository is even more interesting, for both ArchiveTeam Codearchiver, and Software Heritage. The latter offers an interface for saving code automatically.
https://wiki.archiveteam.org/index.php/Codearchiver https://wiki.archiveteam.org/index.php/Software_Heritage https://archive.softwareheritage.org/save/
After initial save, do they perform automatic git pulls? What happens if there are potential conflicts? I wonder how it all works behind the surface. I know I ran into issues with "git pull --all" before, for example. Or what if it is public software that is not mine? I saved some git repositories (should I do .tar.gz too for the same project? Does it know anything about versions?).
[flagged]
Thanks, appreciate it. I would hope so. I do not care about down-votes per se, my main complaint is really the fact that I am somehow in the wrong for doing what I deem is right for my shitty server(s).
ucloud ("based in HK") has been an issue (much less lately though), and I had to ban the whole digital ocean AS (US). google cloud, aws and microsoft have also some issues...
hostpapa in the US seems to become the new main issue (via what seems a 'ip colocation service'... yes, you read well).
It's not weird. It's companies putting themselves in places where regulations favor their business models.
It won't all be Chinese companies or people doing the scraping. It's well known that a lot of countries don't mind such traffic as long as it doesn't target themselves or, for the West, some allies.
Laws aren't the same everywhere, so companies can get away with behavior in one place that would seem almost criminal in another.
And what better place to put your scrapers than somewhere with no copyright.
Russia used to be the same, but around 2012 they changed their laws and a lot of that traffic dropped off. Companies moved to small islands or small nation states (favoring them with their tax payouts; they don't mind who brings money in) or to the few remaining places, like China, that don't care about copyright.
It's pretty hard to really get rid of such traffic. You can block things, but mostly that just changes the response your server gives; the flood is still knocking at the door.
I'd hope that someday ISPs or similar get more creative, but maybe they don't have enough access. It's hard to do this without fairly deep (creepy) visibility into the traffic, or without accidentally censoring the whole thing.
We solved a similar issue by blocking free-user traffic from data centres (and whitelisting crawlers for SEO). This eliminated most fraudulent usage over VPNs. Commercial users can still access, but free users just get a prompt to pay.
CloudFront is fairly good at marking whether someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate, and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligible level.
Why stop there? Just block all non-US IPs!
If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
And across the water, my wife has banned US IP addresses from her online shop once or twice. She runs a small business making products that don't travel well, and would cost a lot to ship to the US. It's a huge country with many people. Answering pointless queries, saying "No, I can't do that" in 50 different ways and eventually dealing with negative reviews from people you've never sold to and possibly even never talked to... Much easier to mass block. I call it network segmentation. She's also blocked all of Asia, Africa, Australia and half of Europe.
The blocks don't stay in place forever, just a few months.
Google Shopping might be to blame here, and I don't at all blame the response.
I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or another and treat me like a crazy person for asking.
As long as your customer base never travels and needs support, sure, I guess.
The only way of communicating with such companies are chargebacks through my bank (which always at least has a phone number reachable from abroad), so I’d make sure to account for these.
Chargebacks aren't the panacea you're used to outside the US, so that's a non-issue.
Only if your bank isn't competent in using them.
Visa/Mastercard chargeback rules largely apply worldwide (with some regional exceptions, but much less than many banks would make you believe).
No, outside the US, both Visa and Mastercard regularly side with the retailer/supplier. If you process a chargeback simply because a UK company blocks your IP, you will be denied.
Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.
I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.
> Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.
Yes, the issuing and acquiring banks perform an arbitration process, and it's generally a very fair process.
We disputed every chargeback and post PSD2 SCA, we won almost all and had a 90%+ net recovery rate. Similar US businesses were lucky to hit 10% and were terrified of chargeback limits.
> I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.
Are you sure? More likely, the vendor didn't dispute the successful chargebacks.
I think you might be talking about "fraudulent transaction/cardholder does not recognize" disputes. Yes, when using 3DS (which is now much more common at least in Europe, due to often being required by regulation in the EU/EEA), these are much less likely to be won by the issuer.
But "merchant does not let me cancel" isn't a fraud dispute (and in fact would probably be lost by the issuing bank if raised as such). Those "non-fraudulent disagreement with the merchant disputes" work very similarly in the US and in Europe.
No, you're just wrong here. Merchant doesn't let me cancel will almost always be won by the vendor when they demonstrate that they do allow cancellations within the bounds of the law and contracts. I've won many of these in the EU, too (we actually never lost a dispute for non-compliance with card network rules, because we were _very_ compliant).
I can only assume you are from the US and are assuming your experience will generalise, but it simply does not. Like night and day. Most EU residents who try using chargebacks for illegitimate dispute resolution learn these lessons quickly, as there are far more card cancellations for "friendly fraud" than merchant account closures for excessive chargebacks in the EU - the polar opposite of the US.
You’re assuming wrong.
And have you won one of these cases in a scenario where the merchant website has a blanket IP ban? That seems very different from cardholders incapable of clicking an “unsubscribe” button they have access to.
One of the requirements of Visa/Mastercard is for the customer to be able to contact the merchant post-purchase.
Only via the original method of commerce. An online retailer who geoblocks users does not have to open the geoblock for users who move into the geoblocked regions.
I have first-hand experience, as I ran a company that geoblocked US users for legal reasons and successfully defended chargebacks by users who made transactions in the EU and disputed them from the US.
Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.
> Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.
What's true is that in the US, the cardholder can often just say "I've never heard of that merchant", since 3DS is not really a thing, and generally merchants are relatively unlikely to have compelling evidence to the contrary.
But for all non-fraud disputes, they follow the same process.
As commented elsewhere, you're just wrong. It's a significant burden of proof for a cardholder to win a dispute for non-compliance with card network rules and it very rarely happens (outside of actual merchant fraud, which is much rarer in the EU).
Again, you're not aware of the reality outside the US.
> It's a significant burden of proof for a cardholder to win a dispute for non-compliance with card network rules
That's true, but "fraud" and "compliance" aren't the only dispute categories, not by far.
In this case, using Mastercard as an example (as their dispute rules are public [1]), the dispute category would be "Refund not processed".
The corresponding section explicitly lists this as a valid reason: "The merchant has not responded to the return or the cancellation of goods or services."
> Again, you're not aware of the reality outside the US.
Repeating your incorrect assumption doesn't make it true.
[1] https://www.mastercard.us/content/dam/public/mastercardcom/n...
Okay, so you're grasping at straws here, because:
a) a Refund Not Processed chargeback is for non-compliance with card network rules,
and b) "When the merchant informed the cardholder of its refund policy at the time of purchase, the cardholder must abide by that policy."
We won these every time, because we had a lawful and compliant refund policy and we stuck to it. These are a complete non-issue for vendors outside the US, unless they are genuinely fraudulent.
Honestly, I think you have no experience with card processors outside the US (or maybe at all) and you just can't admit you're wrong, but anyone with experience would tell you how wrong you are in a heartbeat. The idea you can "defeat" geoblocks with chargebacks is much more likely to result in you losing access to credit than a refund.
Are you even trying to see things from a different perspective, or are you just dead set on winning an argument via ad hominems based on incorrect assumptions about my background?
It's quite possible that both of our experiences are real – at least I'm not trying to cast doubt on yours – but my suspicion is that the generalization you're drawing from yours (i.e. chargeback rules, or at least their practical interpretation, being very different between the US and other countries) isn't accurate.
Both in and outside the US, merchants can and do win chargebacks, but a merchant being completely unresponsive to cancellation requests of future services not yet provided (i.e. not of "buyer's remorse" for a service that's not available to them, per terms and conditions) seems like an easy win for the issuer.
"Visiting the website" is the method. It's nonsense to say that visiting from a different location is a different method. I don't care if you won those disputes, you did a bad thing and screwed over your customers.
> "Visiting the website" is the method. It's nonsense to say that visiting from a different location is a different method.
This is a naive view of the internet that does not stand the test of legislative reality. It's perfectly reasonable (and in our case was only path to compliance) to limit access to certain geographic locations.
> I don't care if you won those disputes, you did a bad thing and screwed over your customers.
In our case, our customers were trying to commit friendly fraud by requesting a chargeback because they didn't like a geoblock, which is also what the GP was suggesting.
Using chargebacks this way is nearly unique to the US and thankfully EU banks will deny such frivolous claims.
The ancestor post was about being unable to get support for a product, so I thought you were talking about the same situation. Refusal to support is a legitimate grievance.
Are you saying they tried a chargeback just because they were annoyed at being unable to reach your website? Something doesn't add up here, or am I giving those customers too much credit?
Were you selling them an ongoing website-based service? Then the fair thing would usually be a prorated refund when they change country. A chargeback is bad but keeping all their money while only doing half your job is also bad.
If you read back in the thread, we're talking about the claim that adding geoblocking will result in chargebacks, which outside the US, it won't.
> Are you saying they tried a chargeback just because they were annoyed at being unable to reach your website?
In our case it was friendly fraud when users tried to use a service which we could not provide in the US (and many other countries due to compliance reasons) and had signed up in the EU, possibly via VPN.
What was inaccessible to them: The service itself, or any means to contact the merchant to cancel an ongoing subscription?
I can imagine a merchant to win a chargeback if a customer e.g. signs up for a service using a VPN that isn't actually usable over the same VPN and then wants money for their first month back.
But if cancellation of future charges is also not possible, I'd consider that an instance of a merchant not being responsive to attempts at cancellation, similar to them simply not picking up the phone or responding to emails.
Usually CC companies require email records (another way of communicating with a company) showing you attempted to resolve the problem but could not. I don’t think “I tried to visit the website that I bought X item from while in Africa and couldn’t get to it” is sufficient.
At that point, I wonder if an online shop is even necessary. Just sell in-person.
I'm not precisely sure what point you're trying to make.
In my experience running rather low-traffic sites (thousands of hits a day), doing just that brought every single annoyance from thousands per day to zero.
Yes, people -can- easily get around it via various listed methods, but don't seem to actually do that unless you're a high value target.
It definitely works, since you’re externalizing your annoyance onto people you literally won’t ever hear from, because you blanket banned them. Most of them will just think your site is broken.
This isn't coming from nowhere though. China and Russia don't just randomly happen to have been assigned more bad actors online.
Due to frosty diplomatic relations, there is a deliberate policy to do fuck all to enforce complaints when they come from the west, and at least with Russia, this is used as a means of gray zone cyberwarfare.
China and Russia are being antisocial neighbors. Just like in real life, this does have ramifications for how you are treated.
It seems to be a choice they’re making with their eyes open. If folks running a storefront don’t want to associate with you, it’s not personal in that context. It’s business.
In other words, a smart business practice.
> Why stop there? Just block all non-US IPs!
This is a perfectly good solution to many problems, if you are absolutely certain there is no conceivable way your service will be used from some regions.
> Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
Not a problem. Bad actors motivated enough to use VPNs or botnets are a different class of attack with different types of solutions. If you eliminate 95% of your problems with a single IP filter, then there's no good argument to make against it.
This. If someone wants to target you, they will target you. What this does is remove the noise and 90%+ of crap.
Basically the same thing as changing the ssh port on a public facing server, reduce the automated crap attacks.
> if you are absolutely certain there is no conceivable way your service will be used from some regions.
This isn’t the bar you need to clear.
It’s “if you’re comfortable with people in some regions not being able to use your service.”
Won't help: I get scans and script-kiddy hack attempts from Digital Ocean, Microsoft cloud (Azure, stretchoid.com), Google Cloud, AWS, and lately "hostpapa" via its 'IP colocation service'. Of course it's an instant fail2ban (it is not that hard to perform a basic email delivery to an existing account...).
Traffic should be "privatized" as much as possible between IPv6 addresses (because you still have 'scanners' doing the whole internet all the time... "the nice guys" scanning the whole internet for your protection... never to sell any scan data, of course).
Public IP services are done for: it's going to be hell whatever you do.
The right answer seems to be significantly big 'security and availability teams' with open and super simple internet standards. Yep, the javascript internet has to go away, and the app-private protocols have to as well. No more WHATWG-cartel web engine, or the worst: closed network protocols for "apps".
And the most important: hardcore protocol simplicity, while still doing a good enough job. It is common sense, but the planned obsolescence and kludgy bloat lovers won't let you...
Don't care, works fine for us.
Worked great for us, but I had to turn it off. Why? Because the IP databases from the two services I was using are not accurate enough, and some people in the US were being blocked as if they had a foreign IP address. It happened regularly enough that I reluctantly had to turn it off, and now I have to deal with the non-stop hacking attempts on the website.
For the record, my website is a front end for a local-only business. Absolutely no reason for anyone outside the US to participate.
And that's perfectly fine. Nothing is completely bulletproof anyway. If you manage to get rid of 90% of the problem then that's a good thing.
Okay, but this causes me about 90% of my major annoyances. Seriously. It’s almost always these stupid country restrictions.
I was in the UK. I wanted to buy a movie ticket there. Fuck me, because I have an Austrian IP address, because modern mobile backends pass your traffic through your home mobile operator. So I tried to use a VPN. Fuck me, VPN endpoints are blocked also.
I wanted to buy a Belgian train ticket still from home. Cloudflare fuck me, because I’m too suspicious as a foreigner. It broke their whole API access, which was used by their site.
I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too. And of course my bank card… and I just wanted to order a pizza.
The most annoying is when your fucking app is restricted to your stupid country, and I should use it because your app is a public transport app. Lovely.
And of course, there was that time when I moved to another country… pointless country restrictions everywhere… they really helped.
I remember the times when the saying was that the checkout process should be as frictionless as possible. That sentiment is long gone.
The vpn is probably your problem there mate.
I don’t use a VPN generally, only in specific cases. For example, when I want to reach Australian news, because of course, as a non-Australian, I couldn't possibly care about local news. Or when American pages would rather ban Europe than tell me who they sell my data to.
They tried a VPN as a backup for one of those problems.
So no. It's not.
> I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too.
Your mobile provider was routing you through Austria while in the US?
Not OP, but as far as I know that's how it works, yeah.
When I was in China, using a Chinese SIM had half the internet inaccessible (because China). As I was flying out I swapped my SIM back to my North American one... and even within China I had fully unrestricted (though expensive) access to the entire internet.
I looked into it at the time (now that I had access to non-Chinese internet sites!) and forgot the technical details, but seems that this was how the mobile network works by design. Your provider is responsible for your traffic.
Yes, newer backends for 4G and 5G networks work exactly that way.
Even 2G and 3G data roaming used to work that way.
If anything, the opposite behavior (i.e. getting a local or regional IP instead of one from your home network) is a relatively new development.
And if your competitor manages to do so without annoying the part of their customer base that occasionally leaves the country, everybody wins!
Fair point, that's something to consider.
You think all streaming services have banned non US IPs? What world do you live in?
This is based on personal experience. At least two did not let me unsubscribe from abroad in the past.
Not letting you unsubscribe and blocking your IP are very different things.
There are some that do not provide services in most countries but Netflix, Disney, paramount are pretty much global operations.
HBO and peacock might not be available in Europe but I am guessing they are in Canada.
I think a lot of services end up sending you to a sort of generic "not in your country yet!" landing page in an awkward way that can make it hard to "just" get to your account page to do this kind of stuff.
Netflix doesn't have this issue but I've seen services that seem to make it tough. Though sometimes that's just a phone call away.
Though OTOH whining about this and knowing about VPNs and then complaining about the theoretical non-VPN-knower-but-having-subscriptions-to-cancel-and-is-allergic-to-phone-calls-or-calling-their-bank persona... like sure they exist but are we talking about any significant number of people here?
In Europe we have all of them, with only few movies unavailable or additionally paid occasionally. Netflix, Disney, HBO, Prime and others work fine.
Funny to see how narrow a perspective some people have…
Obligatory side note of "Europe is not a country".
In several European countries, there is no HBO since Sky has some kind of exclusive contract for their content there, and that's where I was accordingly unable to unsubscribe from a US HBO plan.
> Not letting you unsubscribe and blocking your IP are very different things.
When you posted this, what did you envision in your head for how they were prevented from unsubscribing, based on location, but not via IP blocking? I'm really curious.
> Not letting you unsubscribe and blocking your IP are very different things.
How so? They did not let me unsubscribe via blocking my IP.
Instead of being able to access at least my account (if not the streaming service itself, which I get – copyright and all), I'd just see a full screen notice along the lines of "we are not available in your market, stay tuned".
> Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
They usually don't bother. Plus it's easier to take action against malicious traffic within your own country or general jurisdiction.
Oddly, my bank has no problem with non-US IPs, but my City's municipal payments site doesn't. I always think it's broken for a moment before realizing I have my VPN turned on.
The percentage of US trips abroad which are to China must be minuscule, and I bet nobody in the US regularly uses a VPN to get a Chinese IP address. So blocking Chinese IP addresses is probably going to have a small impact on US customers. Blocking all abroad IP addresses, on the other hand, would impact people who just travel abroad or use VPNs. Not sure what your point is or why you're comparing these two things.
If you are traveling without a vpn then you are asking for trouble
Yes, and I’m arguing that that’s due to companies engaging in silly pseudo-security. I wish that would stop.
It is not silly pseudo-security, it is economics. Ban Chinese, lower your costs while not losing any revenue. It is capitalism working as intended.
Not sure I'd call dumping externalities on a minority of your customer base without recourse "capitalism working as intended".
Capitalism is a means to an end, and allowable business practices are a two-way street between corporations and consumers, mediated by regulatory bodies and consumer protection agencies, at least in most functioning democracies.
Maybe, but it doesn't change the fact, that no one is going to forbid me to ban IPs. Therefore I will ban IPs and IPs ranges because it is the cheapest solution.
Sure, you can keep blocking IPs, and I'll keep arguing for a ban on IP country bans (at least for existing customers) :)
If you don't see that your campaign is futile and want to waste you time, just go ahead, don't ask for my permission.
Moving a cost outside the business and then calling it improved margin is exactly what MBA school teaches and the market rewards.
so you just raw dog hotel and conference wifi?
Block Russia too, thats where i see most of my bot traffic coming from
And usually hackers/malicious actors from that country are not afraid to attack anyone that is not Russian, because their local law permits attacking targets in other countries.
(It sometimes comes to funny situations where malware doesn't enable itself on Windows machines if it detects that a Russian-language keyboard is installed.)
Lately I've been thinking that the only viable long-term solution are allowlists instead of blocklists.
The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection are a losing strategy as bot behavior becomes more human-like.
Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.
I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.
This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.
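To make the gatekeeper idea concrete, here is a minimal sketch of what the public-facing piece could look like (not the poster's actual implementation). It assumes tokens are HMAC-signed timestamps issued out of band, that an nftables set named "allowed" with timeout support already exists, and that the redirect target is a placeholder.

    # Minimal gatekeeper sketch: validate a short-lived token, then open the
    # firewall for the client's IP and redirect it to the real service.
    # Assumes: nft add set inet filter allowed '{ type ipv4_addr; flags timeout; }'
    import hashlib, hmac, subprocess, time
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    SECRET = b"change-me"     # shared with whatever issues tokens out of band
    TOKEN_TTL = 300           # seconds a token stays valid

    def token_ok(token: str) -> bool:
        try:
            ts, sig = token.split(".", 1)
            age = time.time() - int(ts)
        except ValueError:
            return False
        if age < 0 or age > TOKEN_TTL:
            return False
        expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected)

    class Gatekeeper(BaseHTTPRequestHandler):
        def do_GET(self):
            token = (parse_qs(urlparse(self.path).query).get("token") or [""])[0]
            ip = self.client_address[0]
            if token_ok(token):
                # allow this client at the network level, auto-expiring after an hour
                subprocess.run(["nft", "add", "element", "inet", "filter", "allowed",
                                f"{{ {ip} timeout 1h }}"], check=False)
                self.send_response(302)
                self.send_header("Location", "https://example.org/")  # the gated service
                self.end_headers()
            else:
                self.send_response(403)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), Gatekeeper).serve_forever()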
I guess this is what "Identity aware proxy" from GCP can do for you? Outsource all of this to google - where you can connect your own identity servers, and then your service will only be accessed after the identity has been verified.
We have been using that instead of VPN and it has been incredibly nice and performant.
Yeah, I suppose it's something like that. Except that my solution wouldn't rely on Google, would be open source and self-hostable. Are you aware of a similar project that does this? Would save me some time and effort. :)
There also might be similar solutions for other cloud providers or some Kubernetes-adjacent abomination, but I specifically want something generic and standalone.
https://github.com/topics/identity-aware-proxy
It all started with an inverted killfile...
Lmao I came here to post this. My personal server was making constant hdd grinding noises before I banned the entire nation of China. I only use this server for jellyfin and datahoarding so this was all just logs constantly rolling over from failed ssh auth attempts (PSA: always use public-key, don't allow root, and don't use really obvious usernames like "webadmin" or <literally just the domain>).
Changing the SSH port also helps cut down the noise, as part of a layered strategy.
Are you familiar with port knocking? My servers will only open port 22, or some other port, after two specific ports have been knocked on in order. It completely eliminates the log files getting clogged.
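For anyone curious what the client side of that looks like, a rough sketch (port numbers and hostname made up; the server-side knock daemon and firewall rules are not shown):

    # Touch two "secret" ports in order, then connect to SSH as usual.
    import socket, time

    HOST = "server.example.org"
    KNOCK_PORTS = [7000, 9000]   # the two ports the firewall watches for, in order

    def knock(host, port, timeout=0.5):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))   # usually refused or timed out; the SYN is the knock
        except OSError:
            pass
        finally:
            s.close()

    for p in KNOCK_PORTS:
        knock(HOST, p)
        time.sleep(0.2)               # give the knock daemon a moment to register it
    # after this, "ssh server.example.org" should be open for this source IP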
I've used that solution in the past. What happens when the bots start port knocking?
The bots have been port scanning me for decades. They just don't know which two ports to hit to open 22 for their IP address. Simply iterating won't get them there, and fail2ban doesn't afford them much opportunity to probe.
Fail2ban :)
Did you really notice a significant drop off in connection attempts? I tried this some years ago and after a few hours on a random very high port number I was already seeing connections.
I use a non standard port and have not had an unknown IP hit it in over 25 years. It's not a security feature for me, I use that to avoid noise.
My public SFTP servers are still on port 22, but they block a lot of SSH bots by giving them a long "VersionAddendum" in /etc/ssh/sshd_config, as most of them choke on it. Mine is 720 characters long. Older SSH clients also choke on this, so test it first if going this route. Some botters will go out of their way to block me instead so their bots don't hang. One will still see the bots in their logs, but there will be far fewer messages and far fewer attempts to log in, as they will be broken, sticky and confused. Be sure to add offensive words in VersionAddendum for the sites that log SSH banners and display them on their web pages, like shodan.io.
In my experience you can cut out the vast majority of SSH connection attempts by just blocking a couple of IPs... particularly if you've already disabled password auth, because some of the smarter bots notice that and stop trying.
Most of the traffic comes from China and Singapore, so I banned both. I might have to re-check and ban other regions who would never even visit my stupid website anyway. The ones who want to are free to, through VPN. I have not banned them yet.
I have my jellyfin and obsidian couchdb sync on my Tailscale and they don’t see any public traffic.
Naive question: why isn't there a publicly accessible central repository of bad IPs and domains, stewarded by the industry, operated by a nonprofit, like W3C? Yes it wouldn't be enough by itself ("bad" is a very subjective term) but it could be a popular well-maintained baseline.
There are many of these, and they are always outdated.
Another issue is that things like cloud hosting will happily overlap their ranges with legit business ranges, so if you go that route you will inadvertently also block legitimate things. Not that a regular person cares too much about that, but an abuse list should be accurate.
https://xkcd.com/927/
For what it's worth, I'm also guilty of this, even if I made my site to replace one that died.
Is it fair game to just return fire and sink the crawlers? A few thousand connections via a few dozen residential proxies might do it.
I've seen blocks like that for e.g. alibaba cloud. It's sad indeed, but it can be really difficult to handle aggressive scrapers.
"I'm seriously thinking that the CCP encourage this with maybe the hope of externalizing the cost of the Great Firewall to the rest of the world. If China scrapes content, that's fine as far as the CCP goes; If it's blocked, that's fine by the CCP too (I say, as I adjust my tin foil hat)."
Then turn the tables on them and make the Great Firewall do your job! Just choose a random snippet about illegal Chinese occupation of Tibet or human rights abuses of Uyghur people each time you generate a page and insert it as a breaker between paragraphs. This should get you blocked in no time :)
I just tried this: I took some strings about Falun Gong and Tiananmen from the Chinese Wikipedia and put them into my SSH server banner. The connection attempts from the Tencent AS ceased completely, but now they come from Russia, Lithuania and Iran instead.
Whoa, that's fascinating. So their botnet runs in multiple regions and will auto-switch if one has problems. Makes sense. Seems a bit strange to use China as the primary, though. Unless of course the attacker is based in China? Of the countries you mentioned Lithuania seems a much better choice. They have excellent pipes to EU and North America, and there's no firewall to deal with
FAFO from both sides. Not defending this bot at all. That said, the shenanigans some rogue or clueless webmasters get up to, blocking legitimate M2M traffic that is neither intrusive nor load-causing, are driving some projects into the arms of 'scrape services' that use far less considerate or ethical means to get to the data you pay them for.
IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.
Exactly. If someone can harm your website on accident, they can absolutely harm it on purpose.
If you feel like you need to do anything at all, I would suggest treating it like any other denial-of-service vulnerability: Fix your server or your application. I can handle 100k clients on a single box, which equates to north of 8 billion daily impressions, and so I am happy to ignore bots and identify them offline in a way that doesn't reveal my methodologies any further than I absolutely have to.
> IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.
That's traffic I want to block, and that's behaviour that I want to punish / discourage. If a set of users get caught up in that, even when they've just been given recycled IP addresses, then there's more chance to bring the shitty 'scraping as a service' behaviour to light, thus to hopefully disinfect it.
(opinion coming from someone definitely NOT hosting public information that must be accessible by the common populace - that's an issue requiring more nuance, but luckily has public funding behind it to develop nuanced solutions - and can just block China and Russia if it's serving a common populace outside of China and Russia).
Trust me, there's nothing 'nuanced' about the contractor that won the website management contract for the next 6-12 months by being the cheapest bidder for it.
What? Are you trying to say it's legitimate to want to scrape websites that are actively blocking you because you think you are "not intrusive"? And that this justifies paying for bad actors to do it for you?
I can't believe the entitlement.
No. I'm talking about literally legitimate information that has to be public by law and/or regulation (typically gov stuff), in formats specifically meant for M2M consumption, and still blocked by clueless or malicious outsourced lowest-bidder site managers.
And no, I do not use those paid services, even though it would make it much easier.
These IP addresses being released at some point, and making their way into something else is probably the reason I never got to fully run my mailserver from my basement. These companies are just massively giving IP addresses a bad reputation, messing them up for any other use and then abandoning them. I wonder what this would look like when plotted: AI (and other toxic crawling) companies slowly consuming the IPv4 address space? Ideally we'd forced them into some corner of the IPv6 space I guess. I mean robots.txt seems not to be of any help here.
I've mentioned my project[0] before, and it's just as sledgehammer-subtle as this bot asks.
I have a firewall that logs every incoming connection to every port. If I get a connection to a port that has nothing behind it, then I consider the IP address that sent the connection to be malicious, and I block the IP address from connecting to any actual service ports.
This works for me, but I run very few things to serve very few people, so there's minimal collateral damage when 'overblocking' happens - the most common thing is that I lock myself out of my VPN (lolfacepalm).
I occasionally look at the database of IP addresses and do some pivot tabling to find the most common networks and have identified a number of cough security companies that do incessant scanning of the IPv4 internet among other networks that give me the wrong vibes.
[0]: Uninvited Activity: https://github.com/UninvitedActivity/UninvitedActivity
P.S. If there aren't any Chinese or Russian IP addresses / networks in my lists, then I probably block them outright prior to the logging.
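A stripped-down sketch of the same approach, in case anyone wants to try it without pulling in the whole project. It assumes a netfilter LOG rule tagged "UNUSED-PORT: " writing to /var/log/kern.log and an existing nftables set "inet filter blocked"; those names are mine, not from the repo above.

    # Tail the firewall log for hits on ports with nothing behind them and
    # block the offending source IPs outright.
    import re, subprocess, time

    LOG = "/var/log/kern.log"
    SRC_RE = re.compile(r"UNUSED-PORT: .*SRC=(\d+\.\d+\.\d+\.\d+)")
    seen = set()

    def block(ip):
        subprocess.run(["nft", "add", "element", "inet", "filter", "blocked",
                        f"{{ {ip} }}"], check=False)

    with open(LOG) as f:
        f.seek(0, 2)                  # start at the end of the file and tail it
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            m = SRC_RE.search(line)
            if m and m.group(1) not in seen:
                seen.add(m.group(1))
                block(m.group(1))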
> Alex Schroeder's Butlerian Jihad
That's Frank Herbert's Butlerian Jihad.
To be fair, he was referring to a post on Alex Schroeder's blog titled with the same name as the term from the Dune books. And that post correctly credits Dune/Herbert. But the post is not about Dune, it's about Spam bots so it's more related to what the original author's post is about.
Speaking of the Butlerian Jihad, Frank Herbert's son (Brian) and another author named Kevin J Anderson co-wrote a few books in the Dune universe, and one of them was about the Butlerian Jihad. I read it. It was good, not as good as Frank Herbert's books, but I still enjoyed it. One of the authors is not as good as the other, because you can kind of tell the writing quality changing from chapter to chapter.
https://en.wikipedia.org/wiki/Dune:_The_Butlerian_Jihad
That's really hard to believe. Brian Herbert's stuff seems sort of Fan Fiction In The World of Dune. Nothing wrong with fan fiction: The Last Ringbearer etc. are pretty enjoyable. But BH just follows on. His work has a bit of the feeling of people who lived in the ruins of the roman forum https://x.com/museiincomune/status/1799039086906474572
Those books completely misrepresent Frank Herbert's original ideas for Butlerian Jihad. It wasn't supposed to be a literal war against genocidal robots.
Interesting to think that the answer to banning thinking computers in Dune was basically to indoctrinate kids from birth (mentats) and/or doing large quantities of drugs (guild navigators).
I feel like people seem to forget that an HTTP request is, after all, a request. When you serve a webpage to a client, you are consenting to that interaction with a voluntary response.
You can blunt instrument 403 geoblock entire countries if you want, or any user agent, or any netblock or ASN. It’s entirely up to you and it’s your own server and nobody will be legitimately mad at you.
You can rate limit IPs to x responses per day or per hour or per week, whatever you like.
This whole AI scraper panic is so incredibly overblown.
I’m currently working on a sniffer that tracks all inbound TCP connections and UDP/ICMP traffic and can trigger firewall rule addition/removal based on traffic attributes (such as firewalling or rate limiting all traffic from certain ASNs or countries) without actually having to be a reverse proxy in the HTTP flow. That way your in-kernel tables don’t need to be huge and they can just dynamically be adjusted from userspace in response to actual observed traffic.
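Not their tool, obviously, but for a sense of the shape of such a thing: a sketch that counts inbound SYNs per source with scapy and pushes an nftables set element once a source crosses a threshold. Interface, threshold and set names are placeholders.

    from collections import Counter
    import subprocess
    from scapy.all import sniff, IP   # pip install scapy; sniffing needs root

    SYN_LIMIT = 200                   # SYNs from one source before we react
    counts = Counter()
    blocked = set()

    def handle(pkt):
        src = pkt[IP].src
        counts[src] += 1
        if counts[src] > SYN_LIMIT and src not in blocked:
            blocked.add(src)
            # assumes the set was created beforehand:
            #   nft add set inet filter ratelimited '{ type ipv4_addr; }'
            subprocess.run(["nft", "add", "element", "inet", "filter", "ratelimited",
                            f"{{ {src} }}"], check=False)

    # new inbound IPv4 TCP connections only (SYN without ACK); keep nothing in memory
    sniff(iface="eth0", store=False, prn=handle,
          filter="tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0")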
> This whole AI scraper panic is so incredibly overblown.
The problem is that it's eating into people's costs. And if you're not concerned with money, I'm just asking: can you send me $50.00 USD?
If people don’t want to spend the money serving the requests, then their own servers are misconfigured because responding is optional.
It sure is. Now the problem is I'd like to respond to legitimate users; I don't care for bots, but if there are idle resources, why not.
Care to share how I can make that happen given scrapers are hellbent on ignoring any rules / agreements on how to conduct themselves?
So, that is a no on the fifty?
When AI can now register and break captchas on your site to log in, how do I compete with this arms race of defeating my protection from AI?
You can block traffic by AS effectively. In my case, I have seen a large number of crawlers from Tencent and Alibaba.
I signed up for the free AS database from IP2Location LITE and then blocked the ranges from those ASNs.
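For anyone wanting to replicate this, a sketch of turning that database into a prefix list. I'm assuming the LITE ASN CSV has columns along the lines of ip_from, ip_to, cidr, asn, as_name; check the layout of the file you actually download before relying on this.

    import csv

    BLOCK_ASNS = {"132203"}           # Tencent, per this thread; add others as needed

    with open("IP2LOCATION-LITE-ASN.CSV", newline="") as f, \
         open("blocked-prefixes.txt", "w") as out:
        for ip_from, ip_to, cidr, asn, as_name in csv.reader(f):
            if asn in BLOCK_ASNS:
                out.write(cidr + "\n")
    # blocked-prefixes.txt can then be fed into an nftables set or ipset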
Unfortunately, HN itself is occasionally used for publicising crawling services that rely on underhand techniques that don't seem terribly different to the ones here.
I don't know if it's because they operate in the service of capital rather than China, as here, but use of those methods in the former case seems to get more of a pass here.
Any respectable web scale crawler(/scraper) should have reverse DNS so that it can automatically be blocked.
Though it would seem all bets are off and anyone will scrape anything. Now we're left with middlemen like cloudflare that cost people millions of hours of time ticking boxes to prove they're human beings.
> A further check showed that all the network blocks are owned by one organization—Tencent. I'm seriously thinking that the CCP encourage this with maybe the hope of externalizing the cost of the Great Firewall to the rest of the world.
A simple check against the IP addresses 170.106.176.0, 150.109.96.0, 129.226.160.0, 49.51.166.0 and 43.135.0.0 showed that they are allocated to Tencent Cloud, a Google Cloud-like rental service.
I'm using their product personally; it's really cheap, a little more than $12 to $20 a year for a VPS, and it's from one of the top Internet companies.
Sure, this can't completely rule out the possibility that Tencent is behind all of this, but I don't really think the Communist Party needs to attack your website through Tencent; it's simply not logical.
More likely it's just some company that rented some servers on Tencent crawling the Internet. The rest is probably just your xenophobia-fueled paranoia.
In this day and age of crabs barreling over one another, simple gestures such as honoring _robots.txt_ are just completely ignored.
This is everything I have for AS132203 (Tencent). It has your addresses plus others I have found and confirmed using ipinfo.io
43.131.0.0/18 43.129.32.0/20 101.32.0.0/20 101.32.102.0/23 101.32.104.0/21 101.32.112.0/23 101.32.112.0/24 101.32.114.0/23 101.32.116.0/23 101.32.118.0/23 101.32.120.0/23 101.32.122.0/23 101.32.124.0/23 101.32.126.0/23 101.32.128.0/23 101.32.130.0/23 101.32.13.0/24 101.32.132.0/22 101.32.132.0/24 101.32.136.0/21 101.32.140.0/24 101.32.144.0/20 101.32.160.0/20 101.32.16.0/20 101.32.17.0/24 101.32.176.0/20 101.32.192.0/20 101.32.208.0/20 101.32.224.0/22 101.32.228.0/22 101.32.232.0/22 101.32.236.0/23 101.32.238.0/23 101.32.240.0/20 101.32.32.0/20 101.32.48.0/20 101.32.64.0/20 101.32.78.0/23 101.32.80.0/20 101.32.84.0/24 101.32.85.0/24 101.32.86.0/24 101.32.87.0/24 101.32.88.0/24 101.32.89.0/24 101.32.90.0/24 101.32.91.0/24 101.32.94.0/23 101.32.96.0/20 101.33.0.0/23 101.33.100.0/22 101.33.10.0/23 101.33.10.0/24 101.33.104.0/21 101.33.11.0/24 101.33.112.0/22 101.33.116.0/22 101.33.120.0/21 101.33.128.0/22 101.33.132.0/22 101.33.136.0/22 101.33.140.0/22 101.33.14.0/24 101.33.144.0/22 101.33.148.0/22 101.33.15.0/24 101.33.152.0/22 101.33.156.0/22 101.33.160.0/22 101.33.164.0/22 101.33.168.0/22 101.33.17.0/24 101.33.172.0/22 101.33.176.0/22 101.33.180.0/22 101.33.18.0/23 101.33.184.0/22 101.33.188.0/22 101.33.24.0/24 101.33.25.0/24 101.33.26.0/23 101.33.30.0/23 101.33.32.0/21 101.33.40.0/24 101.33.4.0/23 101.33.41.0/24 101.33.42.0/23 101.33.44.0/22 101.33.48.0/22 101.33.52.0/22 101.33.56.0/22 101.33.60.0/22 101.33.64.0/19 101.33.64.0/23 101.33.96.0/22 103.52.216.0/22 103.52.216.0/23 103.52.218.0/23 103.7.28.0/24 103.7.29.0/24 103.7.30.0/24 103.7.31.0/24 43.130.0.0/18 43.130.64.0/18 43.130.128.0/19 43.130.160.0/19 43.132.192.0/18 43.133.64.0/19 43.134.128.0/18 43.135.0.0/18 43.135.64.0/18 43.135.192.0/19 43.153.0.0/18 43.153.192.0/18 43.154.64.0/18 43.154.128.0/18 43.154.192.0/18 43.155.0.0/18 43.155.128.0/18 43.156.192.0/18 43.157.0.0/18 43.157.64.0/18 43.157.128.0/18 43.159.128.0/19 43.163.64.0/18 43.164.192.0/18 43.165.128.0/18 43.166.128.0/18 43.166.224.0/19 49.51.132.0/23 49.51.140.0/23 49.51.166.0/23 119.28.64.0/19 119.28.128.0/20 129.226.160.0/19 150.109.32.0/19 150.109.96.0/19 170.106.32.0/19 170.106.176.0/20
For anyone wondering how to do this (like me from a month or two back).
Here's a useful tool/site:
https://bgp.tools
You can feed it an ip address to get an AS ("Autonomous System"), then ask it for all prefixes associated with that AS.
I fed it that first IP address from that list (43.131.0.0) and it showed me the same Tencent-owned AS132203, and it gives back all the prefixes they have here:
https://bgp.tools/as/132203#prefixes
(Looks like roguebloodrage might have missed at least the 1.12.x.x and 1.201.x.x prefixes?)
I started searching about how to do that after reading a RachelByTheBay post where she wrote:
Enough bad behavior from a host -> filter the host.
Enough bad hosts in a netblock -> filter the netblock.
Enough bad netblocks in an AS -> filter the AS. Think of it as an "AS death penalty", if you like.
(from the last part of https://rachelbythebay.com/w/2025/06/29/feedback/ )
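If you'd rather script the prefix lookup than use a web UI, RIPEstat has a public data API with an announced-prefixes endpoint that does roughly the same thing (a different data source than bgp.tools; I'm going from memory on the response shape, so verify the JSON you get back):

    import json, urllib.request

    ASN = "AS132203"   # Tencent, per the list above
    url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    for entry in data["data"]["prefixes"]:
        print(entry["prefix"])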
This is what I've used to find ASs to block: https://hackertarget.com/as-ip-lookup/
eg. Chuck 'Tencent' into the text box and execute.
I add reactively. I figure there are "legitimate" IPs that companies use, and I only look at IP addresses that are 'vandalizing' my servers with inappropriate scans and block them.
If I saw the two you have identified, then they would have been added. I do strike a balance between "might be a game CDN" or a "legit server" and an outright VPS that is being used to abuse other servers.
But thanks, I will keep an eye on those two ranges.
FWIW, I looked through my list of ~8000 IP addresses; there aren't as many hits for these ranges as I would have thought. It's possible that they're more focused on using known DNS names than simply connecting to 80/443 on random IPs.
Edit: I also checked my Apache logs, I couldn't find any recent logs for "thinkbot".
For the Thinkbot problem mentioned in the article, it's less maintenance work to simply block on the User Agent string.
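If you don't have a reverse proxy in front to do it, one way this can look at the application layer is a tiny WSGI wrapper; the substring list here is just the one bot from the article:

    # Return 403 for any request whose User Agent contains a blocked substring
    # (matched case-insensitively).
    BLOCKED_UA_SUBSTRINGS = ("thinkbot",)

    def block_by_user_agent(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return app(environ, start_response)
        return middleware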
Yep, good tip! For people who do this: be sure to make it case insensitive and only match a few distinct parts, nothing too specific. Especially if you only expect browsers, this can mitigate a lot.
You can also filter by allowlisting, but that risks allowing the wrong thing, since headers are easy to set, so it's better to do it via blocking (sadly).
I think there is an opportunity to train a neural network on browser user agents (they are catalogued, but they vary and change a lot). Then you can block everything that doesn't match.
It would work better than regexes. A lot of these companies rely on "but we are clearly recognizable", via for example these user agents, as an excuse to put the burden on sysadmins to maintain blocklists instead of the other way round (keeping a list of scrapables...).
Maybe someone mathy can unburden them?
You could also look at who asks for nonexistent resources, and block anyone who asks for more than X (large enough that a config issue or similar doesn't kill regular clients). The block might last just a minute, so there isn't much risk when a false positive occurs; it will likely be enough to make the scraper turn away.
There are many things to do depending on context, app complexity, load, etc. The problem is there's no really easy way to do these things.
ML should be able to help a lot in such a space.
What exactly do you want to train on a falsifiable piece of info? We do something like this at https://visitorquery.com in order to detect HTTP proxies and VPNs but the UA is very unreliable. I guess you could detect based on multiple pieces with UA being one of them where one UA must have x, y, z or where x cannot be found on one UA. Most of the info is generated tho.
Aren't many apartment buildings all coming from just a few IP addresses?
https://en.wikipedia.org/wiki/Carrier-grade_NAT
Yes, and this makes ip banning have false positives.
But ultimately it's worth it, you are responsible for your neighbours.
> [Y]ou are responsible for [how] your neighbours [use the Internet].
Nope.
I'm very much not responsible for snooping on my neighbor's private communications. If anyone is responsible for doing any sort of abuse monitoring, it is the ISP chosen by my neighbor.
If your CGNAT IP gets blocked then you are responsible for not complaining to your ISP that they're still doing CGNAT and that someone is being abusive within their network.
This is not a normative social prescription, but a descriptive natural phenomenon.
If there's a neighbour who is running a bitcoin farm in your residential building, it's going to cause issues for you. If people from your country commit crimes in other countries and violate visas, then you are going to face a quota because of them. If you bank at ACME Bank, and then it turns out they were arms traffickers, your funds were pooled and helped launder their money; you are responsible by association.
Reputation is not only individual, but there is group reputation, regardless of whether you like it or not.
> If there's a neighbour...
What ass-backwards jurisdiction do you live in where any of the things you mention in this paragraph are true, let alone the notion that uninvolved bystanders would be responsible for the behavior of others?
I feel like one annoying solution to bots would be putting your pages behind a simple account creation & logon screen.
Maybe it could be for your archive files or something.
Still a hassle but if 95% of your blog requires a login to view that would decrease the load quite a bit, right?
I've written a decent number of malicious crawlers in my time.
Be happy they gave you a user agent.
Wouldn't it be better, if there's an easy way, to just feed such bots shit data instead of blocking them? I know it's easier to block and saves compute and bandwidth, but perhaps feeding them shit data at scale would be a much better longer-term solution.
I recommend you use gzip_static and serve a zip-bomb instead. Frees up the connection sooner and probably causes bad crawlers to exhaust their resources.
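Roughly, the payload for that can be built like this: a gzip file that is a few megabytes on the wire but inflates to ~10 GB of zeros, dropped next to the page you want to protect so gzip_static picks it up (numbers are arbitrary; check the gzip_static docs for the serving side):

    import gzip

    CHUNK = b"\0" * (1024 * 1024)           # 1 MiB of zeros
    with gzip.open("bomb.html.gz", "wb", compresslevel=9) as f:
        for _ in range(10 * 1024):          # ~10 GiB uncompressed
            f.write(CHUNK)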
https://zadzmo.org/code/nepenthes/
No, serving shit data costs bandwidth and possibly compute time.
Blocking IPs is much cheaper for the blocker.
Zip bomb?
Doesn’t that tie up a socket on the server similarly to how a keepalive would on the bot user end?
I don't think so. The payload size of the bytes on the wire is small. This premise is all dependent on the .zip being crawled synchronously by the same thread/job making the request.
What if bots catch on to zip bombs, and just download them really slowly?
https://en.wikipedia.org/wiki/Zeno%27s_paradoxes#Dichotomy_p...
Why not just block the User Agent?
Because it's the single most falsifiable piece of information you would find on ANY "how to scrape for dummies" article out there. They all start with changing your UA.
Sure, but the article is about a bot that expressly identifies itself in the user agent and its user agent name contains a sentence suggesting you block its ip if you don’t like it. Since it uses at least 74 ips, blocking its user agent seems like a fine idea.
I think the UA is easily spoofed, whereas the AS and IP are less easily spoofed. You have everything you need already to spoof the UA, while you will need resources to spoof your IP, whether it's wall-clock time to set it up, CPU time to insert another network hop, and/or peers or other third parties to route your traffic, and so on. The User Agent is a variable that you can easily change, with no real effort, expense, or third parties required.
Bots often rotate the UA too, their entire goal is to get through and scrape as much content as possible, using any means possible.
because you have to parse the http request to do that, while blocking the IP can be done at the firewall
I know opinions are divided on what I am about to mention, but what about CAPTCHA to filter bots? Yes, I am well aware we're a decade past a lot of CAPTCHA being broken by _algorithms_, but I believe it is still a relatively useful general solution, technically -- question is, would we want to filter non-humans, effectively? I am myself on the fence about this, big fan of what HTTP allows us to do, and I mean specifically computer-to-computer (automation/bots/etc) HTTP clients. But with the geopolitical landscape of today, where Internet has become a tug of war (sometimes literally), maybe Butlerian Jihad was onto something? China and Russia are blatantly and near-openly shoving their fingers in every hole they can find, and if this is normalized so will Europe and U.S., for countermeasure (at least one could imagine it being the case). One could also allow bots -- clients unable to solve CAPTCHA -- access to very simplified, distilled and _reduced_ content, to give them the minimal goodwill to "index" and "crawl" for ostensibly "good" purposes.
I don’t understand why people want to block bots, especially from a major player like Tencent, while at the same time doing everything they can to be indexed by Google
I think banning IPs is a treadmill you never really get off of. Between cloud providers, VPNs, CGNAT, and botnets, you spend more time whack-a-moling than actually stopping abuse. What’s worked better for me is tarpitting or just confusing the hell out of scrapers so they waste their own resources.
There’s a great talk on this: Defense by numbers: Making Problems for Script Kiddies and Scanner Monkeys https://www.youtube.com/watch?v=H9Kxas65f7A
What I’d really love to see - but probably never will—is companies joining forces to share data or support open projects like Common Crawl. That would raise the floor for everyone. But, you know… capitalism, so instead we all reinvent the wheel in our own silos.
If you can automate the treadmill and set a timeout at which point the 'bad' IPs will go back to being 'not necessarily bad', then you're minimising the effort required.
An open project that classifies and records this - would need a fair bit of on-going protection, ironically.
As someone who uses VPNs all the time, these comments make me sad. Blocking by IP is not the solution.
Is there a way to reverse look up IPs by company? Like a list of all IPs owned by Alphabet, Meta, Bing, etc.?
https://hackertarget.com/as-ip-lookup/
Chuck 'Tencent' into the text box and execute.
Oh, I recognise those IP addresses… they gave us quite a headache a while ago.
If ipv6 ever becomes a thing, it'll make blocking all that much harder.
No, it's really the same thing with just different (and more structured) prefix lengths. In IPv4 you usually block a single /32 address first, then a /24 block, etc. In IPv6 you start with a single /128 address, a single LAN is /64, an entire site is usually /56 (residential) or /48 (company), etc.
Note that for the sake of blocking internet clients, there's no point blocking a /128. Just start at /64. Blocking a /128 is basically useless because of SLAAC.
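For the log-processing side, Python's ipaddress module makes it trivial to collapse an observed address to its covering /64 or /48 before blocking:

    import ipaddress

    addr = ipaddress.ip_address("2001:db8:abcd:12:aaaa:bbbb:cccc:dddd")
    print(ipaddress.ip_network(f"{addr}/64", strict=False))   # 2001:db8:abcd:12::/64
    print(ipaddress.ip_network(f"{addr}/48", strict=False))   # 2001:db8:abcd::/48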
Some cloud providers only give out /128 so it's fair to start blocking just a /128 at first.
Hmmm... that isn't my experience:
/128: single application
/64: single computer
/56: entire building
/48: entire (digital) neighborhood
A /64 is the smallest network on which you can run SLAAC, so almost all VLANs should use this. /56 and /48 for end users is what RIRs are recommending; in reality the prefixes are longer, because ISPs and hosting providers want you to pay as if IPv6 space were some scarce resource.
[1]: https://www.ripe.net/publications/docs/ripe-690/
Everyone at my isp is issued a /56 (and as far as I can tell, the entire country is this way).
For ipv6 you just start nuking /64s and /48s if they're really rowdy.
> Here's how it identifies itself: “Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.)”.
I mean you could just ban the user agent?
The real issue is with bots pretending not to be bots.
This one was funny because I checked a day’s log and it was using at least 15 different IPs. Much easier to just ban or rate limit “Thinkbot”
this is related: https://www.pcworld.com/article/2845330/hundreds-of-chrome-e...
The TL;DR is that there are malicious browser plugins that make the browser into a web scraping bot.
I see this all the time in web server logs; it is recognizable as a GET on a deep link coming from some random IP, usually residential.
This seems like a plot to redirect residential Chinese traffic through VPNs, which are supposedly mostly operated by only a few entities with a stomach for dishonest maneuvering and surveillance.
If a US IP is abusing the internet, you can go through the courts. If a foreign country is... good luck.
So, are hackers and internet shittery coming from China? Block China's ASNs. Too bad ISPs won't do that, so you have to do it yourself. Keep it blocked until China enforces computer fraud and abuse.
fiefdom internet 1.0 release party - information near your fingertips
What about IPv6?
It's a simple translation error. They really meant "Feed me worthless synthetic shit at the highest rate you feel comfortable with. It's also OK to tarpit me."
Is there an argument to be made that circumventing bans by changing ip addresses is illegal? Businesses have right of refusal!
I would never have considered this, but someone on HN pointed out that web user agents work like this. Servers send ads and there is no way for them to enforce that browsers render the ads and hide the content or whatever. The user agent is supposed to act for the user. "Your business model is not my problem", etc.
Well, my user agents work for me, not for you - the server guy who is complaining about this and that. "Your business model is not my problem". Block me if you don't want me.
https://news.ycombinator.com/item?id=44975697
Well done on pointing out exactly what everyone here is saying in the most arrogant way possible. Also, well done on linking to your own comment where people explain this to you.
The problem is that there is no way to "block me if you don't want me". That's the entire issue. The methods these scrapers use mean it's nigh on impossible to block them.
So far it's actually not. Though it is getting harder.
I suspect we'll get integrity attestation or tokens before it becomes an insurmountable problem to block bots.
"Your inability to engineer is not my problem".
See: https://news.ycombinator.com/item?id=45018660
We block China and Russia. DDoS attacks and other hack attempts went down by 95%.
We have no Chinese users/customers, so in theory this does not affect business at all. Also, Russia is sanctioned and our Russian userbase does not actually live in Russia, so blocking Russia did not affect users at all.
How did you choose where to get the IP addresses to block? I guess I'm mostly asking where this problem (i.e. "get all IPs for country X") is on the scale from "obviously solved" to "hard and you need to play catch up constantly".
I did a quick search and found a few databases but none of them looks like the obvious winner.
I used CYMRU <https://www.team-cymru.com/ip-asn-mapping> to do the mapping for the post.
Maxmind's GeoIP database is the industry standard, I believe. You can download a free version of it.
If your site is behind cloudflare, blocking/challenging by country is a built-in feature.
MaxMind is very common, IPInfo is also good. https://ipinfo.io/developers/database-download
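For a self-hosted setup, the lookup itself is only a few lines. A sketch assuming MaxMind's free GeoLite2 Country database and their geoip2 Python package (the database path and blocked-country list are placeholders, and whether to fail open or closed on unknown IPs is your call):

```python
# Sketch: country-level blocking with a local GeoLite2 Country database
# (pip install geoip2). Path and country list are illustrative only.
import geoip2.database
import geoip2.errors

BLOCKED_COUNTRIES = {"CN", "RU"}
reader = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")

def is_blocked(ip: str) -> bool:
    try:
        country = reader.country(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False  # unknown IP: failing open here, adjust to taste
    return country in BLOCKED_COUNTRIES

print(is_blocked("203.0.113.7"))
```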
If you want to test your IP blocks, we have servers in both China and Russia; we can try to take a screenshot from there to see what we get (free, no signup): https://testlocal.ly/
The common cloud platforms allow you to do geo-blocking.
Same here. It sucks. But it's just cost vs reward at some point.
Which will ensure you never get customers from those countries. And so the circle closes ...
The regulatory burden of conducting business with countries like Russia or China is a critical factor that offhand comments like yours consistently overlook.
It is funny how people immediately jump to conclusions, while I was merely pointing out a circular argument. People immediately think I am jumping in to "aid one side". Shows much more about them than me, actually.
Nobody is jumping to any conclusions except you with this very comment. ;)
My conclusion is: people downvote, but why, when I am merely stating that the reasoning is circular? Are they unable to grasp this simple fact? Or are they reading more into it than there is? What is more likely? I choose to believe the second one.
You realize we can't see the scores on other people's posts, right? We have no idea whether or how many downvotes you received. You're talking to people who have no interest in doing business with these countries, so I don't understand how this could be circular reasoning.
Your offhand comment also doesn't make sense in the context of this subthread. The effort companies have to invest to do business with Russia and China is prohibitively high, and that's a completely valid concern. It's not that everyone universally hates or loves these countries. It's simply impractical for most businesses to navigate those markets.
I used to run a forum that blocked IPs from outside Western Europe. I had no interest in users from beyond. It's not all about money.
We don't plan to target China or mainland Russia. Both are run by oppressive regimes, and that's not something we want to be part of.
> All of the "blockchain is only for drug dealing and scams" people will sooner or later realize that it is the exact type of scenarios that makes it imperative to keep developing trustless systems.
This is like saying “All the “sugar-sweetened beverages are bad for you” people will sooner or later realize it is imperative to drink liquids”. It is perfectly congruent to believe trustless systems are important and that the way the blockchain works is more harmful than positive.
Additionally, the claim is that cryptocurrencies are used like that. Blockchains by themselves have a different set of issues and criticisms.
Tell that to the "web3 is doing great" crowd.
I've met and worked with many people who never shilled a coin in their whole life and were treated as criminals for merely proposing any type of application on Ethereum.
I got tired of people yelling online about how "we are burning the planet" and who refused to understand that proof of stake made energy consumption negligible.
To this day, I have my Mastodon instance on some extreme blocklist because "admin is a crypto shill" and their main evidence was some discussion I was having to use ENS as an alternative to webfinger so that people could own their identity without relying on domain providers.
The goalposts keep moving. The critics will keep finding reasons and workarounds. Lots of useful idiots will keep doubling down on the idea that some holy government will show up and enact perfect regulation, even though it's the institutions themselves who are the most corrupt and taking away their freedoms.
The open, anonymous web is on the verge of extinction. We can no longer keep ignoring externalities. We will need to start designing our systems in a way where everyone will need to either pay or have some form of social proof for accessing remote services. And while this does not require any type of block chains or cryptocurrency, we certainly will need to start showing some respect to all the people who were working on them and have learned a thing or two about these problems.
> and who refused to understand that proof of stake made energy consumption negligible.
Proof of stake brought with it its own set of flaws and failed to solve many of the ones which already existed.
> To this day, I have my Mastodon instance on some extreme blocklist because (…)
Maybe. Or maybe you misinterpreted the reason? I don’t know, I only have your side of the story, so won’t comment either way.
> The goalposts keep moving. The critics will keep finding reasons and workarounds.
As will proponents. Perhaps if initial criticisms had been taken seriously and addressed in a timely manner, there wouldn’t have been reason to thoroughly dismiss the whole field. Or perhaps it would’ve played out exactly the same. None of us know.
> even though it's the institutions themselves who are the most corrupt and taking away their freedoms.
Curious that what is probably the most corrupt administration in the history of the USA, the one actively taking away their citizens’ freedoms as we speak, is the one embracing cryptocurrency to the max. And remember all the times the “immutable” blockchains were reverted because it was convenient to those with the biggest stakes in them? They’re far from impervious to corruption.
> And while this does not require any type of block chains or cryptocurrency, we certainly will need to start showing some respect to all the people who were working on them and have learned a thing or two about these problems.
Er, no. For one, the vast majority of blockchain applications were indeed grifts. It’s unfortunate for the minority who had good intentions, but it is what it is. For another, they didn’t invent the concept of trustless systems and cryptography. The biggest lesson we learned from blockchains is how bad of a solution they are. I don’t feel the need to thank anyone for grabbing an idea, doing it badly, wasting tons of resources while ignoring the needs of the world, using it to scam others, then doubling down on it when presented with the facts of its failings.
> Curious that what is probably the most corrupt administration in the history of the USA, the one actively taking away their citizens’ freedoms as we speak, is the one embracing cryptocurrency to the max.
Your memory is quite selective. El Salvador has been pushing for Bitcoin way before that, so we already have our share of banana republics (which is what the US is becoming) promoting cryptocurrencies.
Second, the US is "embracing" Bitcoin by backing it up and enabling the creation of off-chain financial instruments. It is a complete corruption and the complete opposite of "trustless systems".
Third, the corruption of the government and their interest in cryptocurrency are orthogonal: the UK is passing bizarre laws to control social media, and the EU is pushing for backdoors in messaging systems every other year. None of these institutions are acting with the interests of their citizens at heart, and the more explicit this becomes, the more we will need to have systems that can let us operate trustlessly.
> For another, they didn’t invent the concept of trustless systems and cryptography.
But they are the ones who are actually working and developing practical applications. They are the ones doing actual engineering and dealing with real challenges and solving the problems that people are now facing, such as "how the hell do we deny access to bad actors on the open global internet who have endless resources and have nothing to lose by breaking social norms"?
That read like a bizarre tangent, because it didn’t at all address the argument. To make it clear I’ll repeat the crux of my point, the conclusion that the other arguments lead up to, which you skipped entirely in your reply:
> They’re far from impervious to corruption.
That’s it. That’s the point. You brought up corruption, and I pointed out blockchains don’t actually prevent that. Which you seem to agree with, so I don’t get your response at all.
> But they are the ones who are actually working and developing practical applications.
No, they are not. If no one wants to use them because of all the things they do wrong, they are not practical.
> They are the ones doing actual engineering and dealing with real challenges and solving the problems that people are now facing
No, they are not. They aren’t solving real problems and that is exactly the problem. They are being used almost exclusively for grifts, scams, and hoarding.
> such as "how the hell do we deny access to bad actors on the open global internet who have endless resources and have nothing to lose by breaking social norms"?
That is not a problem blockchains solve. At all.
> You brought up corruption, and I pointed out blockchains don’t actually prevent that.
No. Let's not talk past each other. My point is not about "preventing corruption". My point is that citizens cannot rely on the current web as a system that works in their favor. My point is that corporations and governments both are using the current web to take away our freedoms, and that we will need systems that do not require trust and/or functional institutions to enforce the rules.
> They are being used almost exclusively for grifts, scams, and hoarding.
"If by whiskey" arguments are really annoying. I am talking about the people doing research in trustless systems. Zero-knowledge proofs. Anonymous transactions. Fraud-proof advertisement impressions.
Scammers, grifters have always existed. Money laundering always existed. And they still happen far more often in the "current" web. There will always be bad actors in any large scale system. My argument is not about "preventing corruption", but to have a system where good actors can act independently even if corruption is prevalent.
> That is not a problem blockchains solve.
Go ahead and try to build a system that keeps access to online resources available to everyone while ensuring that it is cheap for good actors and expensive for bad ones. If you don't want to have any type of blockchain, you will either have to create a whitelist-first network or you will have to rely on an all-powerful entity with policing powers.
Is your trigger word "Ethereum"? He's not even talking about trading crypto or anything you could remotely consider scammy; he's talking about a blockchain-based naming system. You're freaking out over nothing, go home man...
Geoblocking China and Russia should be the default.