RicoElectrico 14 days ago

Nice try, media company employee ;)


  • PTOB 14 days ago

    My sentiments exactly.

fxtentacle 14 days ago

Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.

  • panopticon 14 days ago

    > Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

    I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?

    • phoenixreader 14 days ago

      Exactly. It also would not be difficult for website operators to embed hidden user info in their served pages, thereby finding out the archive.is account. This approach seems risky for archive.is.

    • hda111 14 days ago

      They could just copy the div with the content over, to evade detection by the website's owner.

  • tivert 14 days ago

    > Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

    I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.

    • Stagnant 14 days ago

      I don't think anyone knows who runs archive.is. I've tried looking into it a couple of times in the past, but there is surprisingly little information to be found. It must cost thousands if not tens of thousands a month to host all that data, and AFAIK they do not monetize it in any way.

      From what I gather it is probably some Russian person, as some old Stack Overflow conversations regarding the site led to an empty GitHub account with a Russian name. Also, back in 2015 the site owner blocked all Finnish IP addresses due to "an incident at the border"[1]. Finnish IPs have since been unblocked. It appears the site owner somehow thought he could end up on an EU-wide blacklist, which seemed like very conspiratorial thinking on his part.

      1: https://archive.is/Pum1p

      • killingtime74 14 days ago

        When I visit, each page has three ads: left, right, and bottom. Maybe you have an ad blocker?

  • Hamuko 14 days ago

    Would it be possible to check if archive.is is logged into a newspaper site by archiving one of the user management pages?

  • hoofhearted 14 days ago

    Negative. I used to assume this as well, but they somehow also bypass local paywalls, which has gotten me temporarily banned from r/Baltimore lol.

    They can somehow even bypass the Baltimore Sun's paywall, and I doubt they have subscriptions to every regional paper... could they?

    • jrochkind1 13 days ago

      Wait, you got banned from /r/Baltimore for posting archive.is links there? That's against the rules there? I would not have known that myself! (Also a Baltimorean).

      • hoofhearted 13 days ago

        I even tried to convince them to be in the mindset that Paul Graham created Hacker News to get more mindshare on YC. He gave the idea of Reddit to the 3 brilliant Ivy League founders who applied to YC with a basic Gmail extension, I think, that copied emails or something.

        So I tried convincing them that if it’s okay here on PG’s creation, then it should be okay on his other creation.

      • hoofhearted 13 days ago

        Yeah! Hahah

        I thought knowledge was free, and the Baltimore Sun sucked anyways. They charge money and don't even write hood stuff anymore. They laid off a bunch of people and moved printing to Delaware. My bet is the next step is that they announce they're shutting down all Locust Point operations and selling out so that Kevin Plank can build some new buildings there.

        I think I had to appeal my ban with a mod, and they mentioned how it's posted all over by the auto bot that sharing links to websites that bypass paywalls is against their subreddit rules :(

        I even made an official proposal to r/Baltimore to reconsider and lift that rule. The general consensus on the poll was that people felt the Baltimore Sun and its writers should be getting paid for their work, and I shouldn't be bypassing their paywalls lol.

        • jrochkind1 13 days ago

          You did it ONCE and got banned?

          I still can't find anything in the subreddit rules that clearly says this. (Not that most people read the rules first). Why don't they just add it to the rules?

          This is one of the things I dislike most about Reddit: it seems to be common to ban people for a single violation of a poorly documented or unstated rule.

          My main problem with reading the Sun online is it has so much adware that my browser slows to a crawl and sometimes crashes when I try to read it!

    • dev_0 14 days ago


  • flerovium 14 days ago

    But is it true? What evidence is there?

    This is a plausible explanation but is it true?

throwaway81523 14 days ago

Off topic, but for years I've been using a one-off proxy to strip JavaScript and crap from my local newspaper site (sfgate.com). It just reads the site with Python urllib.request and then does some DOM cleanup with Beautiful Soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.

Just in the past day or so, sfgate.com put in some kind of anti scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with selenium or maybe I'll just give up on newspapers and get my news from HN ;).

I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.

  • 1ark 14 days ago

    They are probably just checking headers such as user agent and cookies. I would copy whatever your normal browser sends and put it in the urllib.request. If that doesn't work, then it is likely something more sophisticated.
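    A minimal sketch of that suggestion, with made-up header values copied from a desktop Firefox (the actual headers your browser sends may differ):

```python
import urllib.request

# Hypothetical header set copied from a normal desktop browser; the exact
# values only matter insofar as they match what a real browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

req = urllib.request.Request("https://www.sfgate.com/", headers=BROWSER_HEADERS)
# html = urllib.request.urlopen(req).read()  # uncomment to actually fetch
print(req.get_header("User-agent"))
```

    If the fetch still returns 403 with browser headers, the block is likely more sophisticated than header checks (e.g., a JavaScript challenge).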

    • throwaway81523 14 days ago

      I will try that, but a quick look at the error page makes me think it tries to run a javascript blob.

      • ksala_ 14 days ago

        They're just checking the user agent:

            $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: curl/7.54.1' | head -1
            HTTP/2 403
            $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' | head -1
            HTTP/2 200

        One "trick" is that Firefox (and I assume Chrome?) allows you to copy a request as curl. Then you can see if that works in the terminal, and if it does, you can binary-search for the required headers.
      • chrisco255 14 days ago

        It probably does. But there are better modern tools like headless Chrome / Puppeteer that can fully render a page with scripts.

  • withinboredom 14 days ago

    Sounds like an ADA lawsuit waiting to happen. I'd send the editor an email explaining how they've reduced the usability of the site, especially if you're a paying customer.

World177 14 days ago

I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
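The guessed approach could be sketched like this; the robots.txt content below is a made-up example, and whether archive.is actually does anything like this is unknown:

```python
import re

# Toy robots.txt for illustration only.
robots_txt = """\
User-agent: Googlebot
Disallow: /search

User-agent: bingbot
Disallow: /

User-agent: *
Disallow: /admin
"""

# Collect every user-agent token the site names; each one (except "*")
# is a candidate User-Agent header to try when requesting the article.
candidates = [ua for ua in re.findall(r"(?im)^user-agent:\s*(\S+)", robots_txt)
              if ua != "*"]
print(candidates)  # ['Googlebot', 'bingbot']
```

Each candidate would then be sent as the User-Agent header in a normal request, keeping whichever one returns the full article.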

  • jrochkind1 13 days ago

    That user-agent seems to be in the robots.txt as _disallowed_, but somehow it gets through the paywall? That seems counter-intuitive.

    • World177 13 days ago

      It's just blocking the root. Look up the specifications for the robots.txt for more information. One purpose is to reduce loads on parts of the website that they do not want indexed.

      • jrochkind1 9 days ago

        Definitely incorrect: the paths in the robots.txt are prefixes, so `/` means anything starting with `/`, that is, everything. Look up the specifications for the robots.txt for more information! (Or, for instance, look up how you'd block the whole site in robots.txt if you wanted to!)
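        The prefix behavior is easy to check with Python's stdlib robots.txt parser (the bot name here is made up):

```python
from urllib import robotparser

# "Disallow: /" is a prefix rule, so it matches every path on the site.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: SomeBot",
    "Disallow: /",
])

print(rp.can_fetch("SomeBot", "https://example.com/"))           # False
print(rp.can_fetch("SomeBot", "https://example.com/any/page"))   # False
```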

      • KomoD 11 days ago

        No, / means the entire site: the root and everything below it.

  • flerovium 13 days ago

    That's an interesting idea, but is it true?

    • World177 13 days ago

      Websites usually want their pages indexed by search engines, as it increases the traffic they receive. They also often try to allow archival usage. The robots.txt usually has the user agents used by search engines defined, since one of its purposes is to reduce load on the website by excluding pages that do not need to be indexed.

      It might not be what is happening as there are other ways around, but this is a real possibility for how it could be done. (at least until the websites allowing other user agents decide they want to try to stop archive.is usage, etc)

      edit: I think the probability is high that they have multiple methods for archiving a website. Many people in this post note that archive.is has previously stated they just convert the link to an AMP link and archive that. I'm doubtful that's all they do, but it could be part of it.

      Using the robots.txt file in this way might not be how the authors of the website intended for it to be used. I could see that being used against them in a legal system if someone ever tried to stop them. In the past, I've seen websites tell people creating bots to purposefully change their user agent to one the site defined, but using an agent for a non-allowed purpose is what I was describing. There are multiple ways they could be archiving a website, though, so this is not necessarily how it is being done.

chrisco255 14 days ago

Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS, and image files (it shows the individual requests completing before the archival process is complete). It must then snapshot the rendered output, after which it serves those assets from their own domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.

  • strunz 14 days ago

    You're missing the point of "how does it bypass paywalls".

    • hoofhearted 13 days ago

      Surprisingly, nobody has mentioned this here yet. I'm thinking the key to this is SEO, SERPs, and newspapers wanting Google to find and index their content.

      This is my best guess. I've really put some thought into it, and it's the best logical assumption I've arrived at. I used to be a master of crawlers using Selenium years ago, but that burned me out a little, so I moved on.

      To test my hypothesis, go and find any article on Google that you know is probably paywalled. You click the content Google shows you, navigate into the site, and "bam! Paywall!"

      If it has a paywall for me, well then how did Google crawl and index all the metadata for the SERP if it has a paywall?

      I have a long running theory that Archive.is knows how to work around an SEO trick that Google uses to get the content. Websites like the Baltimore Sun don’t want humans to view their content for free, but they do want Googlebot to see it for free.

    • chrisco255 14 days ago

      Sorry, thought it was obvious. Since it's using backend infrastructure to fetch the assets, it can crawl them as a bot in the same way that search engines do, without allowing cookies to be saved. Since scripts are often involved in the full rendering of a page, it clearly does allow for the scripts to load before snapshotting the DOM. But only the DOM and the assets and styles are preserved. Scripts are not. Most paywalls are simple scripts. If you disable JS and cookies, you'll often see the full text of an article.

      • killingtime74 14 days ago

        Some paywalls don't hide the content with JavaScript. It's just not there. They make you pay and then redirect you to another page.

      • JeremyNT 12 days ago

        I browse with scripts disabled by default and while some paywalls rely on js to block interactions after load many simply send only partial content and a login dialog.

        archive.is does "something" to get the full page for sites that specifically do not send all the content to non-logged-in user agents, and it's definitely different / more complex than simply running noscript.

      • joegibbs 14 days ago

        There are a lot of paywalls that are done server-side - for instance the Herald Sun, which is one of the biggest newspapers in Australia, does it like this. Even if you check the responses there's nothing in them but links to subscribe and a brief intro to the article.

lcnPylGDnU4H9OF 14 days ago

I think it's a browser extension which people who have access to the article use to send the article data to the archive server.

  • phoenixreader 14 days ago

    You mean the pages are crowdsourced? I don’t think so because many pages are archived only upon request. If I ask to archive a new page, archive.is provides it very quickly. This is not possible if the archive is built from crowdsourced data.

  • AlbertCory 14 days ago

    That is how RECAP works ("PACER" spelled backwards).

    In that case, the government is fine with it.

    • wolverine876 14 days ago

      I think that's how Sci-hub works, at least at some time in the past.

      • JCharante 14 days ago

        I thought people would send their journal credentials to Sci-hub

  • flerovium 14 days ago

    Can you explain? Who has purchased the subscription? I'm sure there's a no-redistribution clause in the subscription agreement.

    • lcnPylGDnU4H9OF 14 days ago

      The person who installed the browser extension would be paying the subscription and ignoring said clause.

      • riku_iki 14 days ago

        Curious if companies will eventually start watermarking articles to catch and sue extension users.

        • lcnPylGDnU4H9OF 14 days ago

          I suspect most content publishers would go to the source. If there are people who are already willing to pay for subscriptions and ignore the terms of those subscriptions, it's not much of a stretch that they'll ignore the fact that they got their subscription cancelled once (or twice, or however many times). The publisher would more likely see results taking legal action against the archivist.

          • dwater 14 days ago

            It didn't stop the RIAA from suing loads of people for downloading MP3s over the past two decades, claiming damages of thousands of dollars per song the individual downloaded.

            • riku_iki 14 days ago

              In this case (archive.is) they have a stronger case, since many people who could potentially buy a subscription read the article on archive.is instead, because an extension user violated the terms of their subscription.

              Also, the extension likely has terms of use prohibiting uploading copyrighted content, shifting liability onto users.

            • sam0x17 14 days ago


              They went after seeders

              • riku_iki 13 days ago

                downloaders also received legal letters.

      • flerovium 14 days ago

        But what is the relationship between archive.is and the user who installed the extension?

        • phneutral26 14 days ago

          The user helps free the Internet by using archive.is as an openly accessible backup platform.

        • inconceivable 14 days ago

          dude... haha it's a random person on the internet who is doing it for free.

        • lcnPylGDnU4H9OF 14 days ago

          They (archive.is) would have built the extension to send the current page content to their servers and the user would have installed it so they can archive internet pages. https://help.archive.org/help/save-pages-in-the-wayback-mach... (item 2)

          • Stagnant 14 days ago

            You are confusing archive.is with archive.org. Although archive.is does have an extension[1], it doesn't appear to capture any of the page contents; it simply sends the URL for archive.is to crawl.

            1: https://chrome.google.com/webstore/detail/archive-page/gcaim...

            • lcnPylGDnU4H9OF 14 days ago

              I wasn't exactly confusing them but yeah, I did link to an archive.org article. I was having difficulty finding something specific to archive.is.

              I think the distinction between the two is moot in this post. The question could very well have been "How does archive.org bypass paywalls?" Though it's interesting that archive.is seems to just crawl the URL. Indeed that means they wouldn't necessarily be able to bypass the paywall.

janejeon 14 days ago

> If it identifies itself as archive.is, then other people could identify themselves the same way.

Theoretically, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would allow websites to distinguish whether a request identifying itself as archive.is is actually from them (it fits one of the IP ranges) or is a fraudster.
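A sketch of the check a website could run; the "published" range below is a made-up example (a TEST-NET block), not anything archive.is actually publishes:

```python
import ipaddress

# Hypothetical ranges that archive.is might publish as canonically theirs.
PUBLISHED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]

def claims_check_out(request_ip: str) -> bool:
    """True if the request's source IP falls inside a published range."""
    addr = ipaddress.ip_address(request_ip)
    return any(addr in net for net in PUBLISHED_RANGES)

print(claims_check_out("203.0.113.45"))  # True: genuinely in-range
print(claims_check_out("198.51.100.7"))  # False: a fraudster elsewhere
```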

  • lazzlazzlazz 14 days ago

    It would be far better and more secure for archive.is to publish a public key on its site and then sign requests from its private key, which sites could optionally verify.

    • sublinear 14 days ago

      You just described client certificate auth

    • facile 14 days ago

      +1 on this!

  • flerovium 14 days ago

    In theory, this might work. But is it true? Do lots of sites have an archive.is whitelist?

    • arbitrage 14 days ago

      I really don't see why they would, if they're using a paywall in the first place.

Miner49er 14 days ago
armchairhacker 14 days ago

A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.
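That kind of client-side "paywall" can be illustrated with a toy page; the class name and markup here are invented, not taken from any real site:

```python
import re

# The full article is already in the HTML; a popup div plus
# `overflow: hidden` on <body> merely hides it from view.
html = ('<body style="overflow: hidden">'
        '<article>Full article text is right here.</article>'
        '<div class="paywall-popup">Subscribe to keep reading</div>'
        '</body>')

# What deleting the popup node and the overflow style in devtools amounts to:
cleaned = re.sub(r'<div class="paywall-popup">.*?</div>', '', html)
cleaned = cleaned.replace(' style="overflow: hidden"', '')
print(cleaned)
```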

There are your readers who will see a paywall and then pay, and there are your readers who will try to bypass it or simply not read at all. And articles spread through social media attention, and a paywalled article gets much less attention, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it.

Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with (or special-case) the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.

alex_young 14 days ago

Not specifically related to archive.is, but news sites have a tightrope to walk.

They need to allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use two main methods to achieve this: JavaScript DOM manipulation and IP address rate limiting.

Conceivably one could build a system which directly accesses a given document one time from a unique IP address and then cache the HTML version of the page for further serving.

retrocryptid 14 days ago

Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles, so when people search for terms in the article they'll find a hit and then get referred to the paywall.

Google doesn't want everyone to know what a Google indexing request looks like for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...

  • wolverine876 14 days ago

    Doesn't almost every site on the web know exactly what the Google bot looks like?

  • peter422 14 days ago

    Google gives precise details about how to verify their bot is crawling your site and how to denote what content is paywalled and what isn’t.

    • Aachen 14 days ago

      Bingo. This is what I use to incentivize using a nonmonopolistic search engine to find the few sites I run.

w1nst0nsm1th 14 days ago

If the people who know that told you, they could lose access to said resources.

But it's kind of an open secret; you're just not looking in the right place.

thallosaurus 13 days ago

I just tried it with a local newspaper. It did remove the floating pane, but it didn't unblur the text, which is also scrambled. (The protection used to be much weaker; Firefox reader mode could easily bypass it.)


xiekomb 14 days ago

I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean

  • flerovium 14 days ago

    That extension does work, but do we know they use it?

    • marcod 14 days ago

      They don't always use it, because I can archive a new page from my mobile phone browser, which doesn't even support extensions.

      My guess is that most content providers with paywalls serve the entire content, so search engines can pick it up, and then use scripts to raise the paywall - archive.is takes their snapshot before that happens / doesn't trigger those scripts.

  • DrDentz 14 days ago

    It's actually the opposite, for some news sites this extension links to archive.is because that's the only known way to bypass the paywall.

    • nora-puchreiner 14 days ago

      There are known ways to bypass paywalls which are impossible to implement within a browser extension but trivial on 12ft or archive.is. For example, using a Ukrainian residential proxy, since some news websites granted free access from Ukraine.

jrochkind1 14 days ago

Every once in a while I _do_ get a retrieval from archive.is that has the paywall intact.

But I don't know the answer either.

Yujf 14 days ago

I don't know about archive.is, but 12ft.io does identify itself as Google to bypass paywalls, AFAIK.

  • strunz 14 days ago

    12ft.io also doesn't work or is disabled for many sites that archive.is still works on

    • hda111 13 days ago

      Maybe because the creator of 12ft.io isn't anonymous

  • janejeon 14 days ago

    Wouldn't sites be able to see that requests from 12ft.io aren't coming from Google's IPs?

    • dpifke 14 days ago


      Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...

      You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json

      • rahimnathwani 14 days ago

        "Google recommends using reverse DNS to verify..."

        This is almost right. They recommend two steps:

        1. Use reverse DNS to find the hostname the IP address claims to have. (The IP address block owner can put any hostname in here, even if they don't own/control the domain.)

        2. Assuming the claimed hostname is on one of Google's domains, do a forward DNS lookup to verify that the original IP address is returned.

        The second step is the important one.
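        The two steps can be sketched as a function. The resolvers are injectable here so the example runs without live DNS, and the hostnames and IPs below are invented for illustration:

```python
import socket

def verify_googlebot(ip,
                     reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                     forward=lambda host: socket.gethostbyname(host)):
    """Two-step Googlebot verification along the lines Google describes."""
    try:
        host = reverse(ip)  # step 1: PTR lookup (contents are attacker-chosen)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # step 2: forward lookup must round-trip to the original IP
        return forward(host) == ip
    except OSError:
        return False

# A spoofed PTR record alone fails: the forward lookup points elsewhere.
print(verify_googlebot("203.0.113.9",
                       reverse=lambda ip: "crawl-x.googlebot.com",
                       forward=lambda host: "66.249.66.1"))   # False
# A consistent reverse/forward pair passes.
print(verify_googlebot("66.249.66.1",
                       reverse=lambda ip: "crawl-66-249-66-1.googlebot.com",
                       forward=lambda host: "66.249.66.1"))   # True
```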

firexcy 13 days ago

My hypothesis is that they use a set of generic methods (e.g., robot UAs, transient caches, and JS filtering) and rely on user reports (they have a Tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the Bypass Paywalls Clean extension will give you a good idea of the most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree that most of their audience is directed to pay; they have to leave backdoors here and there for purposes such as SEO.

shipscode 14 days ago

What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot prior to JS paywall execution combined with the Google referrer trick or something along those lines.

riffic 14 days ago

your browser usually downloads the entire article, and certain elements are overlaid.

it's trivial to bypass most paywalls, isn't it?

  • aidenn0 14 days ago

    Not for some (I think the Wall Street Journal). Apparently the AMP version of the page does work this way for WSJ though, which is how IA gets around the paywall.

not_your_vase 14 days ago

They use you as a proxy. If you (the person archiving it) have access to the site (either because you paid or have free articles), they can archive it too. If you don't have access, they only archive the paywall.

mr-pink 14 days ago

every time you visit they force some kid in a third world country to answer captchas until they can pay for one article's worth of content

jwildeboer 14 days ago

It’s internet magic. <rainbowmagicsparkles.gif> ;)

jakedata 14 days ago

Alas, it doesn't allow access to the comment section of the WSJ, which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse, and I won't send them my money, so I will just have to imagine it.