RicoElectrico 14 days ago

Nice try, media company employee ;)


  • PTOB 14 days ago

    My sentiments exactly.

fxtentacle 14 days ago

Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.

  • panopticon 14 days ago

    > Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

    I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?

    • phoenixreader 14 days ago

      Exactly. It also would not be difficult for website operators to embed hidden user info in their served pages, thereby finding out the archive.is account. This approach seems risky for archive.is.

    • hda111 14 days ago

      They could just copy the div with the content over, to evade detection by the website's owner.

  • tivert 14 days ago

    > Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

    I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.

    • Stagnant 14 days ago

      I don't think anyone knows who runs archive.is. I've tried looking into it a couple of times in the past, but there is surprisingly little information to be found. It must cost thousands if not tens of thousands a month to host all that data, and AFAIK they do not monetize it in any way.

      From what I gather it is probably some Russian person, as some old Stack Overflow conversations regarding the site led to an empty GitHub account with a Russian name. Also, back in 2015 the site owner blocked all Finnish IP addresses due to "an incident at the border"[1]. Finnish IPs have since been unblocked. It appears the site owner somehow thought he could end up on an EU-wide blacklist, which seemed like very conspiratorial thinking on his part.

      1: https://archive.is/Pum1p

      • killingtime74 14 days ago

        When I visit, each page has three ads: left, right, and bottom. Maybe you have an ad blocker?

  • Hamuko 14 days ago

    Would it be possible to check if archive.is is logged into a newspaper site by archiving one of the user management pages?

  • hoofhearted 14 days ago

    Negative. I used to assume this as well, but they somehow also bypass local paywalls, which has gotten me temporarily banned from r/Baltimore lol.

    They can somehow even bypass the Baltimore Sun's paywall, and I doubt they have subscriptions to every regional paper... could they?

    • jrochkind1 13 days ago

      Wait, you got banned from /r/Baltimore for posting archive.is links there? That's against the rules there? I would not have known that myself! (Also a Baltimorean).

      • hoofhearted 13 days ago

        I even tried to convince them to be in the mindset that Paul Graham created Hacker News to get more mindshare on YC. He gave the idea of Reddit to the 3 brilliant Ivy League founders who applied to YC with a basic Gmail extension, I think, that copied emails or something.

        So I tried convincing them that if it’s okay here on PG’s creation, then it should be okay on his other creation.

      • hoofhearted 13 days ago

        Yeah! Hahah

        I thought knowledge was free, and the Baltimore Sun sucked anyways. They charge money and don't even write hood stuff anymore. They laid off a bunch of people and moved printing to Delaware. My bet is the next step is that they announce they're shutting down all Locust Point operations and selling out so that Kevin Plank can build some new buildings there.

        I think I had to appeal my ban with a mod, and they mentioned how it's posted all over by the auto bot that sharing links to websites that bypass paywalls is against their subreddit rules :(

        I even made an official proposal to r/Baltimore to reconsider and lift that rule. The general consensus on the poll was that people felt the Baltimore Sun and its writers should be getting paid for their work, and I shouldn't be bypassing their paywalls lol.

        • jrochkind1 13 days ago

          You did it ONCE and got banned?

          I still can't find anything in the subreddit rules that clearly says this. (Not that most people read the rules first). Why don't they just add it to the rules?

          This is one of the things I dislike most about Reddit: it seems to be common to ban people for a single violation of a poorly documented or unstated rule.

          My main problem with reading the Sun online is it has so much adware that my browser slows to a crawl and sometimes crashes when I try to read it!

    • dev_0 14 days ago


  • flerovium 14 days ago

    But is it true? What evidence is there?

    This is a plausible explanation but is it true?

throwaway81523 14 days ago

Off topic, but for years I've been using a one-off proxy to strip JavaScript and crap from my local newspaper site (sfgate.com). It just reads the site with Python urllib.request and then does some DOM cleanup with Beautiful Soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.

Just in the past day or so, sfgate.com put in some kind of anti scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with selenium or maybe I'll just give up on newspapers and get my news from HN ;).

I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.

  • 1ark 14 days ago

    They are probably just checking headers such as user agent and cookies. I would copy whatever your normal browser sends and put it in the urllib.request. If that doesn't work, then it is likely something more sophisticated.
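    A minimal sketch of that suggestion, with made-up header values copied from a desktop Firefox (the actual headers your browser sends may differ):

```python
import urllib.request

# Hypothetical header set copied from a normal desktop browser; the exact
# values only matter insofar as they match what a real browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

req = urllib.request.Request("https://www.sfgate.com/", headers=BROWSER_HEADERS)
# html = urllib.request.urlopen(req).read()  # uncomment to actually fetch
print(req.get_header("User-agent"))
```

    If the fetch still returns 403 with browser headers, the block is likely more sophisticated than header checks (e.g., a JavaScript challenge).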

    • throwaway81523 14 days ago

      I will try that, but a quick look at the error page makes me think it tries to run a javascript blob.

      • ksala_ 14 days ago

        They're just checking the user agent:

            $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: curl/7.54.1' | head -1
            HTTP/2 403
            $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' | head -1
            HTTP/2 200

        One "trick" is that Firefox (and I assume Chrome?) allows you to copy a request as curl. Then you can see if that works in the terminal, and if it does, you can binary-search for the required headers.
      • chrisco255 14 days ago

        It probably does. But there are better modern tools like headless Chrome / Puppeteer that can fully render a page with scripts.

  • withinboredom 14 days ago

    Sounds like an ADA lawsuit waiting to happen. I'd send the editor an email explaining how they've reduced the usability of the site, especially if you're a paying customer.

World177 14 days ago

I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
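The guessed approach could be sketched like this; the robots.txt content below is a made-up example, and whether archive.is actually does anything like this is unknown:

```python
import re

# Toy robots.txt for illustration only.
robots_txt = """\
User-agent: Googlebot
Disallow: /search

User-agent: bingbot
Disallow: /

User-agent: *
Disallow: /admin
"""

# Collect every user-agent token the site names; each one (except "*")
# is a candidate User-Agent header to try when requesting the article.
candidates = [ua for ua in re.findall(r"(?im)^user-agent:\s*(\S+)", robots_txt)
              if ua != "*"]
print(candidates)  # ['Googlebot', 'bingbot']
```

Each candidate would then be sent as the User-Agent header in a normal request, keeping whichever one returns the full article.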

  • jrochkind1 13 days ago

    That user-agent seems to be in the robots.txt as _disallowed_, but somehow it gets through the paywall? That seems counter-intuitive.

    • World177 13 days ago

      It's just blocking the root. Look up the specifications for the robots.txt for more information. One purpose is to reduce loads on parts of the website that they do not want indexed.

      • jrochkind1 9 days ago

        Definitely incorrect: the paths in the robots.txt are prefixes, so `/` means anything starting with `/`, that is, everything. Look up the specifications for the robots.txt for more information! (Or, for instance, look up how you'd block the whole site in robots.txt if you wanted to!)
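        The prefix behavior is easy to check with Python's stdlib robots.txt parser (the bot name here is made up):

```python
from urllib import robotparser

# "Disallow: /" is a prefix rule, so it matches every path on the site.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: SomeBot",
    "Disallow: /",
])

print(rp.can_fetch("SomeBot", "https://example.com/"))           # False
print(rp.can_fetch("SomeBot", "https://example.com/any/page"))   # False
```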

      • KomoD 11 days ago

        No, / means the entire site: the root and everything below it.

  • flerovium 13 days ago

    That's an interesting idea, but is it true?

    • World177 13 days ago

      Websites usually want their pages indexed by search engines, as it increases the traffic they receive. They also often try to allow archival usage. The robots.txt usually has the user agents used by search engines defined, since one of its purposes is to reduce load on the website by excluding pages that do not need to be indexed.

      It might not be what is happening as there are other ways around, but this is a real possibility for how it could be done. (at least until the websites allowing other user agents decide they want to try to stop archive.is usage, etc)

      edit: I think the probability is high that they have multiple methods for archiving a website. Many people in this post note that archive.is has previously stated they just convert the link to an AMP link and archive that. I'm doubtful that's all they do, but it could be part of it.

      Using the robots.txt file in this way might not be how the authors of the website intended for it to be used. I could see that being used against them in a legal system if someone ever tried to stop them. In the past, I've seen websites tell people creating bots to purposefully change their user agent to one the site defined, but using an agent for a non-allowed purpose is what I was describing. There are multiple ways they could be archiving a website, though, so this is not necessarily how it is being done.

chrisco255 14 days ago

Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS, and image files (it shows the individual requests completing before the archival process is complete). It must then snapshot the rendered output, after which it serves those assets from their own domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.

  • strunz 14 days ago

    You're missing the point of "how does it bypass paywalls".

    • hoofhearted 13 days ago

      Surprisingly, nobody has mentioned this here yet. I'm thinking the key to this is SEO, SERPs, and newspapers wanting Google to find and index their content.

      This is my best guess. I've really put some thought into it, and it's the best logical assumption I've arrived at. I used to be a master of crawlers using Selenium years ago, but that burned me out a little, so I moved on.

      To test my hypothesis, go and find any article on Google that you know is probably paywalled. You click the content Google shows you, navigate into the site, and "bam! Paywall!"

      If it has a paywall for me, well then how did Google crawl and index all the metadata for the SERP if it has a paywall?

      I have a long running theory that Archive.is knows how to work around an SEO trick that Google uses to get the content. Websites like the Baltimore Sun don’t want humans to view their content for free, but they do want Googlebot to see it for free.

    • chrisco255 14 days ago

      Sorry, thought it was obvious. Since it's using backend infrastructure to fetch the assets, it can crawl them as a bot in the same way that search engines do, without allowing cookies to be saved. Since scripts are often involved in the full rendering of a page, it clearly does allow for the scripts to load before snapshotting the DOM. But only the DOM and the assets and styles are preserved. Scripts are not. Most paywalls are simple scripts. If you disable JS and cookies, you'll often see the full text of an article.

      • killingtime74 14 days ago

        Some paywalls don't hide the content with JavaScript. It's just not there. They make you pay and then redirect you to another page.

      • JeremyNT 12 days ago

        I browse with scripts disabled by default and while some paywalls rely on js to block interactions after load many simply send only partial content and a login dialog.

        archive.is does "something" to get the full page for sites that specifically do not send all the content to non-logged-in user agents, and it's definitely different / more complex than simply running noscript.

      • joegibbs 14 days ago

        There are a lot of paywalls that are done server-side - for instance the Herald Sun, which is one of the biggest newspapers in Australia, does it like this. Even if you check the responses there's nothing in them but links to subscribe and a brief intro to the article.

lcnPylGDnU4H9OF 14 days ago

I think it's a browser extension which people who have access to the article use to send the article data to the archive server.

  • phoenixreader 14 days ago

    You mean the pages are crowdsourced? I don’t think so because many pages are archived only upon request. If I ask to archive a new page, archive.is provides it very quickly. This is not possible if the archive is built from crowdsourced data.

  • AlbertCory 14 days ago

    That is how RECAP works ("PACER" spelled backwards).

    In that case, the government is fine with it.

    • wolverine876 14 days ago

      I think that's how Sci-hub works, at least at some time in the past.

      • JCharante 14 days ago

        I thought people would send their journal credentials to Sci-hub

  • flerovium 14 days ago

    Can you explain? Who has purchased the subscription? I'm sure there's a no-redistribution clause in the subscription agreement.

    • lcnPylGDnU4H9OF 14 days ago

      The person who installed the browser extension would be paying the subscription and ignoring said clause.

      • riku_iki 14 days ago

        Curious if companies will eventually start watermarking articles to catch and sue extension users.

        • lcnPylGDnU4H9OF 14 days ago

          I suspect most content publishers would go to the source. If there are people who are already willing to pay for subscriptions and ignore the terms of those subscriptions, it's not much of a stretch that they'll ignore the fact that they got their subscription cancelled once (or twice, or however many times). The publisher would more likely see results taking legal action against the archivist.

          • dwater 14 days ago

            It didn't stop the RIAA from suing loads of people for downloading MP3s over the past two decades, claiming damages of thousands of dollars per song the individual downloaded.

            • riku_iki 14 days ago

              In this case (archive.is) they have a stronger case, since many people who could potentially buy a subscription read the article on archive.is instead, because an extension user violated the terms of their subscription.

              Also, the extension likely has terms of use prohibiting uploading copyrighted content, shifting liability onto users.

            • sam0x17 14 days ago


              They went after seeders

              • riku_iki 13 days ago

                downloaders also received legal letters.

      • flerovium 14 days ago

        But what is the relationship between archive.is and the user who installed the extension?

        • phneutral26 14 days ago

          The user helps free the Internet by using archive.is as an openly accessible backup platform.

        • inconceivable 14 days ago

          dude... haha it's a random person on the internet who is doing it for free.

        • lcnPylGDnU4H9OF 14 days ago

          They (archive.is) would have built the extension to send the current page content to their servers and the user would have installed it so they can archive internet pages. https://help.archive.org/help/save-pages-in-the-wayback-mach... (item 2)

          • Stagnant 14 days ago

            You are confusing archive.is with archive.org. Although archive.is does have an extension[1], it doesn't appear to capture any of the page contents; it simply sends the URL for archive.is to crawl.

            1: https://chrome.google.com/webstore/detail/archive-page/gcaim...

            • lcnPylGDnU4H9OF 14 days ago

              I wasn't exactly confusing them but yeah, I did link to an archive.org article. I was having difficulty finding something specific to archive.is.

              I think the distinction between the two is moot in this post. The question could very well have been "How does archive.org bypass paywalls?" Though it's interesting that archive.is seems to just crawl the URL. Indeed that means they wouldn't necessarily be able to bypass the paywall.

janejeon 14 days ago

> If it identifies itself as archive.is, then other people could identify themselves the same way.

Theoretically, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would allow websites to distinguish whether a request identifying itself as archive.is is actually from them (it fits one of the IP ranges) or is a fraudster.
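A sketch of the check a website could run; the "published" range below is a made-up example (a TEST-NET block), not anything archive.is actually publishes:

```python
import ipaddress

# Hypothetical ranges that archive.is might publish as canonically theirs.
PUBLISHED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]

def claims_check_out(request_ip: str) -> bool:
    """True if the request's source IP falls inside a published range."""
    addr = ipaddress.ip_address(request_ip)
    return any(addr in net for net in PUBLISHED_RANGES)

print(claims_check_out("203.0.113.45"))  # True: genuinely in-range
print(claims_check_out("198.51.100.7"))  # False: a fraudster elsewhere
```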

  • lazzlazzlazz 14 days ago

    It would be far better and more secure for archive.is to publish a public key on its site and then sign requests from its private key, which sites could optionally verify.

    • sublinear 14 days ago

      You just described client certificate auth

    • facile 14 days ago

      +1 on this!

  • flerovium 14 days ago

    In theory, this might work. But is it true? Do lots of sites have an archive.is whitelist?

    • arbitrage 14 days ago

      I really don't see why they would, if they're using a paywall in the first place.

Miner49er 14 days ago
armchairhacker 14 days ago

A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.
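That kind of client-side "paywall" can be illustrated with a toy page; the class name and markup here are invented, not taken from any real site:

```python
import re

# The full article is already in the HTML; a popup div plus
# `overflow: hidden` on <body> merely hides it from view.
html = ('<body style="overflow: hidden">'
        '<article>Full article text is right here.</article>'
        '<div class="paywall-popup">Subscribe to keep reading</div>'
        '</body>')

# What deleting the popup node and the overflow style in devtools amounts to:
cleaned = re.sub(r'<div class="paywall-popup">.*?</div>', '', html)
cleaned = cleaned.replace(' style="overflow: hidden"', '')
print(cleaned)
```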

There are your readers who will see a paywall and then pay, and there are your readers who will try to bypass it or simply not read at all. And articles spread through social media attention, and a paywalled article gets much less attention, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it.

Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with (or special-case) the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.

alex_young 14 days ago

Not specifically related to archive.is, but news sites have a tightrope to walk.

They need to allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use two main methods to achieve this: JavaScript DOM manipulation and IP address rate limiting.

Conceivably one could build a system which directly accesses a given document one time from a unique IP address and then cache the HTML version of the page for further serving.

retrocryptid 14 days ago

Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles, so when people search for terms in the article they'll find a hit and then get referred to the paywall.

Google doesn't want everyone to know what a Google indexing request looks like for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...

  • wolverine876 14 days ago

    Doesn't almost every site on the web know exactly what the Google bot looks like?

  • peter422 14 days ago

    Google gives precise details about how to verify their bot is crawling your site and how to denote what content is paywalled and what isn’t.

    • Aachen 14 days ago

      Bingo. This is what I use to incentivize using a nonmonopolistic search engine to find the few sites I run.

w1nst0nsm1th 14 days ago

If the people who know that told you, they could lose access to said resources.

But it's kind of an open secret; you're just not looking in the right place.

thallosaurus 13 days ago

I just tried it with a local newspaper. It did remove the floating pane, but it didn't unblur the text, which is also scrambled. (The protection used to be much weaker; Firefox reader mode could easily bypass it.)


xiekomb 14 days ago

I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean

  • flerovium 14 days ago

    That extension does work, but do we know they use it?

    • marcod 14 days ago

      They don't always use it, because I can archive a new page from my mobile phone browser, which doesn't even support extensions.

      My guess is that most content providers with paywalls serve the entire content, so search engines can pick it up, and then use scripts to raise the paywall - archive.is takes their snapshot before that happens / doesn't trigger those scripts.

  • DrDentz 14 days ago

    It's actually the opposite, for some news sites this extension links to archive.is because that's the only known way to bypass the paywall.

    • nora-puchreiner 14 days ago

      There are known ways to bypass paywalls which are impossible to implement within a browser extension but trivial on 12ft or archive.is. For example, using a Ukrainian residential proxy, since some news websites granted free access from Ukraine.

jrochkind1 14 days ago

Every once in a while I _do_ get a retrieval from archive.is that has the paywall intact.

But I don't know the answer either.

Yujf 14 days ago

I don't know about archive.is, but 12ft.io does identify itself as Google to bypass paywalls, AFAIK.

  • strunz 14 days ago

    12ft.io also doesn't work or is disabled for many sites that archive.is still works on

    • hda111 13 days ago

      Maybe because the creator of 12ft.io isn't anonymous

  • janejeon 14 days ago

    Wouldn't sites be able to see that requests from 12ft.io aren't coming from Google's IPs?

    • dpifke 14 days ago


      Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...

      You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json

      • rahimnathwani 14 days ago

        "Google recommends using reverse DNS to verify..."

        This is almost right. They recommend two steps:

        1. Use reverse DNS to find the hostname the IP address claims to have. (The IP address block owner can put any hostname in here, even if they don't own/control the domain.)

        2. Assuming the claimed hostname is on one of Google's domains, do a forward DNS lookup to verify that the original IP address is returned.

        The second step is the important one.
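        The two steps can be sketched as a function. The resolvers are injectable here so the example runs without live DNS, and the hostnames and IPs below are invented for illustration:

```python
import socket

def verify_googlebot(ip,
                     reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                     forward=lambda host: socket.gethostbyname(host)):
    """Two-step Googlebot verification along the lines Google describes."""
    try:
        host = reverse(ip)  # step 1: PTR lookup (contents are attacker-chosen)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # step 2: forward lookup must round-trip to the original IP
        return forward(host) == ip
    except OSError:
        return False

# A spoofed PTR record alone fails: the forward lookup points elsewhere.
print(verify_googlebot("203.0.113.9",
                       reverse=lambda ip: "crawl-x.googlebot.com",
                       forward=lambda host: "66.249.66.1"))   # False
# A consistent reverse/forward pair passes.
print(verify_googlebot("66.249.66.1",
                       reverse=lambda ip: "crawl-66-249-66-1.googlebot.com",
                       forward=lambda host: "66.249.66.1"))   # True
```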

firexcy 13 days ago

My hypothesis is that they use a set of generic methods (e.g., robot UAs, transient caches, and JS filtering) and rely on user reports (they have a Tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the Bypass Paywalls Clean extension will give you a good idea of the most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree that most of their audience is directed to pay; they have to leave backdoors here and there for purposes such as SEO.

shipscode 14 days ago

What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot prior to JS paywall execution combined with the Google referrer trick or something along those lines.

riffic 14 days ago

your browser usually downloads the entire article, and certain elements are overlaid.

it's trivial to bypass most paywalls, isn't it?

  • aidenn0 14 days ago

    Not for some (I think the Wall Street Journal). Apparently the AMP version of the page does work this way for WSJ though, which is how IA gets around the paywall.

not_your_vase 14 days ago

They use you as a proxy. If you (the person archiving it) have access to the site (either because you paid or have free articles), they can archive it too. If you don't have access, they only archive the paywall.

mr-pink 14 days ago

every time you visit they force some kid in a third world country to answer captchas until they can pay for one article's worth of content

jwildeboer 14 days ago

It’s internet magic. <rainbowmagicsparkles.gif> ;)

jakedata 14 days ago

Alas, it doesn't allow access to the comment section of the WSJ, which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse, and I won't send them my money, so I will just have to imagine it.