It’s a cute idea, but ultimately not very useful. An API is more than just an endpoint that gives easy-to-parse results. The most important part is that an API is a contract. An API implies that things won’t suddenly break without prior announcement. Any form of web scraping, no matter how cleverly done, is inherently fragile. The site can change its front-end for any reason, which could break your scraper, so you cannot rely on such an interface.
I wonder if just checking the site every day (or every minute) would solve this.
It's not necessarily the structure of the source data (the DOM, the HTML, etc.) that needs to be contractually consistent, but rather the translator. The translator in this case is the service behind the endpoints.
> I wonder if just checking the site every day (or every minute) would solve this.
No, because a webpage makes no promise not to change. Even if you check every minute, can your system handle random one-minute periods of unpredictable behavior? What if they remove data? What if the meaning of the data changes (e.g., a field that used to show a maximum value now shows the average)? How would your system deal with that? What if they are running an A/B test and 10% of your ‘API’ requests return a different page?
This is not a technical problem and the solution is not a technical one. You need to have some kind of relationship with the entity whose data you are consuming or be okay with the fact that everything can just stop working at any random moment in time.
That's just part and parcel of relying on third parties - you should always price in the maintenance burden of keeping up with potential changes on their end. That burden is a lot lower if the third party cooperates with you and provides an explicit contract and backwards compatibility, but it's still not zero.
It’s not about the maintenance cost, it’s about continuity of service. If you scrape a website, things may break at any time. If you use a proper API and have a contract with the supplier, you will have the opportunity to make changes before things break.
I scrape website content regularly (usually as one-offs) and have a hand-crafted extractor template where I just fill in a few arguments (mainly CSS selectors and some options) to get it working quickly. These days, I do sometimes ask AI to do this for me by giving it the HTML.
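The template is roughly like this (a minimal sketch; the URL, selectors and field names are placeholders I fill in per site):

    import requests
    from bs4 import BeautifulSoup

    # The "few arguments" I fill in per site: a URL, an item selector,
    # and a CSS selector per field. All values below are placeholders.
    CONFIG = {
        "url": "https://example.com/listings",
        "item_selector": "div.listing",
        "fields": {
            "title": "h2 a",
            "price": "span.price",
        },
    }

    def extract(config):
        html = requests.get(config["url"], timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        rows = []
        for item in soup.select(config["item_selector"]):
            row = {}
            for name, selector in config["fields"].items():
                node = item.select_one(selector)
                row[name] = node.get_text(strip=True) if node else None
            rows.append(row)
        return rows

    print(extract(CONFIG))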
The issue is that for any serious use of this concept, some manual adjustment is almost always needed. This service says, "Refine your scraper at any time by chatting with the AI agent," but from what I can tell, you can't actually see the code it generates.
Relying solely on the results and asking the AI to tweak them can work, but often the output is too tailored to a specific page and fails to generalize (essentially "overfitting"). And surprisingly, this back-and-forth can be more tedious and time-consuming than just editing a few lines of code yourself. Also, if you can't directly edit the code behind the scenes, there are situations where you'll never be able to get the exact result you want, no matter how much you try to explain it to the AI in natural language.
I’ve had no shortage of trouble using LLMs for scrapers, because for some reason they almost always ignore my instructions to use something other than the class name for selectors. They love to use the hashed class names (from emotion, styled-components, or whatever CSS-in-JS library du jour) that change way too often.
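What I actually want is for the scraper to anchor on things that survive a redeploy; the difference looks something like this (the markup and selectors are illustrative, not from any real site):

    from bs4 import BeautifulSoup

    # Illustrative snippet of markup, not from any real site.
    html = """
    <section data-testid="product">
      <h2 class="css-1x2y3z">Widget</h2>
      <span class="css-9q8w7e" data-testid="price">9.99</span>
    </section>
    """
    soup = BeautifulSoup(html, "html.parser")

    # Fragile: hashed CSS-in-JS class names change on every redeploy.
    fragile = soup.select_one(".css-9q8w7e")

    # Sturdier: data attributes or document structure.
    stable = soup.select_one('[data-testid="price"]')

    print(fragile.get_text(), stable.get_text())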
Way too little information on the homepage. Does this handle pagination? What about sites behind authentication? I assume the generated API is stable, i.e. the shape of the JSON will not change after a scraper is built, but what if the site changes its DOM, does the scraper need to be regenerated? Does this attempt to defeat anti-bot and anti-scraper walls like Cloudflare?
No no, it's good that it's simple to understand.
All those details can go in the docs / FAQ section.
Where are those docs?
This is relevant to my interests[0]
Based on the website I was quite skeptical. It looks too much like an "indiehacker", minimum-almost-viable-product, fake-it-till-you-make-it, trolling-for-email-addresses kind of website.
But after a quick search on twitter, it seems like people are actually using it and reporting good results. Maybe I'll take a proper look at it at some point.
I'd still like to know more about pricing, how it deals with cloudflare challenges, non-semantic markup and other awkwardnesses.
[0] https://github.com/Joeboy/cinescrapers
No information on pricing on the site.
I'm surprised (and could be wrong) that no one has made a Chrome extension that just controls a page and exposes the output to localhost for consumption as an API. Similar to using Chrome WebDriver, but without the setup.
Isn't that basically what browser-use is?
I kind of agree and don't. You could say HTTP+DOM is the API, and we're already there. But it lacks structure and a more explicit regularity (in part because it's meant for human consumption, not programming). And if you were to describe the whole protocol (including CSS and JS, as they can change the ordering, even the content, of what's shown), it's far more complicated than the equivalent, distilled representation.
There are efforts going back at least fifteen years to extract ontologies from natural language [0] and HTML structure [1].
[0]: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d... (2010) [PDF]
[1]: https://doi.org/10.1016/j.dss.2009.02.011 (2009)
i love the idea!
i know that https://expand.ai/ is doing something similar, maybe worth checking out
I really like the simplicity of the offering. The website looks great (to a human) and explains the API idea very simply. Good stuff!
It says that the backend is down, I guess I'll have to wait. Hope I don't forget about it before.
Nice idea. In practice many sites have different methods to prevent scraping. Large risk in doing things manually, imho.
Huh, I have been working on a solution to that problem.
My project allows you to define rules for various sites, so eventually everything is scraped correctly. For YouTube, yt-dlp is also used to augment results.
I can crawl using requests, Selenium, httpx and others. Responses are returned as JSON, so they are easy to process.
The downside is that it may not be the fastest solution, and I have not tested it against proxies.
https://github.com/rumca-js/crawler-buddy
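Consuming a response looks roughly like this (the endpoint path and field names below are only illustrative, not the project's exact API; see the repo for the real interface):

    import requests

    # Hypothetical local endpoint and parameters, for illustration only.
    resp = requests.get(
        "http://127.0.0.1:3000/crawl",
        params={"url": "https://example.com/article"},
        timeout=60,
    )
    data = resp.json()  # the crawler replies with JSON
    print(data.get("title"), data.get("link"))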
I saw your video on YouTube, really impressive.
pretty cool idea. using stagehand under the hood?
It's great being an independent site in 2025.
You get fucked by Google promoting AIOs and hindustantimes articles for everything in your niche on one hand, and by these scrapers knocking your server offline on the other.
Mobile UX is completely broken. This would be a 5-minute fix with Claude and Cursor. Signals to me that I can expect the backend to struggle with anything basic like a captcha, etc.