It’s a cute idea, but ultimately not very useful. An API is more than just an endpoint that gives easy-to-parse results. The most important part is that an API is a contract. An API implies that things won’t suddenly break without prior announcement. Any form of web scraping, no matter how cleverly done, is inherently fragile. The site can change its front-end for any reason, which could break your scraper, so you cannot rely on such an interface.
I wonder if just checking the site every day (or every minute) would solve this.
It's not necessarily the structure of the source data (the DOM, the HTML, etc.) that needs to be contractually consistent, but rather the translator. The translator in this case is the service behind the endpoints.
> I wonder if just checking the site every day (or every minute) would solve this.
No, because a webpage makes no promise not to change. Even if you check every minute, can your system handle random one-minute periods of unpredictable behavior? What if they remove data? What if the meaning of the data changes (e.g., a field that used to show a maximum value now shows the average)? How would your system deal with that? What if they are running an A/B test and 10% of your ‘API’ requests return a different page?
This is not a technical problem and the solution is not a technical one. You need to have some kind of relationship with the entity whose data you are consuming or be okay with the fact that everything can just stop working at any random moment in time.
That's just part and parcel of relying on third parties - you should always price in the maintenance burden of keeping up with potential changes on their end. That burden is a lot lower if the third party cooperates with you and provides an explicit contract and backwards compatibility, but it's still not zero.
It’s not about the maintenance cost, it’s about continuity of service. If you scrape a website, things may break at any time. If you use a proper API and have a contract with the supplier, you will have the opportunity to make changes before things break.
I scrape website content regularly (usually as one-offs) and have a hand-crafted extractor template where I just fill in a few arguments (mainly CSS selectors and some options) to get it working quickly. These days, I do sometimes ask AI to do this for me by giving it the HTML.
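The template is roughly like this (a minimal sketch; the URL, selectors and field names are placeholders I fill in per site):

    import requests
    from bs4 import BeautifulSoup

    # The "few arguments" I fill in per site: a URL, an item selector,
    # and a CSS selector per field. All values below are placeholders.
    CONFIG = {
        "url": "https://example.com/listings",
        "item_selector": "div.listing",
        "fields": {
            "title": "h2 a",
            "price": "span.price",
        },
    }

    def extract(config):
        html = requests.get(config["url"], timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        rows = []
        for item in soup.select(config["item_selector"]):
            row = {}
            for name, selector in config["fields"].items():
                node = item.select_one(selector)
                row[name] = node.get_text(strip=True) if node else None
            rows.append(row)
        return rows

    print(extract(CONFIG))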
The issue is that for any serious use of this concept, some manual adjustment is almost always needed. This service says, "Refine your scraper at any time by chatting with the AI agent," but from what I can tell, you can't actually see the code it generates.
Relying solely on the results and asking the AI to tweak them can work, but often the output is too tailored to a specific page and fails to generalize (essentially "overfitting"). And surprisingly, this back-and-forth can be more tedious and time-consuming than just editing a few lines of code yourself. Also, if you can't directly edit the code behind the scenes, there are situations where you'll never be able to get the exact result you want, no matter how much you try to explain it to the AI in natural language.
I’ve had no shortage of trouble using LLMs for scrapers, because for some reason they almost always ignore my instructions to use something other than the class name for selectors. They love to use the hashed class names (from emotion, styled-components, or whatever CSS-in-JS library du jour) that change way too often.
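What I actually want is for the scraper to anchor on things that survive a redeploy; the difference looks something like this (the markup and selectors are illustrative, not from any real site):

    from bs4 import BeautifulSoup

    # Illustrative snippet of markup, not from any real site.
    html = """
    <section data-testid="product">
      <h2 class="css-1x2y3z">Widget</h2>
      <span class="css-9q8w7e" data-testid="price">9.99</span>
    </section>
    """
    soup = BeautifulSoup(html, "html.parser")

    # Fragile: hashed CSS-in-JS class names change on every redeploy.
    fragile = soup.select_one(".css-9q8w7e")

    # Sturdier: data attributes or document structure.
    stable = soup.select_one('[data-testid="price"]')

    print(fragile.get_text(), stable.get_text())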
Way too little information on the homepage. Does this handle pagination? What about sites behind authentication? I assume the generated API is stable, i.e. the shape of the JSON will not change after a scraper is built, but what if the site changes its DOM, does the scraper need to be regenerated? Does this attempt to defeat anti-bot and anti-scraper walls like Cloudflare?
No no, it's good that it's simple to understand.
All those details can go in the docs / FAQ section.
Where are those docs?
This is relevant to my interests[0]
Based on the website I was quite skeptical. It looks too much like an "indiehacker", minimum-almost-viable-product, fake-it-till-you-make-it, trolling-for-email-addresses kind of website.
But after a quick search on twitter, it seems like people are actually using it and reporting good results. Maybe I'll take a proper look at it at some point.
I'd still like to know more about pricing, how it deals with cloudflare challenges, non-semantic markup and other awkwardnesses.
[0] https://github.com/Joeboy/cinescrapers
No information on pricing on the site.
I'm surprised (and could be wrong) that no one has made a Chrome extension that just controls a page and exposes the output to localhost for consumption as an API. Similar to using Chrome WebDriver, but without the setup.
Isn't that basically what browser-use is?
I kind of agree and don't. You could say HTTP+DOM is the API, and we're already there. But it lacks structure and a more explicit regularity (in part because it's meant for human consumption, not programming). And if you were to describe the whole protocol (including CSS and JS, as they can change the ordering, even the content, of what's shown), it's far more complicated than the equivalent, distilled representation.
There are efforts going back at least fifteen years to extract ontologies from natural language [0] and HTML structure [1].
[0]: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d... (2010) [PDF]
[1]: https://doi.org/10.1016/j.dss.2009.02.011 (2009)
i love the idea!
i know that https://expand.ai/ is doing something similar, maybe worth checking out
I really like the simplicity of the offering. The website looks great (to a human) and explains the API idea very simply. Good stuff!
It says that the backend is down, I guess I'll have to wait. Hope I don't forget about it before.
Nice idea. In practice many sites have different methods to prevent scraping. Large risk in doing things manually, imho.
Huh, I have been working on a solution to that problem.
My project allows you to define rules for various sites, so eventually everything is scraped correctly. For YouTube, yt-dlp is also used to augment results.
I can crawl using requests, Selenium, httpx and others. Responses are returned as JSON, so they are easy to process.
The downside is that it may not be the fastest solution, and I have not tested it against proxies.
https://github.com/rumca-js/crawler-buddy
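Consuming a response looks roughly like this (the endpoint path and field names below are only illustrative, not the project's exact API; see the repo for the real interface):

    import requests

    # Hypothetical local endpoint and parameters, for illustration only.
    resp = requests.get(
        "http://127.0.0.1:3000/crawl",
        params={"url": "https://example.com/article"},
        timeout=60,
    )
    data = resp.json()  # the crawler replies with JSON
    print(data.get("title"), data.get("link"))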
I saw your video on YouTube, really impressive.
pretty cool idea. using stagehand under the hood?
It's great being an independent site in 2025.
You get fucked by Google promoting AIOs and hindustantimes articles for everything in your niche on one hand, and by these scrapers knocking your server offline on the other.
Mobile UX is completely broken. This would be a 5-minute fix with Claude and Cursor. Signals to me that I can expect the backend to struggle with anything basic like a captcha, etc.