Launch HN: GPT Driver (YC S21) – End-to-end app testing in natural language

129 points by cschiller 8 months ago

Hey HN, we are Chris and Chris from MobileBoost (https://mobileboost.io/). We’re building GPT Driver, an AI-native approach to create and execute end-to-end (E2E) tests on mobile applications. Our solution allows teams to define tests in natural language and prevents test flakiness by taking a visual approach paired with LLM (Large Language Model) reasoning. This helps achieve E2E test coverage with a fraction of the usual effort.

You can watch a brief product walkthrough here: https://www.youtube.com/watch?v=5-Ge2fqdlxc

In terms of trying the product out: since the service is resource-intensive (we provide hosted virtual/real phone instances), we don't currently have a playground available. However, you can see some examples here https://mobileboost.io/showcases and book a demo of GPT Driver testing your app through our website.

Why we built this: working at previous startups and scaleups, we saw how as app teams grew, QA teams would struggle to ensure everything was still working. This caused tension between teams and resulted in bugs making it into production.

You’d expect automated tests to help, but these were a huge effort because only engineers could create the tests, and the apps themselves kept changing—breaking the tests regularly and leading to high maintenance overhead. Functional tests often failed not because of actual app errors, but due to changes like copy updates or modifications to element IDs. This was already a challenge, even before considering the added complexities of multiple platforms, different environments, multilingual UIs, marketing popups, A/B tests, or minor UI changes from third-party authentication or payment providers.

We realized that combining computer vision with LLM reasoning could solve the common flakiness issues in E2E testing. So, we launched GPT Driver—a no-code editor paired with a hosted emulator/simulator service that allows teams to set up test automation efficiently. Our visual + LLM reasoning test execution reduces false alarms, enabling teams to integrate their E2E tests into their CI/CD pipelines without getting blocked. Some interesting technical challenges we faced along the way: (1) UI Object Detection from Vision Input: We had to train object detection models (YOLO and Faster R-CNN based) on a subset of the RICO dataset as well as our own dataset to be able to interact accurately with the UI. (2) Reasoning with Current LLMs: We have to shorten instructions, action history, and screen content during runtime for better results, as handling large amounts of input tokens remains a challenge. We also work with reasoning templates to achieve robust decision-making. (3) Performance Optimization: We optimized our agentic loop to make decisions in less than 4 seconds. To reduce this further, we implemented caching mechanisms and offer a command-first approach, where our AI agent only takes over when the command fails.

Since launching GPT Driver, we’ve seen adoption by technical teams, both with and without dedicated QA roles. Compared to code-based tests, the core benefit is the reduction of both the manual work and the time required to maintain effective E2E tests. This approach is particularly powerful for apps which have a lot of dynamic screens and content such as Duolingo which we have been working with since a couple of months. Additionally, the tests can now also be managed by non-engineers.

We’d love to hear about your experiences with E2E test automation—what approaches have worked or didn’t work for you? What features would you find valuable?

msoad 8 months ago

I work in this space. We manage thousands of e2e tests. The pain has never been in writing the tests. Frameworks like Playwright are great at the UX. And having code editors like Cursor makes it even easier to write the tests. Now, if I could show Cursor the browser, it would be even better, but that doesn’t work today since most multimodal models are too slow to understand screenshots.

It used to be that the frontend was very fragile. XVFB, Selenium, ChromeDriver, etc., used to be the cause of pain, but recently the frontend frameworks and browser automation have been solid. Headless Chrome hardly lets us down.

The biggest pain in e2e testing is that tests fail for reasons that are hard to understand and debug. This is a very, very difficult thing to automate and requires AGI-level intelligence to really build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test fails. When an e2e test flakes, in a lot of cases we ignore it. I have been in other orgs where this is the case too. I wish there was a system that would follow up and generate a report that says, “This e2e test failed because service XYZ had a null pointer exception in this line,” but that doesn’t exist today. In most of the companies I’ve been at, we had complex enough infra that the error message never makes it to the frontend so we can see it in the logs. OpenTelemetry and other tools are promising, but again, I’ve never seen good enough infra that puts that all together.

Writing tests is not a pain point worth buying a solution for, in my case.

My 2c. Hopefully it’s helpful and not too cynical.

hn_throwaway_99 8 months ago

While I agree with your primary pain point, I would argue that that really isn't specific to tests at all. It sounds like what you're really saying is that when something goes wrong, it's really difficult to determine which component in a complex system is responsible. I mean, from what you've described (and from what I've experienced as well), you would have the same if not harder problem if a user experienced a bug on the front end and then you had to find the root cause.
That is, I don't think a framework focused on front end testing should really be where the solution for your problem is implemented. You say "This is a very, very difficult thing to automate and requires AGI-level intelligence to really build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test fails." - I would argue what you really need is better log aggregation and system tracing. And I'm not saying this to be snarky (at scale with a bunch of different teams managing different components I've seen that it can be difficult to get everyone on the same aggregation/tracing framework and practices), but that's where I'd focus, as you'll get the dividends not only in testing but in runtime observability as well.
- Lienetic 8 months ago
  
  Agreed. Is there a good tool you'd recommend for this?
  
  hn_throwaway_99 8 months ago
  
  It's been quite some time but New Relic is a popular observability tool whose primary goal (at least the original primary goal I'd say) is being able to tie together lots of distributed systems to make it easier to do request tracing and root cause analysis. I was a big fan of New Relic when I last used it, but if memory serves me correctly it was quite expensive.
- krainboltgreene 8 months ago
  
  "OpenTelemetry and other tools are promising, but again, I’ve never seen good enough infra that puts that all together."
  It's a two paragraph comment and you somehow missed it.
  
  hn_throwaway_99 8 months ago
  
  I did read it, and I don't understand why you feel the need to be an asshole.
  Like I said in my comment, I do think getting everyone on the same page in a large, diverse organization is difficult. That said, it's not rocket science, and it's usually difficult because there aren't organizational incentives in place to actually ensure teams prioritize making system-wide observability work.
  FWIW, the process I've seen at more than 1 company is that people bitch about debugging being a pain, they put in a couple half measures to improve things, and then finally it becomes so much of a pain that they say "fine, we need to get all of our ducks in a row", execs make it a priority, and then they finally implement a system-wide observability process that works.
  
  msoad 8 months ago
  
  Exactly! I've never seen a 5000+ eng org that have all their ducks in a row when it comes to telemetry. it's one of those things that you can't put a team in charge of it and get results. everyone have to be on the same page which in a big org is hardly the case.
ec109685 8 months ago

There are silly things that trip up e2e tests like a cookie pop up or network failures and whatnot. An AI can plow through these in a way that a purely coded test can’t.
Those types of transient issues aren’t something that you would want to fail a test for given it still would let the human get the job done if it happened in the field.
This seems like the most useful part of adding AI to e2e tests. The world is not deterministic, which AI handles well.
Uber takes this approach here: https://www.uber.com/blog/generative-ai-for-high-quality-mob...
- tomatohs 8 months ago
  
  I predict an all out war over deterministic vs non-deterministic testing, or at least a new buzzword for fuzzy testing. Product people understand that a cookie banner "shouldn't" prevent the test from passing, but an engineer would entirely disagree (see the rest of the convos below).
  Engineers struggle with non-deterministic output. It removes the control and "truth" that engineering is founded upon. It's going to take a lot of work (or again, a toung-in-cheek buzzword like "chaos testing") to get engineers to accept the non-deterministic behavior.
cschiller 8 months ago

Thanks for your thoughtful response! Agree that digging into the root cause of a failure, especially in complex microservice setups, can be incredibly time-consuming.
Regarding writing robust e2e tests, I think it really depends on the team's experience and the organization’s setup. We’ve found that in some organizations—particularly those with large, fast-moving engineering teams—test creation and maintenance can still be a bottleneck due to the flakiness of their e2e tests.
For example, we’ve seen an e-commerce team with 150+ mobile engineers struggle to keep their functional tests up-to-date while the company was running copy and marketing experiments. Another team in the food delivery space faced issues where unrelated changes in webviews caused their e2e tests to fail, making it impossible to run tests in a production-like system.
Our goal is to help free up that time so that teams can focus on solving bigger challenges, like the debugging problems you’ve mentioned.
- Terretta 8 months ago
  
  Integrate with https://www.honeycomb.io
fullstackchris 8 months ago

To be fair, this is NOT the case with native mobile apps. There are some projects like detox that are trying to make e2e tests easier, but the tests themselves can be painful, run fairly slow on emulators, etc.
Maybe someday the tooling for mobile will be as good as headless chrome is for web :)
Agreed though that the followup debugging of a failed test could be hard to automate in some cases.
- edelans 8 months ago
  
  I think we can claim that at Waldo.
  Check for yourself: I've just recorded this [1] scripted test on the wikipedia mobile app, and it yields this [2] Replay. In less than a minute we spin up a fresh virtual device, install your app on it, execute the 8 steps of the script.
  As a result, you get the Replay of the session : video synchronized with interaction timeline, device & network logs, so you can debug in full context.
  [1]: https://github.com/waldoapp/waldo-programmatic-samples/blob/... [2]: https://share.waldo.com/7a45b5bd364edbf17c578070ce8bde220240...
  
  egeek 8 months ago
  
  Do you have any pricing info available? All I can see is get started for free, but no info on what it might cost later
rafaelmn 8 months ago

I think either you're overselling the maturity of the ecosystem or I've been unfortunate enough to get stuck with the worst option out there - Cypress. I run into tooling limitations and issues regularly, only to eventually find an open GitHub issue with no solution or some such.
codedokode 8 months ago

Sorry if it is a stupid idea, but cannot you log all messages to a separate file for each test (or attach test id to the messages)? Then if the test fails, you can see where the error occured.
- msoad 8 months ago
  
  Where I work there are 1,500 microservices. How do I get a log of all of those services -- only related to my test's requests in a file?
  I know there are solutions for this, but in the real world I have not seen it properly implemented.
  
  antonvs 8 months ago
  
  This works easily enough in the major cloud environments, since logging tends to be automatic and centralized. The only thing you need to do is make sure that a common request id or similar propagates to all the services, which is not that difficult.
  
  ergeysay 8 months ago
  
  As you said, OpenTelemetry and friends can help. I had great success with these.
  I am curious, what were implementation issues you have encountered?
TechDebtDevin 8 months ago

I doubt that screenshot methods are the bottleneck considering that's the method Microsoft and Anthropic are using.
- tomatohs 8 months ago
  
  It's absolutely not the bottleneck. OpenAI can process a full resolution screenshot in about 4 seconds.
tomatohs 8 months ago

You're totally right here, but "debugging failed tests" is a mature problem that assumes you have working tests and people to write them. Most companies don't have the resources to dedicate full engineer time to QA, and if they do nobody maintains the test.
Debugging failed test is a "first world problem"
- AdieuToLogic 8 months ago
  
  > ... "debugging failed tests" is a mature problem that assumes you have working tests and people to write them.
  I am reminded of an old s/w engineering law:
  Developers can test their solution or Customers will. Either way, the system will be tested.

batikha 8 months ago

Very cool! I already can see a lot of "this is already solved by playwright/cypress/selenium/deterministic stuff" in the comments.

Over nearly 10 years in startups (big and small), I've been consistently surprised by how much I hear that "testing has been solved", yet I see very little automation in place and PMs/QAs/devs and sometimes CEOs and VPs doing lots of manual QA. And not only on new features (which is a good thing), also on happy path / core features (arguably a waste of time to test things over and over again).

More than once I worked for a company that was against having a manual QA team, out of principle and more or less valid reasons (we use a typed language so less bug, engineers are empowered, etc etc), but ended up hiring external consultants to handle QA after a big quality incident.

The amount of mismatch between theory and practice in this field is impressive.

epolanski 8 months ago

> yet I see very little automation in place and PMs/QAs/devs and sometimes CEOs and VPs doing lots of manual QA
Because software is a clownish mimicking of engineering that lacks any real solid and widespread engineering practices.
It's cultural.
Crowds boast their engineering degrees, but have little to show but leetcode and system design black belts, even though their day to day job rarely requires them to architect systems or reimplement a new Levehnstein distance but would benefit a lot from thoroughly investigating functional and non functional requirements and encoding and maintaining those through automation.
There's very little engineering in software, people really care about the borderline fun parts and discard the rest.
cschiller 8 months ago

Thanks for sharing your experience! Completely agree - there's often a huge gap between the perception that testing is "solved" and the reality of manual QA still being necessary, even for core features. We recently had a call with one of the largest US mobile teams and were surprised to learn they're still doing extensive manual testing because some use cases remain uncovered by traditional tools. It's definitely not as "solved" as many might think.

ec109685 8 months ago

> In terms of trying the product out: since the service is resource-intensive (we provide hosted virtual/real phone instances), we don't currently have a playground available. However, you can see some examples here https://mobileboost.io/showcases and book a demo of GPT Driver testing your app through our website.

Have you considered an approach like what Anthropic is doing for their computer control where an agent runs on your own computer and controls a device simulator?

ec109685 8 months ago

Or even the actual device on the latest Mac OS.

codepathfinder 8 months ago

I've been a mobile developer for the past 10 years and my overall belief is that mobile app development has slower growth and companies with the mobile team are investing less on mobile Dev or testing+tooling+education. Do you think the market is still hot once it was to use your product?

cschiller 8 months ago

I would say that mobile apps are still the primary format for launching new consumer services, incl. new apps like ChatGPT and many others. However we’ve observed that teams are expected to do more with less—delivering high-quality products while ensuring compliance, often with the same or even smaller team sizes. This is why we focus on minimizing the engineering burden, particularly when it comes to repetitive tasks like regression testing, which can be especially painful to maintain in the mobile ecosystem due to use of third-party integrations (authentication, payments, etc.).
- codetrotter 8 months ago
  
  > mobile apps are still the primary format for launching new consumer services, incl. new apps like ChatGPT and many others
  OpenAI launched ChatGPT to the public on the web first and it took like, several months I think from I used their public web version until they had an official app for it in App Store. In the meantime, some third party apps popped up in App Store for using ChatGPT. I kept using the web version until the official app showed up. And probably having the mobile app in App Store has helped them grow to the number of users they have now. But IMO, ChatGPT as a product was not itself “launched” on App Store and they seemed to do very well in terms of adoption even when initially they only had the web version. The main point, that mobile apps are still desired, I agree with though.

codepathfinder 8 months ago

Is it possible to record the user screen and just generate a test case. I believe that's most efficient way IMO

cschiller 8 months ago

Yes, great point! We have an 'Assistant' feature where you can perform the flow on the device, and we automatically generate the test case as you navigate the app. As you mentioned, it’s a great starting point to quickly automate the functional flow. Afterwards, you can add more detailed assertions as needed. Technically we do this by using both the UI hierarchy from the app as well as vision models to generate the test prompt.
tomatohs 8 months ago

This comes up all the time. It seems like it would be possible, but imagine the case where you want to verify that a menu shows on hover. Was the hover on the menu intentional?
Another example, imagine an error box shows up. Was that correct or incorrect?
So you need to build a "meta" layer, which includes UI, to start marking up the video and end up in the same state.
Our approach has been to let the AI explore the app and come up with ideas. Less interaction from the user.
- codepathfinder 8 months ago
  
  My way of thinking while working of B2 enterprise app, sometimes users come up from weird scenarios in feature with X turn on, off with specific edition (country).
  Maybe the gpt can surf the user activity logs or crash logs and reproduce the scenarios as test case.
  Remember crashlytics ?

rvz 8 months ago

How does this compare to Robin by mobile.dev; the same guys that built Maestro? [0]

That has around 95% of what GPT Driver does and has the potential to do Web E2E testing.

[0] https://maestro.mobile.dev

cschiller 8 months ago

One of our customers recently compared GPTD with Maestro’s Robin (formerly App Quality CoPilot). Their mobile platform engineering manager highlighted three key reasons for choosing us: lack of frustration, ease of implementation, and reliability.
To be more concrete their words were: - “What you define, you can tweak, touch the detail, and customize, saving you time.” - “You don’t entirely rely on AI. You stay involved, avoiding misinterpretations by AI.” - “Flexibility to refine, by using templates and triggering partial tests, features that come from real-world experience. This speeds up the process significantly.”
Our understanding is that because we launched the first version of GPT Driver in April 2023, we’ve built it in an “AI-native” way, while other tools are simply adding AI-based features on top. We worked closely with leading mobile teams, including Duolingo, to ensure we stay as aligned as possible with real-world challenges.
While our focus is on mobile, GPT Driver also works effectively on web platforms.

mmaunder 8 months ago

Congrats! How has Anthropic's latest release supporting computer use affected your planning/thinking around this?

PS:If you had this for desktop we'd immediately become a customer.

cschiller 8 months ago

Thank you! Sonnet 3.5 is indeed a powerful model, and we're actually using it. However, even with the latest version, there are still some limitations affecting our specific use case. For instance, the model struggles to accurately recognize semi-overlaid areas, such as popups that block interactions, and it has trouble consistently detecting when UI elements are in a disabled state.
To address these issues, we enhance the models with our own custom logic and specialized models, which helps us achieve more reliable results.
Looking forward, we expect our QA Studio to become even more powerful as we integrate tools like test management, reporting, and infrastructure, especially as models improve. We're excited about the possibilities ahead!
- edelans 8 months ago
  
  Hi cschiller, I think we can help you with those issues at Waldo. I guess you are using Appium under the hood to get the UI hierarchy. At Waldo we developed a competing (proprietary) engine that solves a lot of Appium problems.
  We provide the most accurate view hierarchy for mobile apps (including React Native and Flutter apps), and we do it under 500ms for each view.
  I would love to get in touch: at e.de-lansalut [at] tricentis.com
  Here is an example of what we are able to do: https://share.waldo.com/7a45b5bd364edbf17c578070ce8bde220240...
tomatohs 8 months ago

We do AI E2E desktop, sent you an email.

drothlis 8 months ago

I noticed in your demo it generated the prompt "tap on the 'Log in' button located directly below the 'Facebook Password' field".

Does your model consistently get the positions right? (above, below, etc). Every time I play with ChatGPT, even GPT-4o, it can't do basic spatial reasoning. For example, here's a typical output (emphasis mine):

> If YouTube is to the upper *left* of ESPN, press "Up" once, then *"Right"* to move the focus.

(I test TV apps where the input is a remote control, rather than tapping directly on the UI elements.)

xyst 8 months ago

I remember testing out a similar product (mabl?). Ended up just using it to check for dead links. Using it for other use cases, I remember getting too many false positives for other use cases.

This was many years ago though (2018-2019?) before the genAI craze. Wonder if it has improved or not; or if this product is any better than its competitors.

pj_mukh 8 months ago

This is super cool. As a question, are the instructions re-generated from the instruction tokens everytime. While maybe costly, this feels like it would be robust to small changes in the app (and component name changes etc.). Does that make sense?

chrtng 8 months ago

Great question! Yes, GPT Driver runs according to the test prompt each time, which makes it resilient to small changes. To speed up execution, we also use a caching mechanism that runs quickly if nothing has changed, and only uses the models when needed.

tomatohs 8 months ago

Curious what happened to the other YC Mobile AI E2E company, CamelQA (YC W24). They pivoted to AI assistants. Could be good lessons there if you're not already in touch with them.

bluelightning2k 8 months ago

Genuinely curious, is the timing on this immediately after Claude computer use a coincidence? Or was that like the last missing piece, or a kind of threat which expedited things

cschiller 8 months ago

Good call! The timing was actually a coincidence, but not unexpected. OpenAI had already announced their plans to work on a desktop agent, so it was only a matter of time.
From our tests, even the latest model snapshots aren't yet reliable enough in positional accuracy. That's why we still rely on augmenting them with specialized object detection models. As foundational models continue to improve, we believe our QA suite - covering test case management, reporting, agent orchestration, and infrastructure - will become more relevant for the end user. Exciting times ahead!

doublerebel 8 months ago

How does this compare with Test.ai (now aka Testers.ai) who have offered basically this same service for the last 5 years?

tauntz 8 months ago

Totally offtopic but I looked at testers.ai and noticed the following from the terms of service:
> Individuals with the last name "Bach" or "Bolton" are prohibited from using, referencing, or commenting on this website or any of its content.
..and now I'm curious to know the backstory for this :)
- LeFever 8 months ago
  
  John Bolton and James Bach are the founders of RST [1] and generally big names in the “formal” software testing space. Presumably the testers.ai folks aren’t fans. :p
  [1] https://rapid-software-testing.com/authors/

archerx 8 months ago

Curious question, what ever happened with the OpenAI drama with trademarking “GPT”. I’m guessing they were not successful?

chrtng 8 months ago

From what we understand the term GPT was deemed too general for OpenAI to claim as its own.
https://www.theverge.com/2024/2/16/24075304/trademark-pto-op...
- archerx 8 months ago
  
  Thank you.

alexwordxxx 8 months ago

Hey https://google.com

101008 8 months ago

Still interesting how a lot of companies offer a LLM (non-deterministic) solution for deterministic problems.

chairhairair 8 months ago

This fundamental issue seems to be totally lost on the LLM-heads.
I do not want additional uncertainty deep in the development cycle.
I can tolerate the uncertainty while I'm writing. That's where there is a good fit for these fuzzy LLMs. Anything past the cutting room floor and you are injecting uncertainty where it isn't tolerable.
I definitely do not want additional uncertainty in production. That's where the "large action model" and "computer use" and "autonomous agent" cases totally fall apart.
It's a mindless extension something like: "this product good for writing... let's let it write to prod!"
- usernameis42 8 months ago
  
  Same goes with the real people, we all can do mistakes and AI Agents would get better over time, and will be ahead of many specialist pretty soon, but probably not perfect before AGI, just as we are.
  
  layer8 8 months ago
  
  One of the advantages of automation has traditionally been that it cuts out the indeterminacy and variability inherent in real people.
  
  conorjh 8 months ago
  
  your software has real people in it?
  
  SkyBelow 8 months ago
  
  Ideally it does. Users, super users, admins, etc. Though one might point out exactly how much effort we put into locking down what they can do. I think one might be able to expand this to build up a persona for how LLMs should interface with software in production, but too many applications give them about the same level of access as a developer coding straight into production. Then again, how many company leaders would approve of that as well if they thought it would get things done faster and at lower cost?
aksophist 8 months ago

It’s only deterministic for each version of the app. Versions change: UI elements move, change their title slightly. Irrelevant promo popups appear, etc. For a deterministic solution, someone has to go and update the tests to handle all of that. Good ‘accessibility hygiene’ can help, but many apps lack that.
And then there are truly dynamic apps like games or simulators. There may be no accessibility info to deterministically code to.
- usernameis42 8 months ago
  
  There is great approach based on test-id strategy, basically it's a requirement for the frontend teams to cover all interactive elements with test-id's.
  It allows to make tests less flaky and writing them is increasing dramatically, also works with mobile as well, usually elements for the main flows doesn't change that often, you'll still need to update them.
  I did stable mobile UI tests with this approach as well, worked well
- digging 8 months ago
  
  > Versions change: UI elements move, change their title slightly
  Not randomly, I'd hope. I think you may be misunderstanding what deterministic means - or I am.
  
  MattDaEskimo 8 months ago
  
  It's crazy to have people so out of their league try to argue against well established meanings.
  A testing framework requires determinism. If something changes the team should know and adjust.
  AI could play a bit in easing this adjustment and tests but it's not a driver in these tests.
  
  minhaz23 8 months ago
  
  Ever worked with extjs? :/
cschiller 8 months ago

I agree that it can seem counterintuitive at first to apply LLM solutions to testing. However, in end-to-end testing, we’ve found that introducing a level of flexibility can actually be beneficial.
Take, for example, scenarios involving social logins or payments where external webviews are opened. These often trigger cookie consent forms or other unexpected elements, which the app developer has limited control over. The complexity increases when these elements have unstable identifiers or frequently changing attributes. In such cases, even though the core functionality (e.g., logging in) works as expected, traditional test automation often fails, requiring constant maintenance.
The key, as to other comments, is ensuring the solution is good at distinguishing between meaningful test issues and non issues.
worldsayshi 8 months ago

I would assume that the test runner translates the natural language instruction into a deterministic selector and only re-does that translation when the selector fails. At least that's how I would try to implement it..
- tomatohs 8 months ago
  
  This is the right idea and how we do it at TestDriver.ai. The deterministic selector still has about 20% fuzz matching rate, and if it fails it trys to recover.
devjab 8 months ago

I think it’s less of an issue for e2e testing because e2e testing sucks. If teams did it well in general you would be completely correct, but in many places a LLM will be better even if it hallucinates. As such I think there will be a decent market for products like this, even if they aren’t may not even really be testing what you think they are testing. Simply because that may well be way better than the e2e testing many places already do.
In many cases you’re correct though. We have a few libraries where we won’t use Typescript because even though it might transpire 99% correctly, the fact that we have to check, is too much work for it to be worth our time in those cases. I think LLMs are similar, once in a while you’re not going to want them because checking their work takes too much resources, but for a lot of stuff you can use them. Especially if your e2e testing is really just pseudo jobbing because some middle manager wanted it, which it unfortunately is far too often. If you work in such a place you’re going to recommend the path of least resistance and if that’s LLM powered then it’s LLM powered.
On the less bleak and pessimistic side, if the LLM e2e output is good enough to be less resource consuming, even if you have to go over it, then it’s still a good business case.
batikha 8 months ago

I work in the field and built a tool that has way less flakiness than deterministic solutions. The issue is testing environments are always imperfect because (a) they are stateful and (b) there's always some randomness in actual production software. Some teams have very clean testing environment but most don't.
So being non-deterministic is actually an advantage, in practice.
joshuanapoli 8 months ago

I think that the hope/dream here is to make end-to-end tests less flakey. It would be great to have navigation and assertions commands that are robust against simple changes in the app that aren't relevant to the test case.
- chairhairair 8 months ago
  
  It's just a dream then.
  It's completely at-odds with the strengths of LLMs (fuzzy associations, rough summaries, naive co-thinking).
  
  yorwba 8 months ago
  
  Fuzzy associations seem relevant? Interact with the UI based on what it looks like, not the specific implementation details.
  
  chairhairair 8 months ago
  
  No. Both of the requirements "to interact" and "based on what it looks like" require unshakable foundations in reality - which current models clearly do not have.
  They will inevitably hallucinate interactions and observations and therefore decrease reliability. Worse, they will inject a pervasive sense of doubt into the reliability of any tests they interact with.
  
  tomatohs 8 months ago
  
  > unshakable foundations in reality
  Yes, you are correct that it entirely lays in the reputation of the AI.
  This discussion leads to interesting question, which is "what is quality?"
  Quality is determined by perception. If we can agree that an AI is acting like a user and it can use your website, we can assume that a user can use your website and therefor it is "quality".
  For more, read "Zen and the Art of Motorcycle Maintenance"
dartos 8 months ago

Tbf, users are also non-deterministic, so if LLM testing like this does catch on, it’ll be in the same realm as chaos testing.

aksophist 8 months ago

how do you evaluate your tool, and have you published your evaluation along with the metrics?

chrtng 8 months ago

Thank you for your question! While we haven't published a formal evaluation yet, it's something we are working toward. Currently, we rely mostly on human reviews to monitor and assess LLM outputs. We also maintain a golden test suite that is run against every release to ensure consistency and quality over time, using regex-based evaluations.
Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!
- aksophist 8 months ago
  
  What is a false positive rate? Is it when the agent falsely passes or falsely “finds a bug”? And regardless of which: why don’t you include the other as a key metric?
  I’m not aware of any evals or shared metrics. But measuring a testing agents performance seems pretty important.
  What is your tool’s FPR on your golden suite?

iknownthing 8 months ago

no logo?

lihua919 8 months ago

interesting