To me, it's the natural result of gaining popularity: enough people have started using these tools after the hype train rolled through and are now giving honest feedback. Real honest feedback can feel like a slap in the face when all you have had is overwhelming positive feedback from those aboard the hype train.
The writing has been on the wall with so-called hallucinations, where LLMs just make stuff up, that the hype was way out over its skis. The examples of lawyers being fined for presenting unchecked LLM output as fact will continue to take the shine off, and hopefully some of the raw gung-ho nature will slow down a bit.
I saw an article today from the BBC where travellers are using LLMs to plan their vacations and getting into trouble going places (sometimes dangerously remote ones) to visit landmarks that don't even exist:
I'm mildly bearish on the human capacity to learn from its mistakes and have a feeling in my gut that we've taken a massive step backwards as a civilization.
I could almost understand a lawyer working late the night before a brief is due and just running out of time to review the output of the LLM. But how do you not look up travel destinations before heading out? That's something I can't wrap my head around, no matter how hard I try to be kind and see the other side of it.
There are a lot of good AI code reviewers out there that learn project conventions from prior PRs and make rules from them. I've found they definitely save time and catch things I would have missed - things like cubic.dev or greptile etc etc. Especially helpful for running an open source project where code quality can have high variance and as a maintainer you may feel hesitant to be direct with someone -- the machine has no feelings so it is what it is :)
honestly? this but zoom out. machines are supposed to do the grunt work so that people can spend their time being creative and doing intangible, satisfying things but we seem to have built machines to make art, music and literature in order to free ourselves up to stack bricks and shovel manure.
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
Here's a technique that often works well for me: When you get unexpectedly poor results, ask the LLM what it thinks an effective prompt would look like, e.g. "How would you prompt Claude Code to create a plan to effectively review code for logic bugs, ignoring things like FIXME and TODO comments?"
I've found this a really useful strategy in many situations when working with LLMs. It seems odd that it works, since one would think its ability to give a good reply to such a question means it already "understands" your intent in the first place, but that's just projecting human ability onto LLMs. I would guess this technique is similar to how reasoning modes seem to improve output quality, though I may misunderstand how reasoning modes work.
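In case it helps, here is a minimal sketch of that two-step flow, assuming an OpenAI-compatible Python client; the model name and file path are placeholders, not recommendations:
```python
from openai import OpenAI  # any OpenAI-compatible client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# Step 1: ask the model to write the review prompt for you.
review_prompt = ask(
    "How would you prompt a coding assistant to review a C library for logic "
    "bugs, ignoring FIXME/TODO comments and style nits? "
    "Reply with only the prompt text."
)

# Step 2: run the generated prompt against the code you actually care about.
code = open("src/module.c").read()  # placeholder path
print(ask(review_prompt + "\n\n" + code))
```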
This is a great idea, and worth doing.
Another option in Claude Code that can be worth trying is planning mode, which you toggle with shift+tab. Have it plan out what it's going to do, and keep iterating until the plan seems sound.
Tbh, I wish I'd found planning mode earlier; it's been such a great help.
I've "worked" with Claude Code to find a long standing set of complex bugs over the last couple of days, and it can do so much more. It's come up with hypotheses, tested them, used gdb in batch mode when the hypotheses failed in order to trace what happened at the assembly level, and compared with the asm dump of the code in question.
It still needs guidance, but it quashed bugs yesterday that I've previously spent many days on without finding a solution for.
It can be tricky, but they can definitely be a significant aid for even very complex bugs.
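For anyone wondering what "gdb in batch mode" looks like when a script (or an agent) drives it, here's a rough sketch; the binary and input paths are made up, and the check is deliberately crude:
```python
import subprocess

BINARY = "./build/mylib_test"   # hypothetical debug build
REPRO = "crash_input.bin"       # hypothetical reproducer input

# --batch runs the -ex commands non-interactively and exits, which is what
# makes gdb usable from an agent loop or a CI script.
result = subprocess.run(
    [
        "gdb", "--batch",
        "-ex", f"run {REPRO}",    # run the program under the debugger
        "-ex", "bt",              # backtrace if it stopped on a signal
        "-ex", "info registers",  # register state for the report
        BINARY,
    ],
    capture_output=True,
    text=True,
    timeout=120,
)

# A crash shows up as "Program received signal ..." in gdb's output; if that
# string is absent, the hypothesis ("this input triggers the bug") failed.
reproduced = "Program received signal" in result.stdout
print("reproduced:", reproduced)
print(result.stdout[-2000:])
```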
Cursor BugBot is pretty good for this, we did the free trial and it was so popular with our devs that we ended up keeping it. Occasional false positives aside, it's very useful. It saves time for both the PR submitter and the reviewer.
I've had reasonably good success with asking Claude things like: "There's a bug somewhere that is causing slow response times on several endpoints, including <xyz>. Sometimes response times can get to several seconds long, and don't look correlated with CPU or memory usage. Database CPU and memory also don't seem to correlate. What is the issue?" I have to iterate a few times, but it's pointed me at a few really tricky issues that would probably have taken hours to find.
I found GPT-5 to be very much less sycophantic than other models when it comes to this stuff, so your mention of 'everything looking great yay good job high-five' surprises me. Using it via Codex CLI it often questions things. Gemini 2.5 Pro is also good on this.
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
I explicitly asked it to read all the code (within Cline) and it did so, gave me a dozen action items by the end of it, on a Django project. Most were a bit nitpicky, but two or three issues were more serious. I found it pretty useful!
In an application I'm working on, I use gpt-oss-20B. In a prompt I dump in the OWASP Top 10 web vulnerabilities, and a note that it should only comment on "definitive vulnerabilities". Has been pretty effective in finding vulnerabilities in the code I write (and it's one of the poorest-rated models if you look at some comments).
Where I still need to extend this is to introduce function calling into the flow: when "it has doubts" during reasoning would be the right time to call a tool that expands the context it's working with (pull in other files, etc.).
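For reference, a minimal sketch of that setup, assuming gpt-oss-20b is served behind an OpenAI-compatible endpoint (e.g. a local llama.cpp or vLLM server); the URL, model name, and file path are placeholders:
```python
from openai import OpenAI

# Assumption: a local OpenAI-compatible server is listening here.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

OWASP_TOP_10 = """\
A01 Broken Access Control
A02 Cryptographic Failures
A03 Injection
(paste the remaining OWASP Top 10 descriptions here)
"""

code = open("app/views.py").read()  # placeholder file under review

prompt = (
    "You are reviewing code for security issues.\n"
    f"Vulnerability classes to consider:\n{OWASP_TOP_10}\n"
    "Only report definitive vulnerabilities, each with file/line and a "
    "one-sentence justification. If nothing is definitive, say so.\n\n"
    f"Code:\n{code}"
)

resp = client.chat.completions.create(
    model="gpt-oss-20b",  # whatever name your server registers the model under
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp.choices[0].message.content)
```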
> (and it's one of the poorest-rated models if you look at some comments).
Yeah, don't listen to the "wisdom of the crowd" when it comes to LLM models; there seems to be a ton of FUD going on, especially on subreddits.
GPT-OSS was piled on for being dumb in the first week of release, yet none of the software properly supported it at launch. As soon as it was working properly in llama.cpp, it was clear how strong the model was, but at that point the popular sentiment seems to have spread and solidified.
I use Zed's "Ask" mode for this all the time. It's a read only mode where the LLM focuses on figuring out the codebase instead of modifying it. You can toggle it freely mid conversation.
i've had great success with both chatGPT and claude with the prompt "tell me how this sucks" or "why is this shit". being a bit more crass seems to bump it out of the sycophantic mode, and being more open-ended in the type of problems you want it to find seems to yield better results.
but i've been limiting it to a lot less than 20k LoC, i'm sticking with stuff i can just paste into the chat window.
Suggestion: run a regex to remove those FIXME comments first, then try the experiment again.
I often use Claude/GPT-5/etc to analyze existing repositories while deliberately omitting the tests and documentation folders because I don't want them to influence the answers I'm getting about the code - because if I'm asking a question it's likely the documentation has failed to answer it already!
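Putting the two suggestions above together (strip the FIXME/TODO comments, skip the tests and docs folders) before handing the tree to a model, a quick preprocessing sketch; it assumes a C project, and the directory names are just examples:
```python
import re
from pathlib import Path

SRC = Path("mylib")                        # placeholder project root
SKIP_DIRS = {"tests", "docs", "examples"}  # assumed layout, adjust as needed

# Only handles // line comments; good enough for a quick experiment.
MARKER = re.compile(r"//[^\n]*\b(?:FIXME|TODO)\b[^\n]*")

chunks = []
for path in sorted(SRC.rglob("*.[ch]")):
    if any(part in SKIP_DIRS for part in path.parts):
        continue
    cleaned = MARKER.sub("", path.read_text(errors="replace"))
    chunks.append(f"// ===== {path} =====\n{cleaned}")

Path("review_input.c").write_text("\n\n".join(chunks))
print(f"bundled {len(chunks)} files into review_input.c")
```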
Yeah this is really fair play to Daniel Stenberg that he still approached these AI generated bug reports with an open mind after all the problems he's had.
I think the big difference is that these aren't AI generated bug reports. They are bugs found with the assistance of AI tools that were then properly vetted and reported in a responsible way by a real person.
From what I understand, some of the bugs were in code the AI made up on the spot; other bug reports had example code that didn't even interact with curl. These things should be relatively easy to verify by a human: just do a text search in the curl source to see if the AI output matches anything.
Hard-to-compute, easy-to-verify things should be exactly where AI excels. So why do so many AI users insist on skipping the verify step?
The issue I keep seeing with curl and other projects is that people are using AI tools to generate bug reports and submitting them without understanding (that's the vetting) the report. Because it's so easy to do this and it takes time to filter out bug report slop from analyzed and verified reports, it's pissing people off. There's a significant asymmetry involved.
Until all AI used to generate security reports on other people's projects can do it with vanishingly little wasted time, it's pretty assholeish to do it without vetting.
Concerning HackerOne: "We now ban every reporter INSTANTLY who submits reports we deem AI slop. A threshold has been reached. We are effectively being DDoSed. If we could, we would charge them for this waste of our time"
Some of those bugs, like using the wrong printf-specifier for a size_t, would be flagged by the compiler with the right warning flags set. An AI oracle which tells me, "your project is missing these important bug-catching compiler warning flags," would be quite useful.
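A rough sketch of a dumb version of that oracle, assuming a Make- or CMake-style build where the flags appear somewhere in the build files; the flag list is opinionated rather than exhaustive:
```python
import sys
from pathlib import Path

# Flags that would have caught bugs like the size_t/printf-specifier mismatch.
RECOMMENDED = ["-Wall", "-Wextra", "-Wformat=2", "-Wconversion"]

# Assumed build files; adjust for your project layout.
build_text = ""
for name in ("Makefile", "CMakeLists.txt", "meson.build"):
    path = Path(name)
    if path.exists():
        build_text += path.read_text(errors="replace")

missing = [flag for flag in RECOMMENDED if flag not in build_text]
if missing:
    print("missing recommended warning flags:", ", ".join(missing))
    sys.exit(1)  # nonzero exit so this can gate CI
print("all recommended warning flags present")
```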
A few of these PRs are dependabot PRs which match on "sarif", I am guessing because the string shows up somewhere in the project's dependency list. "Joshua sarif data" returns a more specific set of closed PRs. https://github.com/curl/curl/pulls?q=is%3Apr+Joshua+sarif+da...
No, he's still dealing with a flood of crap, even in the last few weeks, from more modern models.
It's primarily from people just throwing source code at an LLM, asking it to find a vulnerability, and reporting the result as-is, without any actual understanding of whether it is or isn't a vulnerability.
The difference in this particular case is it's someone who is:
1) Using tools specifically designed for security audits and investigations.
2) Taking the time to read and understand the vulnerability reported, and verifying that it is actually a vulnerability before reporting.
Point 2 is the most significant bar that people are woefully failing to meet and wasting a terrific amount of his time. The one that got shared from a couple of weeks ago https://hackerone.com/reports/3340109 didn't even call curl. It was straight up hallucination.
I think it's more about how people are using it. An amateur who spams him with GPT-5-Codex produced bug reports is still a waste of his time. Here a professional ran the tools and then applied their own judgement before sending the results to the curl maintainers.
I keep irritating people with this observation but this was the status quo ante before AI, and at least an AI slop report shows clear intent; you can ban those submitters without even a glance at anything else they send.
The last time I was staffed on a project that had to do this, we were looking at many dozens per day, virtually all of them bogus, many attached to grifters hoping to jawbone the triage person into paying a nominal fee to get them to shut up. It would be weird if new tooling like LLMs didn't accelerate it, but that's all I'd expect it to do.
It's probably also the difference of idiots hoping to cash out/get credit for vulnerabilities by just throwing ChatGPT at the wall compared to this where it seems a somewhat seasoned researcher is trialing more customized tools.
It wasn't immediately obvious to me what the AI tools were? He mentioned that multiple other tools failed to find anything, so I'm very curious to hear what made this strategy so superior.
Love this take, actually; I've been working on this and published on it back in 2023/2024. Recently, inspired by the Claude Code & Cline agentic flow plus tool looping, I experimented with the same approach using tools like file_read and dir_list, throwing in a few SAST tools and security prompts against the WordPress plugin ecosystem (plugins with, say, 10k-100k active installations). I scanned around ~600 plugins and to my surprise it yielded ~45 critical and ~120 high-severity issues, with about 20% accounted for as non-reachable vulnerabilities. I spent around $6 and ~40 million tokens with the grok-4 fast reasoning model and the results were impressive; I gave claude-sonnet a try but was significantly rate-limited despite having $50 in research credits from Anthropic.
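For anyone curious what that kind of tool loop looks like, a stripped-down sketch using the OpenAI-style function-calling API; the model name, root path, and prompt are placeholders, and a real scanner would add SAST output and much stricter sandboxing:
```python
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
ROOT = Path("plugin-under-review")  # placeholder target directory

TOOLS = [
    {"type": "function", "function": {
        "name": "dir_list",
        "description": "List files under a relative directory.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "file_read",
        "description": "Read a file relative to the project root.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
]

def run_tool(name, args):
    target = (ROOT / args["path"]).resolve()
    if name == "dir_list":
        return "\n".join(p.name for p in sorted(target.iterdir()))
    if name == "file_read":
        return target.read_text(errors="replace")[:20000]
    return f"unknown tool {name}"

messages = [{"role": "user", "content":
             "Audit this WordPress plugin for security issues. "
             "Use the tools to explore the code, then report findings."}]

for _ in range(20):  # cap the agent loop
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=TOOLS)  # placeholder model
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)  # final report
        break
    messages.append(msg)
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": result})
```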
Now that is how LLM assistance for coding can be useful. Would be interesting to know which set of tools was used exactly. How might one reproduce this kind of assistance for other code bases?
When I read “we consider nread == 0 as reading a byte and we shouldn’t” I immediately think of all the things that look like bugs but are there because some critical piece of infrastructure relies on that behavior. AI isn’t going to know about that unless you tell it, and the problem is that there’s plenty of folks who have job security precisely because they don’t write that down.
So he likes ZeroPath. Does that get us any further? No, the regular subscription costs $200 and the free one-time version looks extremely limited and requires yet another login.
Also of course, all low hanging fruit that these tools detect will be found quickly in open source (provided that someone can afford a subscription), similar to the fact that oss-fuzz has diminishing returns.
I am a bit worried about the abuse of those tools. I wonder if the tools have policies and mechanisms around that (no clue how, but something like forced disclosure if they detect that scanned code is OSS, or free usage for OSS teams). Otherwise they seem like great tools for accelerating the discovery of 0-days. Maybe even worse, when building supply-chain attacks one could relatively easily test against the detection mechanism before contributing malicious code (I wonder if they could detect and block malicious use/probing). However, I guess long term it will make our software more secure. I guess it is always an arms race.
More interesting to me is how to stop these bugs from occurring in the first place. The example given in the thread is the kind of bug that C (and mutation) excels at creating.
I work at an ML security R&D startup called Pwno. We've been working specifically on putting LLMs into memory security for the past year; we've spoken at Black Hat, and we worked with GGML (llama.cpp) on providing a continuous memory-security solution driven by multi-agent LLMs.
Something we learned along the way is that in this specific field of security, what we call low-level security (memory safety, etc.), validation and debugging have become more important than vulnerability discovery itself, because of hallucinations.
From our trial and error (trying validator architectures and security research methodologies, e.g. reverse taint propagation), it seems like the only way out of this problem is to design an LLM-native interactive environment, where the LLMs validate their own findings through interactions with the environment or the component. The reason web-security-oriented companies like XBOW are doing very well is how easy it is to validate. I saw XBOW's LLM trace at Black Hat this year; all the tools they used, and pretty much all they need, is curl. For web security, the abstraction of the backend is limited to a level where you send a request and it either works or you easily know why it didn't (XSS, SQLi, IDOR). But for low-level security (memory safety), the entropy of dealing with UAFs and OOBs is at another level. There are certain things you just can't tell by looking at the source; they need you to look at a particular program state (heap allocation, which depends on the glibc version; stack structure; register states...), and this ReAct-ing process with debuggers to construct a PoC/exploit is what has been a pain in the ass. (LLMs and tool calling are specifically bad at these strategic, stateful tasks; see DeepMind's Tree-of-Thoughts paper discussing this issue.) The way I've seen Google Project Zero & DeepMind's Big Sleep mitigate this is through GDB scripts, but that's limited to a certain complexity of program state.
When I was working on our integration with GGML, spending around two weeks on context and tool engineering could already lead us to very impressive findings (OOBs); but that problem of hallucination scales more and more with the number of "runs" of our agentic framework. Because we're monitoring llama.cpp's main branch commits, every commit triggers an internal multi-agent run on our end, and each usually takes around 1 hour and hundreds of agent recursions. Sometimes at the end of the day we would have 30 really, really convincing and in-depth reports on OOBs and UAFs. But because of how costly it is to just validate one (from understanding to debugging to PoC writing...) and because of hallucinations (and each run is really expensive), we had to stop the project for a bit and focus on solving the agentic validation problem first.
I think when the environment gets more and more complex, interactions with the environment, and learning from these interactions, will matter more and more.
> I think when the environment gets more and more complex, interactions with the environment, and learning from these interactions, will matter more and more
Thanks for sharing your experience! It correlates with this recent interview with Sutton [1]: real intelligence is learning from feedback from a complex and ever-changing environment. What an LLM does is train on a snapshot of what has been said about that environment and operate only on that snapshot.
So whereas previously, repo owners were getting flooded with AI-generated PRs that were complete slop, now they're going to be flooded with PRs that contain actual bugfixes. IDK which problem is worse!
"AI" tools can be very powerful once you approach them as what they are: very good pattern matchers and generators. This ability far surpasses anything a human could do. Detecting potential issues in software is a great application of the technology.
The key word is "potential", though. They're still wildly unpredictable and unreliable, which is why an expert human is required to validate their output.
The big problem is the people overhyping the technology, selling it as "AI", and the millions deluded by the marketing. Amidst the false advertising, uncertainty, and confusion, people are forced to speculate about the positive and negative impacts, with wild claims at both extremes. As usual, the reality is somewhere in the middle.
There are some good SAST scanners and many bad commercial scanners.
Many people advocate for the use of AI technology for SAST testing. There are even people and companies that deliver SAST scanners based on AI technology. However: Most are just far from good enough.
In the best case scenario, you’ll only be disappointed. But the risk of a false sense of security is enormous.
It's weird that the discussion has collapsed down to "autopilots" vs. "abstention". I'm thrilled to be converging on an understanding that it is instead "people who understand what they're trying to do" vs. "vibe coders".
In defense of the cynics, I get the impression we're in a situation where (a) there's so much company marketing hype in such a competitive market that it begs cynicism, and (b) we're constantly learning the boundary of what trained LLMs can and can't actually do, as well as unusual emergent workflows that really do make a difference.
Something sounds fishy in this. Have these bugs really been found by AI? (I don't think they were.)
If you read Corgea's (one of the products used) "whitepaper", it seems that AI is not the main show:
> BLAST addresses this problem by using its AI engine to filter out irrelevant findings based on the context of the application.
It seems that AI is being used to post-process the findings of traditional analyzers. It reduces the amount of false positives, increasing the yield quality of the more traditional analyzers that were actually used in the scan.
Zeropath seems to use similar wording like "AI-Enabled Triage" and expressions like "combining Large Language Models with AST analysis". It also highlights that it achieves less false positives.
I would expect someone who developed this kind of thing to set up a feedback loop in which the AI output is somehow used to improve the static analysis tool (writing new rules, tweaking existing ones, ...). It seems like the logical next step. This might be going on in these products as well (lots of in-house rule extensions for more traditional static analysis tools, written or discovered with help of AI, hence the "build with AI" headline in some of them).
Don't get me wrong, this is cool. Getting an AI to triage a verbose static analysis report makes sense. However, it does not mean that AI found the bugs. In this model, the capabilities of finding relevant stuff are still capped at the static analyzer tools.
I wonder if we need to pay for it. I mean, now that I know it is possible (at least in my head), it seems tempting to get open source tools, set them to max verbosity, and find which prompts they are using on (likely vanilla) coding models to get them to triage the stuff.
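As a starting point for that do-it-yourself version, a rough sketch: run an open-source analyzer at full verbosity and let a vanilla model triage the findings. This is just a guess at the shape of it, not a claim about how the commercial products work; cppcheck, the model name, and the prompt are arbitrary choices:
```python
import subprocess
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible endpoint

# Run an open-source analyzer as loudly as it will go (cppcheck as an example;
# it prints findings on stderr).
proc = subprocess.run(
    ["cppcheck", "--enable=all", "--inconclusive", "src/"],
    capture_output=True, text=True,
)
findings = (proc.stdout + proc.stderr).strip().splitlines()

# Triage in small batches so each finding gets real attention.
for i in range(0, len(findings), 20):
    batch = "\n".join(findings[i:i + 20])
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content":
                   "These are raw static-analysis findings for a C project. "
                   "For each, answer REAL, FALSE POSITIVE, or NEEDS CONTEXT, "
                   "with one sentence of reasoning:\n\n" + batch}],
    )
    print(resp.choices[0].message.content)
```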
Hi there, I'm Ahmad, CEO at Corgea, and the author of the white paper. We do actually use LLMs to find the vulnerabilities AND triage findings. For the majority of our scanning, we don't use traditional static analysis. At the core of our engine is the LLM reading the lines of code to find CWEs in them.
I don't think many people here are interested in how something works. They want to see the headline "Curl developer finally convinced by AI!" and otherwise drop anecdotes about Claude Code etc.
All comments that want to know more are at the bottom.
That doesn't really convey that these bug reports were for real issues and were greatly appreciated, unlike the slop that Daniel is known for complaining about, which I think is the real story here.
I will spend longer considering my title next time.
Hi, I'm Etienne, one of the cofounders @ ZeroPath.
We do not use traditional static analyzers; our engine was built from the ground up to use LLMs as a primitive. The issues ZeroPath identified in Joshua's post were indeed surfaced and triaged by AI.
Joshua describes it as follows: "ZeroPath takes these rules, and applies (or at least the debug output indicates as such) the rules to every .. function in the codebase. It then uses LLM’s ability to reason about whether the issue is real or not."
Would you say that is a fair assessment of the LLM role in the solution?
I think you owe me a better apology than that. I disagree with your evaluation of the response to that post, strongly, but more importantly I didn't bring it up in the first place, and the claim you made about it (intentionally or not; awfully weird to land at my company's name) was personal and scurrilous.
I emailed you about two hours ago based on the last email exchange we had, about 7 years ago. I included the email addresses I had for you at the time, which included Matasano. Presumably this is why someone mentioned it in a comment to you.
refulgentis mentioned your fly.io article as evidence that the "AI" science is settled.
I remembered quite a bit of a blowback and unfortunately saw the Mataroa article last week. It contains a reference to Matasano; I didn't look closely at the URL and mentally classified Matasano as a classic informal hacker company with extreme free speech blogs, where you could criticize the founder with a typical hyperbolic article.
I assumed based on your post and the post you replied to that it is literally impossible to prove any AI is involved, and I trust both of you on that.
Given that, I'm afraid all the interlocution I have to offer is the thing you commented on, the mind of a downvoter, i.e. positing that every downvoter must have details, including details we[1] can't find.
Past that, I'm afraid to admit I am having difficulty understanding how the slides are related, and I don't even know what Matasano is -- is that who owns fly.io? I thought they were "indie" -- I'm embarrassed to admit I thought Monsanto at first. I do know how much I've used AI to code, so I can vouch for tptacek's post.
[1] royal we, i.e. I trust you and OP so completely on what is findable vs. not findable that I trust we can't establish with 100% certainty that any sort of AI-based thingy was used at all. To be clear, too, 100% is always too high a bar; I mean to say we can't even establish it at 90% confidence. Even 1% confidence. If all we have is their word to go on, it's impossible.
Do you believe AI is at the core of these security analyzers? If so, why the personal story blogpost? You can just explain to me in technical terms why that is so.
Claiming to work for Google does not work as an authority card for me, you still have to deliver a solid argument.
Look, AI is great for many things, but to me these products sound like chocolate that is actually just 1% real chocolate. Delicious, but 99% not chocolate.
I had a conversation in a chat room yesterday about AI-assisted math tutoring where a skeptic said that the ability of GPT5 to effortlessly solve quotient differentials or partial fraction decomposition or rational inequalities wasn't indicative of LLM improvements, but rather just represented the LLMs driving CAS tools and thus didn't count.
As a math student, I can't possibly care less about that distinction; either way, I paste in a worked problem solution and ask for a critique, and either way I get a valid output like "no dummy multiply cos into the tan before differentiating rather than using the product rule". Prior to LLMs, there was no tool that had that UX.
In the same way: LLMs are probably mostly not off the top of their "heads" (giant stacks of weight matrices) axiomatically deriving vulnerabilities, but rather just doing a very thorough job of applying existing program analysis tools, assembling and parallel-evaluating large numbers of hypotheses, and then filtering them out. My interlocutor in the math discussion would say that's just tool calls, and doesn't count. But if you're a vulnerability researcher, it doesn't matter: that's a DX that didn't exist last year.
As anyone who has ever been staffed on a project triaging SAST tool outputs before would attest: it extremely didn't exist.
I don't care if it counts as true LLM brilliance or not.
If it doesn't matter if it's AI or not, just that they're good tools, why even advertise the AI keyword all over it? Just say "best in class security analysis toolset". It's proprietary anyway, you can't know how much of it is actually AI (unless you reproduce its results, which is the core argument you missed here).
Because that's not accurate. The underlying program analysis tooling already existed, but the LLM glue logic is what makes it effective. You could as a human replicate it with those preexisting tools, but you won't, in the same way you wouldn't model your whole project in a prover and solve it with formal methods; it was possible, but that possibility isn't meaningful.
Allegedly (it's proprietary, we don't know). Maybe it's the triage approach, and there are undiscovered non-LLM triage techniques that would surpass it.
If I were to guess, the AI naming was for marketing purposes (to ride on the hype train), not because it accurately describes the product (even though it might accurately describe the product).
Most importantly, how is it that it's so effective? I want to know. Perhaps you and some others just want to celebrate an LLM win. That's fine, but I want to know how it works.
I'd say my guess is fair, and it's a viable approach for someone trying to create a similar tool. If I were to try and replicate this, I would definitely start with an existing static analyzer. For example, I would do it with phpstan (just because I know it a little bit better).
I would extend it so it becomes more verbose than it currently is (something humans don't want, but machines might benefit from). Perhaps I would introduce some rules that make it report things that aren't even issues, just information I can gauge from the AST (like: does this controller have a middleware? If so, emit something in the report). Then I would attempt to use that enriched report as the input for a coding model, and experiment with different prompts and different granularity units on the input.
It sounds reasonable, doesn't it? I could describe that approach as "LLM right in the core of the solution", but I know by heart that in that arrangement, the quality of the final product is still capped by the static analyzer and what it can detect and describe. It doesn't matter that the LLM is what makes it better. My wheat farm is still about wheat, not the fancy sieve I recently bought to separate it from the chaff.
I don't understand why this sounds so offensive to some of the readers here. I was just thinking "how would I use AI in such a product" and the only way I can come up with is this way in which is not the main show.
I mean, my experience with LLMs also confirms that. Prompting "find me bugs" or stuff like that almost never works. It works better if I get an error and ask it to explain it to me, giving the application context. The static analyzer is there to give this initial kick, to create the nucleation sites upon which the LLM will crystallize answers.
This sounds like the most viable, easy-to-make product that can find bugs with LLMs. It's only offensive if that's actually what these products are doing, it's not supposed to be known, and I struck a nerve or something.
Why does it work well now, after 20 years of this kind of tooling being next to useless? Do you work in this space? How much about how bad SAST tools are do I need to explain?
Maybe it's unleveraged potential, I don't know. I am also not entirely convinced that they're next to useless. Sanitizers, for example, are excellent for mitigating all sorts of security issues. Those are traditional static analysis tools (that, by the way, fit the arrangement I described of using these reports as nucleation sites for LLM triage).
I did walk you through how I would do it. Would you change your response if I said I work in this space? It seems like an irrelevant point in this discussion.
You don't need to explain anything. This is on a flagged thread, obscure and unseen. I'm actually surprised by how invested you are in this apparently irrelevant matter.
I'm a software security person! This is not irrelevant to me.
In summary: the existing program analysis tooling in this space has been ineffective for decades, despite hundreds of millions of dollars invested in the tooling. If it is effective now, that strongly indicates that the LLM component of it isn't irrelevant; nothing else in the field has changed.
Note that everybody in this story concedes the LLM involvement. The only person who isn't is you, and you're not actually involved. (I'm not either, but I'm agreeing with --- checks again --- everybody involved in the story).
I concede the LLM involvement. But I want to be more specific in the description of the role it plays in the solution.
If it is a central role, then there is nothing to lose from describing it better. That's why this feels so strange. You disagree with me, but you don't present an arrangement in which the LLM plays a role different from what I described. In fact, no one here did. It's like you're not disagreeing with me, but trying to make me stop describing how to achieve a similar-quality system out of free pieces.
I don't mean to aggravate you. I do mean to offer some insight in the mindset of the people the person I was replying to was puzzled by. I'm calmed by the fact that if we're both here, we both value one of the HN sayings I'm very fond of: come with curiosity.
> Do you believe AI is at the core of these security analyzers?
Yes.
> If so, why the personal story blogpost?
When I am feeling intensely, and people respond to me as I'm about to respond to you, I usually get very frustrated. Apologies in advance if you suffer from that same part of being human, I don't mean anything about you or your positions by this:
I don't know what you mean.
Thus, I may be answering wrong with the following: the person I replied to indicated all downvoters must know every detail, and as the, well let's use your phrasing, personal story blogpost, I just assume you mean my comment, leads with: "I believe there's a little more going on than everyone knowing every detail already, or presumably, being wrong to downvote.
Full case study of a downvoter at work:"
> Claiming to work for Google
I claimed the opposite! I'm a jobless hack :) (quit in 2023)
> does not work as an authority card for me,
Looking at it, the thing isn't "I worked at Google therefore AI good" it's "I worked at Google and on a specific well-known project, the company's design language, used AI pre-ChatGPT to great effect. It's unclear to me why this use case would be unbelievable years later"
> you still have to deliver a solid argument.
What are we arguing? :) (I'm serious! Apologies, again, if it comes off as flippant. If you mean I need to deliver a solid argument the tools must have AI, I assume if said details were available you would have found them, you seem well-considered and curious. I meant to explain the mind of a downvoter who yet cannot recite details as yet unavailable to the public to the person I replied to, not to verify the workflow step by step.)
The argument is that these high-quality security analyzers seem to use AI as a triage mechanism, and the quality of the analysis is still capped by the quality of the static analysis tool.
One of the tools provides a whitepaper, which you can read here:
It seems to explicitly put AI in this coadjuvant role, contradicting the HN title "found by AI".
Neither I nor the other commenter actually dismissed AI as useless. I can't speak for him, but to me it seems actually useful in this arrangement. However, not "I'll pay for a subscription" levels of useful.
Since it's just triage, it seems that trying to reproduce the idea using free tools might be worth a shot (and that's the idea of finding out where the AI component lies in the system). What I said is very doable (plug the output of traditional tools into vanilla coding LLMs prompts). It also looks a lot like this Corgea schematic:
Instead have your AI look for problems - then have it create deterministic tools and let tools catch the issues in a repeatable, understandable, auditable way. Have it build short, easy to understand scripts you can commit to your repo, with files and line numbers and zero/nonzero exit codes.
It’s that key step of transforming AI insights into detection tools that transforms your outcomes from probabilistic to deterministic. Ask it to optimize the tools so they run in seconds. You can leave them in the codebase forever as linters, integrate them in your CI, and never have that same bug again.
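A tiny example of what one of those generated checkers might look like, assuming the model once flagged something like "a zero-byte read is being counted as progress" and you want that class of bug to stay fixed; the regex and paths are purely illustrative:
```python
#!/usr/bin/env python3
"""Fail CI if a previously-found bug pattern reappears (illustrative only)."""
import re
import sys
from pathlib import Path

# Illustrative pattern for the bug class described above; in practice the AI
# would propose something tailored to the actual fix.
PATTERN = re.compile(r"nread\s*>=\s*0")

failures = []
for path in Path("src").rglob("*.c"):  # assumed source layout
    for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
        if PATTERN.search(line):
            failures.append(f"{path}:{lineno}: {line.strip()}")

if failures:
    print("suspicious zero-byte-read handling:")
    print("\n".join(failures))
    sys.exit(1)  # nonzero exit code -> CI fails
sys.exit(0)
```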
I have to admit, I expected a couple of "You should rewrite it in Rust" hipster posts by now... Maybe they caught on that those types of posts were not having the effect they thought they would? I kid, I kid... mostly
The Canadian government should probably get a bug bounty program, so I can present some of the findings to them that I found using AI and tested or mapped out on some of their public-facing apps on the App Store/Play Store.
This is exactly what I'd want from an 'AI coding companion'.
Don't write or fix the code for me (thanks but I can manage that on my own with much less hassle), but instead tell me which places in the code look suspicious and where I need to have a closer look.
When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
ChatGPT is even less useful since it basically just spends a lot of time telling me 'everything looking great yay good job high-five!'.
So far, traditional static code analysis has been much more helpful in finding actual bugs, but static analysis being clean doesn't mean there are no logic bugs, and this is exactly where LLMs should be able to shine.
If getting more useful potential-bugs-information from LLMs requires an extensively customized setup then the whole idea is getting much less useful - it's a similar situation to how static code analysis isn't used if it requires extensive setup or manual build-system integration instead of just being a button or menu item in the IDE or enabled by default for each build.
This is a point I see discussed surprisingly little. Given that many (most?) programmers like designing and writing code (excluding boilerplate), and not particularly enjoy reviewing code, it certainly feels backwards to make the AI write the code and relegate the programmer to reviewing it. (I know, of course, that the whole thing is being sold to stakeholders as "LoC machine goes brrrr" – code review? what's that?)
Creativity is fun. AIs automate that away. I want an AI that can do my laundry, fold it, and put it away. I don't need an AI to write code for me. I don't mind AI code review, it sometimes has a valid suggestion, and it's easy enough to ignore most of the rest of the time.
I was thinking this again just yesterday. Do my laundry correctly and get it put away. Organize my storage. Clean the bathroom. Do the dishes. Catalog my pantry, give me recipes, and keep it correctly stocked. Maybe I'm just a simple creature but like, these are the obvious problems in my life I'll pay to have go away so why are we taking away the fun stuff instead?
You can already pay to have all those issues go away.
No you can't. You can only pay to transfer them to someone else on top of their own.
It's fundamentally different from how a machine or some code makes a task actually go away or at least become smaller.
There are already cheap, domestic robots for cleaning dishes, cleaning the floor, cleaning clothes, making coffee, heating and cooling food, turning screws, drilling holes and so on. All those robots represent a greater than 90 percent (and sometimes a greater than 99 percent) savings in time relative to doing the same tasks manually. You still have to move the objects they operate on around within your house but that's mostly the only part of the task you have to do.
I think the 10 percent is more work than you give it credit for.
Unfortunately many things aren't dishwasher safe, some things don't fit in the dishwasher, and often certain types of food are not properly washed off in the dishwasher.
> All those robots represent a greater than 90 percent savings in time relative to doing the same tasks manually.
Lol, nope.
Dishwashers solve at best some 50% of the hassle, the easy-to-wash table dishes, while being completely unable to clean oven ones. Floor cleaners solve a 5-minute task in a couple-of-days-long house upkeep. Coffee makers... don't really automate anything, why did you list them here? And there's no automation available for heating and cooling food. And the part about drilling and turning screws also isn't automation at all.
The only thing on your list that is close to solved is clothes cleaning. And there's the entire ironing thing that is incredibly resistant to solving. But yeah, that puts it way beyond 90% solved.
As someone who played the roomba game quite a bit - you transfer the problem of vacuuming to the problem of very frequent robot cleaning. I've saved more time switching to a high powered central vac than I ever did with constantly cleaning the robot because I had the audacity to own a fluffy dog.
Also people claiming cleaning isn't "creative" or "fun". Steam has a whole genre of games simulating cleaning stuff because the act of cleaning is extremely fun and creative to a lot of people: https://store.steampowered.com/app/246900/Viscera_Cleanup_De... being a great example
Actually I do NOT want my robot to do my laundry for me! And because I'm garbage at painting and comparatively better at laundry, I DO want it to paint for me.
$500 robots (e.g. Mova) are very impressive nowadays. You can even integrate plumbing into them, so you're really not doing much.
Washer-dryer combos are good too. Folding laundry is the biggest pain of my life so far. Also unloading the dishwasher, but the fix is easy here: get a double one.
> Creativity is fun. AIs automate that away.
I've been developing with LLMs at my side for months, about a year now, and it feels like it's allowing me to be more creative, not less. But I'm not doing any "vibe-coding", maybe that's why?
The creative part (for me) is coming up with the actual design of the software, how it all fits together, and what it should do and how, and I get to do that more than ever now.
Same. I think there are two types of devs. Those that love designing the individual building blocks and those that wanna stack the blocks together to make something new.
At this point AI is best at the first thing and less good at the second. I like stacking blocks together. If I build a beautiful UI I don't enjoy writing the individual css code for every button but rather composing the big picture.
Not saying either is better or worse. But I can imagine that the people who love to build the individual blocks like AI less, because it takes away something they enjoy. For me it just takes away a step I had to do to get to the composing of the big picture.
The thing is, I love doing both. But there's an actual rush of enjoyment when I finally figure out one of the tenets of a system. It's like solving a puzzle for me.
After that, it all becomes routine work, as easy as drinking water. You explain the problem and I can quickly find the solution. Using AI at this point would be like herding cats. I already know what code to write; having a handful suggested to me is distracting. Like feeling a tune, and someone playing a melody other than the one you know.
Yea, I guess some people enjoy both.
> For me it just takes away a step I had to do to get to the composing of the big picture.
You can't successfully build the big picture on the sort of rotten foundation that AI produces though
I don't care how much you enjoy assembling building blocks over building the low level stuff, if you offload part of the building onto AI you're building garbage
I'm still faster than the cheap bots.
The creative part for me includes both the implementation and the design, because the implementation also matters. The bots get in the way.
Maybe I would be faster if I paid for Claude Code. It's too expensive to evaluate.
If you like your expensive AI autocomplete, fine. But I have not seen any demonstrable and maintainable productivity gains from it, and I find understanding my whole implementation faster, more fun, and that it produces better software.
Maybe that will change, but people told me three years ago that we would be at the point today where I could not outdo the bot;
with all due respect, I am John Henry and I am still swinging my hammer. The steam pile driving machine is still too unpredictable!
> The creative part for me includes both the implementation and the design
The implementations LLMs end up writing are predictable, because my design locks down what they need to do. I basically know exactly what they'll end up doing, and how, but they type faster than I do; that's why I hand it off while I go on to think about the next design iteration.
I currently send every single prompt to Claude, Codex, Qwen and Gemini (looks something like this: https://i.imgur.com/YewIjGu.png), and while they all succeed most of the time, doing it like this makes it clear that they're following what I imagined they'd do during the design phase, as they all end up with more or less the same solutions.
> If you like your expensive AI autocomplete
I don't know if you mean that in jest, but what I'm doing isn't "expensive AI autocomplete". I come up with what has to be done and the design for achieving it, then hand off the work. I don't actually write much code at all, just small adjustments when needed.
> and I find understanding my whole implementation faster
Yeah, I guess that's the difference between "vibe-coding" and what I (and others) are doing, as we're not giving up any understanding or control of the architecture and design, but instead focus mostly on those two things while handing off other work.
I agree, and my flow is similar
I've made great use of AI by keeping my boundaries clear and my requirements tight, and by rigorously ensuring I understand _every_ line of code I commit
I believe software development will transition to a role closer to director/reviewer/editor, where knowledge of programming paradigms is just as important as now, but also where _communication_ skills separate the good devs from the _great_ devs
The difference between a 1x dev and a 10x dev in future will be that the latter knows how to clearly and concisely describe a problem or a requirement to laymen, peers, and LLMs alike. Something I've seen many devs struggle with today (myself included)
> but also where _communication_ skills separate the good devs from the _great_ devs
I think it has been that way since forever. If you look at all the great projects, it's rare for the guy at the helm to not be a good communicator. And at a corporate job, you spend a good chunk of the year writing stuff to people. Even with the code you're writing, you think about the next person who's going to read it.
I think that has always been the difference. First principles.
Claude code is too expensive to evaluate?
It's 20 bucks a month
[flagged]
Exactly. I loved doing novel implementations or abstractions… and the AI excels at the part where it modifies it slightly for different contexts… aka the boring stuff.
But this is how you learn, how you find better ways, by grinding.
Getting wild ideas badly implemented on a silver plate is a slot machine, it leads nowhere but in circles.
By grinding what though? I don't wanna grind "Entering characters with my fingers", I wanna grind "Does this design work for getting X to work as I want", which is exactly the sort of things LLMs help me move faster on.
And yes, if you're just using it as a slot machine, I understand it doesn't feel useful. But I don't think that's how most people use it, at least that's not how I use it.
I've done 20 years of grinding, thanks. I'm happy to work at a higher level of abstraction now.
depends on what abstraction level you enjoy being creative at.
Some people like creative coding, others like being creative with apps and features without much care to how it's implemented under the hood.
I like both, but IMO there is a much larger crowd for higher level creativity, and in those cases AIs don't automate the creativity away, they enable it!
> Creativity is fun. AIs automate that away.
This is the complete opposite of my experiences with using AI Coding tools heavily
Is AI automating creativity away if you come up with an idea and have it actually implement it?
Yes, because ideas are not worth much if anything. If you have an idea of a book, or a painting, and have someone else implement it, you have not done creative work. Literally, you have not created the work, brought it to existence. The creator has done the creativity.
I guess that depends on how much oversight you engage in. A lot of famous masters would oversee apprentices and step in for difficult tasks and to finish the work, yet we still attribute the work to those masters. Most of the work in science is done by graduate students, but we still attribute the lion's share of the credit to PIs.
If you write a screenplay (the idea), and direct actors to act it out according to your vision (the implementation), did you _create_ the film?
I think my answer would be "Does it matter?"
If it brings joy to you or others, who cares about the semantics of creation
A screenplay isn't just an idea; it's an implementation of an idea.
Nope. Films acknowledge a difference between the writers, the book that inspired it, and the director.
You kind of missed the “and direct actors to play it out” part. If you did all of that, that’s essentially the creator.
... What was the last word in my comment?
Most software is developer tools and frameworks to manage electrical state in machines.
Such state management messes use up a lot of resources to copy around.
As an EE working in QA on future chips, with the goal of compressing away developer syntax art so that only the least amount of state management needed for maximum utility is preserved: sorry, self-selecting biology of SWEs, but also not sorry.
Above all this is capitalism not honorific obligationism. If hardware engineers can claim more of the tech economy for our shareholders, we must.
There are plenty of other creative outlets that are much less resource intensive. Rich first-world programmers are a small subset of the population and can branch out and explore life, rather than believing everyone else has an obligation to conserve the personal story of a generation of future dead.
To me, it's the natural result of gaining popularity: enough people have started using these tools after the hype train rolled through and are now giving honest feedback. Real honest feedback can feel like a slap in the face when all you have had is overwhelming positive feedback from those aboard the hype train.
The writing has been on the wall with so-called hallucinations, where LLMs just make stuff up, that the hype was way out over its skis. The examples of lawyers being fined for unchecked LLM outputs presented as fact will continue to take the shine off, and hopefully some of the raw gung-ho nature will slow down a bit.
I saw an article today from the BBC where travellers are using LLMs to plan their vacations and getting into trouble going places (sometimes dangerously remote ones) to visit landmarks that don't even exist:
https://www.bbc.com/travel/article/20250926-the-perils-of-le...
I'm mildly bearish on the human capacity to learn from its mistakes and have a feeling in my gut that we've taken a massive step backwards as a civilization.
I could almost understand a lawyer working late the night before a brief is due and just running out of time to review the output of the LLM. But how do you not look up travel destinations before heading out? That's just something I can't wrap my head around, no matter how much I try to be kind and see the other side of it.
> How do you not look up travel destinations before heading out?
From the layman's perspective, they did. That's the whole problem.
because people have had their entire lives to get used to the idea that computers are reliable (sans Microsoft software)
no-one wants stochastic computers
People have blindly followed GPS routes into lakes and rivers, but that should hardly be a point against GPS
With 8 billion people on the planet, you could write a "man bites dog" story about any invention popular enough
"You never read about a plane that did not crash"
There are a lot of good AI code reviewers out there that learn project conventions from prior PRs and make rules from them. I've found they definitely save time and catch things I would have missed - things like cubic.dev or greptile etc. Especially helpful for running an open source project, where code quality can have high variance and as a maintainer you may feel hesitant to be direct with someone -- the machine has no feelings, so it is what it is :)
Honestly? This, but zoomed out. Machines are supposed to do the grunt work so that people can spend their time being creative and doing intangible, satisfying things, but we seem to have built machines to make art, music and literature in order to free ourselves up to stack bricks and shovel manure.
codex can actually do useful reviews on pull requests, as of the last few weeks
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
Here's a technique that often works well for me: When you get unexpectedly poor results, ask the LLM what it thinks an effective prompt would look like, e.g. "How would you prompt Claude Code to create a plan to effectively review code for logic bugs, ignoring things like FIXME and TODO comments?"
The resulting prompt is too long to quote, but you can see the raw result here: https://gist.github.com/CharlesWiltgen/ef21b97fd4ffc2f08560f...
From there, you can make any needed improvements, turn it into an agent, etc.
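If it helps, here's a minimal sketch of that loop as a script (assuming the OpenAI Python client with an API key in the environment; the model name is just a placeholder, and this is not the exact Claude Code workflow):

    # Sketch: ask the model to write the review prompt, then reuse it.
    # Assumes the OpenAI Python client; the model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    meta = (
        "How would you prompt a coding agent to create a plan to effectively "
        "review a C codebase for logic bugs, ignoring things like FIXME and "
        "TODO comments? Reply with the prompt text only."
    )

    better_prompt = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model you have access to
        messages=[{"role": "user", "content": meta}],
    ).choices[0].message.content

    # Hand-edit the result, then use it as the actual review/agent prompt.
    print(better_prompt)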
I've found this a really useful strategy in many situations when working with LLMs. It seems odd that it works, since one would think its ability to give a good reply to such a question means it already "understands" your intent in the first place, but that's just projecting human ability onto LLMs. I would guess this technique is similar to how reasoning modes seem to improve output quality, though I may misunderstand how reasoning modes work.
Works for humans the same? Even if you know how to do a complex project, it helps to first document the approach, and then follow it.
This is a great idea, and worth doing. Another option in Claude Code that can be worth trying is the planning mode, which you start with ctrl+tab. Have it plan out what it's going to do, and keep iterating on the plan until it seems sound. Tbh I wish I'd found the planning mode earlier; it's been such a great help.
I have also had some success with this method
I asked ChatGPT to analyze its weaknesses and give me a pre-prompt to best help mitigate them and it gave me this: https://pastebin.com/raw/yU87FCKp
I've found it very useful to avoid sycophancy and increase skepticism / precision in the replies it gives me
I've "worked" with Claude Code to find a long standing set of complex bugs over the last couple of days, and it can do so much more. It's come up with hypotheses, tested them, used gdb in batch mode when the hypotheses failed in order to trace what happened at the assembly level, and compared with the asm dump of the code in question.
It still needs guidance, but it quashed bugs yesterday that I've previously spent many days on without finding a solution for.
It can be tricky, but they can definitely be a significant aid for even very complex bugs.
Cursor BugBot is pretty good for this, we did the free trial and it was so popular with our devs that we ended up keeping it. Occasional false positives aside, it's very useful. It saves time for both the PR submitter and the reviewer.
I've had reasonably good success with asking Claude things like: "There's a bug somewhere that is causing slow response times on several endpoints, including <xyz>. Sometimes response times can get to several seconds long, and don't look correlated with CPU or memory usage. Database CPU and memory also don't seem to correlate. What is the issue?" I have to iterate a few times but it's hinted me a few really tricky issues that would have probably taken hours to find.
Definitely optimistic for this way to use AI
I found GPT-5 to be very much less sycophantic than other models when it comes to this stuff, so your mention of 'everything looking great yay good job high-five' surprises me. Using it via Codex CLI it often questions things. Gemini 2.5 Pro is also good on this.
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
I explicitly asked it to read all the code (within Cline) and it did so, gave me a dozen action items by the end of it, on a Django project. Most were a bit nitpicky, but two or three issues were more serious. I found it pretty useful!
My thoughts exactly. So many actually useful tools could be built on top of LLMs, but most of the resources go into the no code space.
I get it though, non programmers or weak programmers don't scrutinise the results and are more likely to be happy to pay. Still, bit of a shame.
Maybe these tools exist, but at least to me, they don't surface among all the noise.
In an application I'm working on, I use gpt-oss-20B. In a prompt I dump in the OWASP Top 10 web vulnerabilities, and a note that it should only comment on "definitive vulnerabilities". Has been pretty effective in finding vulnerabilities in the code I write (and it's one of the poorest-rated models if you look at some comments).
Where I still need to extend this is to introduce function calling into the flow: when it "has doubts" during reasoning would be the right time to call a tool that expands the context it's working with (pull in other files, etc.).
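For context, the core of that flow isn't much more than this (a rough sketch; the endpoint, model name, and file paths are placeholders for my local setup, and the real prompt is longer):

    # Rough sketch: one file plus OWASP Top 10 notes in a single prompt, asking
    # only for definitive findings. Endpoint, model, and paths are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    owasp_notes = open("owasp_top10.md").read()   # your own summary of the categories
    source = open("app/views.py").read()          # file under review

    prompt = (
        "Review the following code for security vulnerabilities.\n\n"
        f"Reference (OWASP Top 10 summary):\n{owasp_notes}\n\n"
        "Only report definitive vulnerabilities, each with location and a short "
        "justification. If nothing is definitive, say so.\n\n"
        f"Code:\n{source}"
    )

    resp = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)

The function-calling extension would essentially just add a tools=[...] definition for a read_file-style helper so the model can pull in more context when it's unsure.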
> (and it's one of the poorest-rated models if you look at some comments).
Yeah, don't listen to "wisdom of the crowd" when it comes to LLM models, there seems to be a ton of fud going on, especially on subreddits.
GPT-OSS was piled on for being dumb in the first week of release, yet none of the software properly supported it at launch. As soon as it was working properly in llama.cpp, it was clear how strong the model was, but by that point the popular sentiment seemed to have spread and solidified.
Tool calling is the best lever for getting value out of LLMs
I use Zed's "Ask" mode for this all the time. It's a read only mode where the LLM focuses on figuring out the codebase instead of modifying it. You can toggle it freely mid conversation.
Indeed, in many machine learning models, classification is easier than generation. Maybe that's consistent with ChatGPT's intelligence level.
I've had great success with both ChatGPT and Claude with the prompt "tell me how this sucks" or "why is this shit". Being a bit more crass seems to bump it out of the sycophantic mode, and being more open-ended about the type of problems you want it to find seems to yield better results.
But I've been limiting it to a lot less than 20k LoC; I'm sticking with stuff I can just paste into the chat window.
Really surprised that nobody in this thread mentions using Gemini 2.5 Pro. Its 1m context really shines for code review.
GPT 5 has been disappointing with thinking and without.
Suggestion: run a regex to remove those FIXME comments first, then try the experiment again.
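Something quick and dirty like this would do (a sketch that only strips single-line // and # comments containing FIXME/TODO; adapt it for block comments):

    # Strip FIXME/TODO comment text from copies of the files before handing them
    # to the model, so it can't just echo your own notes back at you.
    import pathlib
    import re
    import sys

    pattern = re.compile(r"(//|#).*\b(FIXME|TODO)\b.*$", re.IGNORECASE)

    for arg in sys.argv[1:]:
        path = pathlib.Path(arg)
        cleaned = "\n".join(
            pattern.sub("", line)
            for line in path.read_text(errors="ignore").splitlines()
        )
        # Write alongside the original so the source tree stays untouched.
        path.with_name(path.name + ".stripped").write_text(cleaned + "\n")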
I often use Claude/GPT-5/etc to analyze existing repositories while deliberately omitting the tests and documentation folders because I don't want them to influence the answers I'm getting about the code - because if I'm asking a question it's likely the documentation has failed to answer it already!
I really didn't expect a story about curl and AI to be positive for once.
Some history: https://hn.algolia.com/?q=curl+AI
Yeah this is really fair play to Daniel Stenberg that he still approached these AI generated bug reports with an open mind after all the problems he's had.
I think the big difference is that these aren't AI generated bug reports. They are bugs found with the assistance of AI tools that were then properly vetted and reported in a responsible way by a real person.
Basically using AI the way we have used linters and other static analysis tools, rather than thinking it's magic and blindly accepting its output.
In the defense of the language models, the bugs were written by humans in the first place. Human vetting is not much of a defense.
From what I understand, some of the bugs were in code the AI made up on the spot; other bug reports had example code that didn't even interact with curl. These things should be relatively easy for a human to verify: just do a text search in the curl source to see if the AI output matches anything.
Hard-to-compute, easy-to-verify problems should be exactly where AI excels. So why do so many AI users insist on skipping the verify step?
> Human vetting is not much of a defense.
The issue I keep seeing with curl and other projects is that people are using AI tools to generate bug reports and submitting them without understanding (that's the vetting) the report. Because it's so easy to do this and it takes time to filter out bug report slop from analyzed and verified reports, it's pissing people off. There's a significant asymmetry involved.
Until all AI used to generate security reports on other peoples' projects is able to do it with vanishingly small wasted time, it's pretty assholeish to do it without vetting.
Yep, I feel for the guy. He's had to deal with a hell of a lot of frustrating crap from AI slop to crazy end-users. Kudos for staying on top of it.
Always fun to wake up (ok; I didn't wake up, I got off a 10 hour flight) to see my work on the front page of hn.
I'll be doing a retrospective in a few weeks when the dust has settled, as well as new tools I've been made aware of.
I thoroughly enjoyed the post, one of the few lengthier blog posts I read start-to-finish.
Seems like ZeroPath might be worth looking into if the price is reasonable
Thank you, it means a lot.
I’m curious what tools that may be.
Here are 55 closed PRs in the curl repo which credit "sarif data" - I think those are the ones Daniel is talking about here https://github.com/curl/curl/pulls?q=is%3Apr+sarif+is%3Aclos...
This is notable given Daniel Stenberg's reports of being bombarded by total slop AI-generated false security issues in the past: https://www.linkedin.com/posts/danielstenberg_hackerone-curl...
Concerning HackerOne: "We now ban every reporter INSTANTLY who submits reports we deem AI slop. A threshold has been reached. We are effectively being DDoSed. If we could, we would charge them for this waste of our time"
Also this from January 2024: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
Some of those bugs, like using the wrong printf-specifier for a size_t, would be flagged by the compiler with the right warning flags set. An AI oracle which tells me, "your project is missing these important bug-catching compiler warning flags," would be quite useful.
A few of these PRs are dependabot PRs which match on "sarif", I am guessing because the string shows up somewhere in the project's dependency list. "Joshua sarif data" returns a more specific set of closed PRs. https://github.com/curl/curl/pulls?q=is%3Apr+Joshua+sarif+da...
The models have improved quite a bit since then; I guess his change of opinion shows that.
No, he's still dealing with a flood of crap, even in the last few weeks, from more modern models.
It's primarily from people just throwing source code at an LLM, asking it to find a vulnerability, and reporting it as-read, without having any actual understanding of if it is or isn't a vulnerability.
The difference in this particular case is it's someone who is: 1) Using tools specifically designed for security audits and investigations. 2) Takes the time to read and understand the vulnerability reported, and verifies that it is actually a vulnerability before reporting.
Point 2 is the most significant bar that people are woefully failing to meet and wasting a terrific amount of his time. The one that got shared from a couple of weeks ago https://hackerone.com/reports/3340109 didn't even call curl. It was straight up hallucination.
I think it's more about how people are using it. An amateur who spams him with GPT-5-Codex produced bug reports is still a waste of his time. Here a professional ran the tools and then applied their own judgement before sending the results to the curl maintainers.
I keep irritating people with this observation, but this was the status quo ante; at least an AI slop report shows clear intent, and you can ban those submitters without even a glance at anything else they send.
The current scale of poor reports was absolutely not the status quo before AI
The last time I was staffed on a project that had to do this, we were looking at many dozens per day, virtually all of them bogus, many attached to grifters hoping to jawbone the triage person into paying a nominal fee to get them to shut up. It would be weird if new tooling like LLMs didn't accelerate it, but that's all I'd expect it to do.
It's probably also the difference of idiots hoping to cash out/get credit for vulnerabilities by just throwing ChatGPT at the wall compared to this where it seems a somewhat seasoned researcher is trialing more customized tools.
This should probably link to the original blog post by Joshua Rogers:
https://joshua.hu/llm-engineer-review-sast-security-ai-tools... ("Hacking with AI SASTs: An overview of 'AI Security Engineers' / 'LLM Security Scanners' for Penetration Testers and Security Teams")
The PDF slide deck that accompanies that post includes some screenshots of the tools that were used: https://joshua.hu/files/AI_SAST_PRESENTATION.pdf
Thanks—we've added that link to the toptext above.
It wasn't immediately obvious to me what the AI tools were? He mentioned that multiple other tools failed to find anything, so I'm very curious to hear what made this strategy so superior.
There's a blog link https://joshua.hu/llm-engineer-review-sast-security-ai-tools... that has a Products chapter
I guess the mastodon link is simply a confirmation that the bugs were indeed bugs, even with wrong code snippets?
Love this take, actually. I've been working on this and published on it back in 2023/2024. Recently, inspired by the Claude Code and Cline agentic flow plus tool looping, I experimented with the same approach using tools like file_read and dir_list, throwing in a few SAST tools and security prompts against the WordPress plugin ecosystem (plugins with roughly 10k-100k active installations). I scanned around ~600 plugins and, to my surprise, it yielded ~45 critical and ~120 high-severity issues, with about 20% discounted for non-reachability. I spent around $6 and ~40 million tokens with the grok-4 fast reasoning model and the results were impressive. I also gave claude-sonnet a try, but was significantly rate-limited despite having $50 in research credits from Anthropic.
You can read about my experience here: https://codepathfinder.dev/blog/introducing-secureflow-cli-t...
Old post: https://shivasurya.me/security-reviews/sast/2024/06/27/autom...
Now that is how LLM assistance for coding can be useful. Would be interesting to know which set of tools was used exactly. How might one reproduce this kind of assistance for other code bases?
See Joshua's post for details: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
Tools included ZeroPath, Corgea and Almanax.
[dead]
When I read “we consider nread == 0 as reading a byte and we shouldn’t” I immediately think of all the things that look like bugs but are there because some critical piece of infrastructure relies on that behavior. AI isn’t going to know about that unless you tell it, and the problem is that there’s plenty of folks who have job security precisely because they don’t write that down.
If something is found by Valgrind, we can reproduce it ourselves. Here we get private bug reports found by "his set of AI assisted tools".
The set seems to be:
https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
So he likes ZeroPath. Does that get us any further? No, the regular subscription costs $200 and the free one-time version looks extremely limited and requires yet another login.
Also of course, all low hanging fruit that these tools detect will be found quickly in open source (provided that someone can afford a subscription), similar to the fact that oss-fuzz has diminishing returns.
Presumably the bug reports were private because some of them might relate to curl security.
You can see the fixes that resulted from this in the PRs that mention "sarif" in the curl repository: https://github.com/curl/curl/pulls?q=is%3Apr+sarif+is%3Aclos...
I am a bit worried about the abuse of those tools. I wonder if the tools have policies and mechanisms around that (no clue how, but things like forced disclosure if they detect that the scanned code is OSS, or free usage for OSS teams). Otherwise they seem like great tools for accelerating the discovery of 0-days. Maybe even worse, when building supply chain attacks one could relatively easily test against the detection mechanism before contributing malicious code (I wonder if they could detect and block malicious use/probing). However, I guess long term it will make our software more secure. It is always an arms race, I suppose.
More interesting to me is how to stop these bugs from occurring in the first place. The example given in the thread is the kind of bug that C (and mutation) excels at creating.
The linked blog post https://joshua.hu/llm-engineer-review-sast-security-ai-tools... shows that most of the used tools can be run in ci and comment on the PRs.
And how many would’ve been avoided by finishing the rust port?
> I have already landed 22(!) bugfixes thanks to this, and I have over twice that amount of issues left to go through
Sounds like it was a lot more than 22, assuming most are valid.
Once Claude found a bug in my code but I had to explain the structure of the data. Then and only then it found the bug.
Perhaps Anthropic, OpenAI, and Google could compete by auditing and monitoring the top projects?
I work at an ML security R&D startup called Pwno. We've been working specifically on putting LLMs into memory security for the past year; we've spoken at Black Hat, and we worked with GGML (llama.cpp) on providing a continuous memory-security solution driven by multi-agent LLMs.
Something we learned along the way is that when it comes to this specific field of security, what we call low-level security (memory safety etc.), validation and debugging have become more important than vulnerability discovery itself, because of hallucinations.
From our trial and error (trying validator architectures and security research methodologies, e.g. reverse taint propagation), it seems the only way out of this problem is to design an LLM-native interactive environment, where the LLMs validate their own findings through interactions with the environment or the component. The reason web-security-oriented companies like XBOW are doing very well is how easy validation is for them. I saw XBOW's LLM trace at Black Hat this year; all the tools they used, and pretty much all they need, is curl. For web security, the abstraction over the backend is limited enough that you send a request and it either works or you easily know why it didn't (XSS, SQLi, IDOR). But for low-level security (memory safety), the entropy of dealing with UAFs and OOBs is at another level. There are certain things you just can't tell by looking at the source; they require looking at a particular program state (heap allocation, which depends on the glibc version, stack structure, register states...), and this ReAct-style looping with debuggers to construct a PoC/exploit is what has been a pain in the ass. (LLMs and tool calling are specifically bad at these strategic, stateful tasks; see DeepMind's Tree-of-Thoughts paper discussing this issue.) The way I've seen Google Project Zero & DeepMind's Big Sleep mitigate this is through GDB scripts, but that only goes up to a certain complexity of program state.
When I was working on our integration with GGML, spending around two weeks on context and tool engineering already led us to very impressive findings (OOBs); but the hallucination problem scales with the number of "runs" of our agentic framework. Because we're monitoring llama.cpp's main-branch commits, every commit triggers an internal multi-agent run on our end, and each usually takes around an hour and hundreds of agent recursions. Some days we would end up with 30 really convincing, in-depth reports on OOBs and UAFs. But because of how costly it is to validate even one (from understanding to debugging to PoC writing...) and because of hallucinations (and each run is really expensive), we had to pause the project for a bit and focus on solving the agentic validation problem first.
I think when the environment gets more and more complex, interactions with the environment, and learning from these interactions, will matter more and more.
> I think when the environment gets more and more complex, interactions with the environment, and learning from these interactions will matter more and more
Thanks for sharing your experience! It correlates with this recent interview with Sutton [1]: that real intelligence is learning from feedback in a complex and ever-changing environment. What an LLM does is train on a snapshot of what has been said about that environment and operate only on that snapshot.
[1] https://www.dwarkesh.com/p/richard-sutton
Yes the AI gave him leads and a talented programmer still has to follow up on them one by one.
It's like police facial recognition: it can help the police, but there is no way it is "replacing the police".
Link should be updated to this
https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
I've added that link to the toptext, but I can't quite tell which URL should be the starting point.
Love this one:
https://mastodon.social/@icing@chaos.social/1152440641434357...
>tldr
>The code was correct, the naming was wrong.
Also, what an intense presentation style.
Red borders around every slide and very flashy images
So whereas previously, repo owners were getting flooded with AI-generated PRs that were complete slop, now they're going to be flooded with PRs that contain actual bugfixes. IDK which problem is worse!
Oh, so AI usage news can be positive after all. Not to downplay the huge issue of slop-report spam, but I'm so happy to see something besides doomerism.
"AI" tools can be very powerful once you approach them as what they are: very good pattern matchers and generators. This ability far surpasses anything a human could do. Detecting potential issues in software is a great application of the technology.
The key word is "potential", though. They're still wildly unpredictable and unreliable, which is why an expert human is required to validate their output.
The big problem is the people overhyping the technology, selling it as "AI", and the millions deluded by the marketing. Amidst the false advertising, uncertainty, and confusion, people are forced to speculate about the positive and negative impacts, with wild claims at both extremes. As usual, the reality is somewhere in the middle.
There are some good SAST scanners and many bad commercial scanners.
Many people advocate for the use of AI technology for SAST testing. There are even people and companies that deliver SAST scanners based on AI technology. However: Most are just far from good enough.
In the best case scenario, you’ll only be disappointed. But the risk of a false sense of security is enormous.
Some strong arguments against AI scanners can be found on https://nocomplexity.com/ai-sast-scanners/
Notice it was 'a set of tools'
They're using it correctly. It's a system of tools, not an autopilot.
I did not read it, but this article from the contributor should contain more details: https://joshua.hu/llm-engineer-review-sast-security-ai-tools... (mentioned in https://mastodon.social/@bagder/115241413210606972).
It's weird that the discussion has collapsed down to "autopilots" vs. "abstention". I'm thrilled to be converging on an understanding that it's instead "people who understand what they're trying to do" vs. "vibe coders".
In defense of the cynics, I get the impression we're in a situation where (a) there's so much company marketing hype in such a competitive market that it begs cynicism, and (b) we're constantly learning the boundary of what trained LLMs can and can't actually do, as well as discovering unusual emergent workflows that really do make a difference.
Well, that's how Mr. Stenberg described it, but he wasn't the one using them. I don't know how the contributor feels about his AI tool(s).
I haven't read it yet, but later in the mastodon thread, stenberg says "this is [the contributor's] (long) blog post on his work: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...".
[dead]
Something sounds fishy in this. Have these bugs really been found by AI? (I don't think they were.)
If you read Corgea's (one of the products used) "whitepaper", it seems that AI is not the main show:
> BLAST addresses this problem by using its AI engine to filter out irrelevant findings based on the context of the application.
It seems that AI is being used to post-process the findings of traditional analyzers. It reduces the amount of false positives, increasing the yield quality of the more traditional analyzers that were actually used in the scan.
Zeropath seems to use similar wording like "AI-Enabled Triage" and expressions like "combining Large Language Models with AST analysis". It also highlights that it achieves less false positives.
I would expect someone who developed this kind of thing to setup a feedback loop in which the AI output is somehow used to improve the static analysis tool (writing new rules, tweaking existing ones, ...). It seems like the logical next step. This might be going on on these products as well (lots of in-house rule extensions for more traditional static analysis tools, written or discovered with help of AI, hence the "build with AI" headline in some of them).
Don't get me wrong, this is cool. Getting an AI to triage a verbose static analysis report makes sense. However, it does not mean that AI found the bugs. In this model, the capabilities of finding relevant stuff are still capped at the static analyzer tools.
I wonder if we need to pay for it. I mean, now that I know it is possible (at least in my head), it seems tempting to get open source tools, set them to max verbosity, and figure out which prompts to use on (likely vanilla) coding models to get them to triage the output.
Hi there, I'm Ahmad, CEO at Corgea, and the author of the white paper. We do actually use LLMs to find the vulnerabilities AND triage findings. For the majority of our scanning, we don't use traditional static analysis. At the core of our engine is the LLM reading the line of code to find CWEs in them.
Looks like you're reacting to the Hacker News title here, which is currently " Daniel Stenberg on 22 curl bugs found by AI and fixed"
That's an editorialized headline (so it may get fixed by dang and co) - if you click through to what Daniel Stenberg said he was more clear:
> Joshua Rogers sent us a massive list of potential issues in #curl that he found using his set of AI assisted tools.
AI-assisted tools seems right to me here.
If the title changes, it is still a valid critique of the tools, how they might work, and a possible way of getting them for free.
Also, think about it: of course I read Joshua's report. Otherwise, how could I have known the names of the products he used?
I don't think many people here are interested in how something works. They want to see the headline "Curl developer finally convinced by AI!" and otherwise drop anecdotes about Claude Code etc.
All comments that want to know more are at the bottom.
It’s clear my attempt to keep the gist of what Daniel said while keeping under the title character count didn’t hit the mark.
How would you have worded it?
Always tricky! In this case maybe the following:
Daniel Stenberg on 22 curl bugs reported using AI-assisted security scanners
That doesn't really convey that these bug reports were for real issues and greatly appreciated, unlike the slop that Daniel is known for complaining about, which I think is the real story here.
I will spend longer considering my title next time.
Cheers!
Hi, I'm Etienne, one of the cofounders @ ZeroPath.
We do not use traditional static analyzers; our engine was built from the ground up to use LLMs as a primitive. The issues ZeroPath identified in Joshua's post were indeed surfaced and triaged by AI.
If you're interested in how it works under the hood, some of the techniques are outlined here: https://zeropath.com/blog/how-zeropath-works
Hi! Thanks for the reply.
Joshua describes it as follows: "ZeroPath takes these rules, and applies (or at least the debug output indicates as such) the rules to every .. function in the codebase. It then uses LLM’s ability to reason about whether the issue is real or not."
Would you say that is a fair assessment of the LLM role in the solution?
I suppose the downvoters all have subscriptions to the tools and know exactly how the tools work while leaving the rest of us in the dark.
Even Joshua's blog post does not clearly state which parts and how much is "AI". Neither does the pdf.
[flagged]
[flagged]
What does "even at Matasano" mean? Matasano hasn't existed for over 12 years.
My mistake. I confused Mataroa with Matasano:
https://ludic.mataroa.blog/blog/contra-ptaceks-terrible-arti...
I think you owe me a better apology than that. I disagree with your evaluation of the response to that post, strongly, but more importantly I didn't bring it up in the first place, and the claim you made about it (intentionally or not; awfully weird to land at my company's name) was personal and scurrilous.
Your call! I'm moving on.
I emailed you about two hours ago based on the last email exchange we had, about 7 years ago. I included the email addresses I had for you at the time, which included Matasano. Presumably this is why someone mentioned it in a comment to you.
did you get my email?
refulgentis mentioned your fly.io article as evidence that the "AI" science is settled.
I remembered quite a bit of a blowback and unfortunately saw the Mataroa article last week. It contains a reference to Matasano; I didn't look closely at the URL and mentally classified Matasano as a classic informal hacker company with extreme free speech blogs, where you could criticize the founder with a typical hyperbolic article.
That is all there is to it.
I assumed based on your post and the post you replied to that it is literally impossible to prove any AI is involved, and I trust both of you on that.
Given that, I'm afraid all the interlocution I have to offer is the thing you commented on, the mind of a downvoter, i.e. positing that every downvoter must have details, including details we[1] can't find.
Past that, I'm afraid to admit I am having difficulty understanding how the slides are related, and I don't even know what Matasano is -- is that who owns fly.io? I thought they were "indie" -- I'm embarrassed to admit I thought Monsanto at first. I do know how much I've used AI to code, so I can vouch for tptacek's post.
[1] royal we, i.e. I trust you and OP so completely on what is findable vs. not findable that I trust we can't establish with 100% certainty that any sort of AI-based thingy was used at all. To be clear, too, 100% is always too high a bar; I mean to say we can't even establish it at 90% confidence. Even 1% confidence. If all we have is their word to go on, it's impossible.
Matasano was a software security company I cofounded in 2005 and sold to NCC Group in 2012. Super weird pull for this thread.
Do you believe AI is at the core of these security analyzers? If so, why the personal story blogpost? You can just explain me in technical terms why is that so.
Claiming to work for Google does not work as an authority card for me, you still have to deliver a solid argument.
Look, AI is great for many things, but to me these products sound like chocolate that is actually just 1% real chocolate. Delicious, but 99% not chocolate.
I had a conversation in a chat room yesterday about AI-assisted math tutoring where a skeptic said that the ability of GPT5 to effortlessly solve quotient differentials or partial fraction decomposition or rational inequalities wasn't indicative of LLM improvements, but rather just represented the LLMs driving CAS tools and thus didn't count.
As a math student, I can't possibly care less about that distinction; either way, I paste in a worked problem solution and ask for a critique, and either way I get a valid output like "no dummy multiply cos into the tan before differentiating rather than using the product rule". Prior to LLMs, there was no tool that had that UX.
In the same way: LLMs are probably mostly not deriving vulnerabilities axiomatically off the top of their "heads" (giant stacks of weight matrices), but rather doing a very thorough job of applying existing program analysis tools, assembling and parallel-evaluating large numbers of hypotheses, and then filtering them out. My interlocutor in the math discussion would say that's just tool calls, and doesn't count. But if you're a vulnerability researcher, it doesn't matter: that's a DX that didn't exist last year.
As anyone who has ever been staffed on a project triaging SAST tool outputs before would attest: it extremely didn't exist.
I don't care if it counts as true LLM brilliance or not.
If it doesn't matter if it's AI or not, just that they're good tools, why even advertise the AI keyword all over it? Just say "best in class security analysis toolset". It's proprietary anyway, you can't know how much of it is actually AI (unless you reproduce its results, which is the core argument you missed here).
Because that's not accurate. The underlying program analysis tooling already existed, but the LLM glue logic is what makes it effective. You could as a human replicate it with those preexisting tools, but you won't, in the same way you wouldn't model your whole project in a prover and solve it with formal methods; it was possible, but that possibility isn't meaningful.
> the LLM glue logic is what makes it effective
Allegedly (it's proprietary, we don't know). Maybe it's the triage approach, and there are undiscovered non-LLM triage techniques that would surpass it.
If I were to guess, the AI naming was for marketing purposes (to ride on the hype train), not because it accurately describes the product (even though it might accurately describe the product).
Most importantly, how is that it's so effective? I want to know. Perhaps you and some others just want to celebrate an LLM win. That's fine, but I want to know how it works.
I'd say my guess is fair, and it's a viable approach for someone trying to create a similar tool. If I were to try and replicate this, I would definitely start with an existing static analyzer. For example, I would do it with phpstan (just because I know it a little bit better).
I would extend it so it becomes more verbose than it currently is (something humans don't want, but machines might benefit from). Perhaps I would introduce some rules that make it report things that aren't even issues, but just information I can gauge from the AST (like, does this controller have a middleware? If so, emit something in the report). Then I would attempt to use that enriched report as the input for a coding model, and experiment with different prompts and different granularity units on the input.
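In rough Python, the arrangement I'm describing is little more than this (a sketch only; phpstan and the model name are placeholders, and real products presumably add a lot of rule engineering and context gathering on top):

    # Sketch: run a static analyzer with machine-readable output and hand the
    # report to an LLM for triage. Analyzer and model names are placeholders.
    import subprocess
    from openai import OpenAI

    report = subprocess.run(
        ["phpstan", "analyse", "--error-format=json", "src/"],
        capture_output=True, text=True,
    ).stdout

    client = OpenAI()
    triage = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{
            "role": "user",
            "content": (
                "Here is a static analysis report in JSON. For each finding, say "
                "whether it looks like a real issue or a false positive, and why:\n"
                + report
            ),
        }],
    )
    print(triage.choices[0].message.content)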
It sounds reasonable, doesn't it? I could describe that approach as "LLM right in the core of the solution", but I know by heart that in that arrangement, the quality of the final product is still capped by the static analyzer and what it can detect and describe. It doesn't matter that the LLM is what makes it better. My wheat farm is still about wheat, not the fancy sieve I recently bought to separate the wheat from the chaff.
I don't understand why this sounds so offensive to some of the readers here. I was just thinking "how would I use AI in such a product" and the only way I can come up with is this way in which is not the main show.
I mean, my experience with LLMs also confirms that. Prompting "find me bugs" or stuff like that almost never works. It works better if I get an error and ask it to explain it to me, giving it the application context. The static analyzer is there to give this initial kick, to create the nucleation sites on which the LLM will crystallize answers.
This sounds like the most viable, easy to make product that can find bugs with LLMs. It's only offensive if that's actually what these products are doing, it's not supposed to be known and I struck a nerve or something.
Why does it work well now, after 20 years of this kind of tooling being next to useless? Do you work in this space? How much about how bad SAST tools are do I need to explain?
Maybe it's unleveraged potential, I don't know. I am also not entirely convinced that they're next to useless. Sanitizers, for example, are excellent for mitigating all sorts of security issues. Those are traditional static analysis tools (that, by the way, fit the arrangement I described of using these reports as nucleation sites for LLM triage).
I did walk you through how I would do it. Would you change your response if I said I work in this space? It seems like an irrelevant point in this discussion.
You don't need to explain anything. This is on a flagged thread, obscure and unseen. I'm actually surprised by how invested you are in this apparently irrelevant matter.
I'm a software security person! This is not irrelevant to me.
In summary: the existing program analysis tooling in this space has been ineffective for decades, despite hundreds of millions of dollars invested in the tooling. If it is effective now, that strongly indicates that the LLM component of it isn't irrelevant; nothing else in the field has changed.
Note that everybody in this story concedes the LLM involvement. The only person who isn't is you, and you're not actually involved. (I'm not either, but I'm agreeing with --- checks again --- everybody involved in the story).
I concede the LLM involvement. But I want to be more specific in the description of the role it plays in the solution.
If it plays a central role, then there is nothing to lose from describing it better. That's why this feels so strange. You disagree with me, but you don't present an arrangement in which the LLM plays a role different from what I described. In fact, no one here did. It's like you're not disagreeing with me, but trying to make me stop describing how to achieve a similar-quality system out of free pieces.
I don't mean to aggravate you. I do mean to offer some insight in the mindset of the people the person I was replying to was puzzled by. I'm calmed by the fact that if we're both here, we both value one of the HN sayings I'm very fond of: come with curiosity.
> Do you believe AI is at the core of these security analyzers?
Yes.
> If so, why the personal story blogpost?
When I am feeling intensely, and people respond to me as I'm about to respond to you, I usually get very frustrated. Apologies in advance if you suffer from that same part of being human, I don't mean anything about you or your positions by this:
I don't know what you mean.
Thus, I may be answering wrong with the following: the person I replied to indicated all downvoters must know every detail, and since the, well, let's use your phrasing, "personal story blogpost" (I assume you mean my comment) leads with: "I believe there's a little more going on than everyone knowing every detail already, or presumably, being wrong to downvote. Full case study of a downvoter at work:"
> Claiming to work for Google
I claimed the opposite! I'm a jobless hack :) (quit in 2023)
> does not work as an authority card for me,
Looking at it, the thing isn't "I worked at Google therefore AI good" it's "I worked at Google and on a specific well-known project, the company's design language, used AI pre-ChatGPT to great effect. It's unclear to me why this use case would be unbelievable years later"
> you still have to deliver a solid argument.
What are we arguing? :) (I'm serious! Apologies, again, if it comes off as flippant. If you mean I need to deliver a solid argument the tools must have AI, I assume if said details were available you would have found them, you seem well-considered and curious. I meant to explain the mind of a downvoter who yet cannot recite details as yet unavailable to the public to the person I replied to, not to verify the workflow step by step.)
The argument is that these high-quality security analyzers seem to use AI as a triage mechanism, and the quality of the analysis is still capped by the quality of the static analysis tool.
One of the tools provide a whitepaper, that you can read here:
https://corgea.com/blog/whitepaper-blast-ai-powered-sast-sca...
It seems to explicitly put AI in this coadjuvant role, contradicting the HN title "found by AI".
Neither I nor the other commenter actually dismissed AI as useless. I can't speak for him, but to me, it seems actually useful in this arrangement. However, not "I'll pay for a subscription" levels of useful.
Since it's just triage, it seems that trying to reproduce the idea using free tools might be worth a shot (and that's the idea of finding out where the AI component lies in the system). What I said is very doable (plug the output of traditional tools into vanilla coding LLMs prompts). It also looks a lot like this Corgea schematic:
https://framerusercontent.com/images/EtFkxLjT1Ou2UTPACObJbR2...
I mean, it's very brave to explain a downvote, but in this case, it seems that you missed the opportunity to make sense.
[flagged]
Somehow related:
You did this with an AI and you do not understand what you're doing here: https://news.ycombinator.com/item?id=45330378
Yeah, I'm quite confused. See also https://news.ycombinator.com/item?id=44411185 ("AI slop security reports submitted to curl"), and especially https://news.ycombinator.com/item?id=43907376 ("Curl: We still have not seen a valid security report done with AI help")
AI is non-deterministic as we know.
That makes its results unpredictable.
So don’t have AI create your bugs.
Instead have your AI look for problems - then have it create deterministic tools and let tools catch the issues in a repeatable, understandable, auditable way. Have it build short, easy to understand scripts you can commit to your repo, with files and line numbers and zero/nonzero exit codes.
It's that key step of turning AI insights into detection tools that transforms your outcomes from probabilistic to deterministic. Ask it to optimize the tools so they run in seconds. You can leave them in the codebase forever as linters, integrate them into your CI, and never hit that same bug again.
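As a toy illustration of what one of those committed checkers could look like (the pattern here is made up; the point is the file/line output and the nonzero exit code for CI):

    # Toy regression checker produced from an "AI insight": flag a code pattern
    # once identified as a bug so it can never silently come back. The pattern
    # below is illustrative only; the real value is the deterministic, auditable
    # file/line report and the nonzero exit code for CI.
    import pathlib
    import re
    import sys

    SUSPICIOUS = re.compile(r"nread\s*>=\s*0")  # made-up pattern for illustration
    hits = 0

    for path in pathlib.Path("src").rglob("*.c"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if SUSPICIOUS.search(line):
                print(f"{path}:{lineno}: suspicious nread check: {line.strip()}")
                hits += 1

    sys.exit(1 if hits else 0)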
I have to admit, I expected a couple of "You should rewrite it in Rust" hipster posts by now... Maybe they caught on that those types of posts were not having the effect they thought they would? I kid, I kid... mostly
[flagged]
The Canadian government should probably set up a bug bounty program so I can present some of the findings I made using AI and then tested or mapped out on some of their public-facing apps on the App Store/Play Store.