I think they're wrong, but you're also making a related mistake here:
> You're explaining it nicely and then seem to make a mistake that contradicts what you've just said – because code and text share a domain (text-based)
"Text" is not the domain that matters.
The whole trick behind LLMs being as capable as they are, is that they're able to tease out concepts from all that training text - concepts of any kind, from things to ideas to patterns of thinking. The latent space of those models has enough dimensions to encode just about any semantic relationship as some distinct direction, and practice shows this is exactly what happens. That's what makes style transfer pretty much a vector operation (instead of "-King +Woman", think "-Academic, +Funny"), why LLMs are so good at translating between languages, from spec to code, and why adding modalities worked so well.
With LLMs, the common domain between "text" and "code" is not "text", but the way humans think, and the way they understand reality. It's not the raw sequences of tokens that map between, say, poetry or academic texts and code - it's the patterns of thought behind those sequences of tokens.
Code is a specific domain - beyond being the lifeblood of programs, it's also an exercise in a specific way of thinking, taken up to 11. That's why learning code turned out to be crucial for improving general reasoning abilities of LLMs (the same is, IMO, true for humans, but it's harder to demonstrate a solid proof). And conversely, text in general provides context for code that would be hard to infer from code alone.
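To make the "style transfer is pretty much a vector operation" point concrete, here's a toy sketch of embedding arithmetic. `embed()` is a stand-in for whatever text-embedding model you have on hand, and the king/queen example is the classic word2vec illustration rather than something measured here:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder: return the vector your embedding model assigns to `text`."""
        raise NotImplementedError

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Classic word-vector illustration: directions in the space encode relations.
    king, man, woman, queen = (embed(w) for w in ["king", "man", "woman", "queen"])
    print(cosine(king - man + woman, queen))   # expected to be high with a decent model

    # Same idea with "style" as a direction instead of gender/royalty:
    academic = embed("The results demonstrate a statistically significant correlation.")
    funny = embed("Turns out the numbers agree with each other - who knew!")
    style_direction = funny - academic         # roughly "-Academic, +Funny"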
> because code and text share a domain (text-based), large, generic models will always out-compete smaller, specialized ones – that's the lesson
All digital data is just 1s and 0s.
Do you think a model trained on raw bytes would perform coding tasks better than a model trained on code?
I have a strong hunch that there’s some Goldilocks zone of specificity for statistical model performance and I don’t think “all text” is in that zone.
Here is the article for “the bitter lesson.” [0]
It argues that general machine learning strategies that use more compute end up better at learning a given data set than a strategy tailor-made for that set.
This does not imply that training on a more general dataset will yield more performance than using a more specific dataset.
The lesson is about machine learning methods, not about end-model performance at a specific task.
Imagine a logistic regression model vs an expert system for determining real estate prices.
The lesson tells us that, given more and more compute, the logistic regression model will perform better than the expert system.
The lesson does not imply that, given two logistic regression models, one trained on global real estate data and one trained on local data, the former would outperform the latter.
I realize this is a fine distinction and that I may not be explaining it as well as I could if I were speaking, but it’s an important distinction nonetheless.
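As a toy version of that last example (plain linear regression rather than logistic, since prices are continuous; synthetic data, purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Synthetic "global" market: price depends on size, but the rate per square metre
    # differs wildly between cities.
    sizes = rng.uniform(30, 200, size=5000)
    rates = rng.choice([2000, 5000, 12000], size=5000)
    prices = sizes * rates + rng.normal(0, 20000, size=5000)

    local = rates == 12000   # the one city we actually care about

    global_model = LinearRegression().fit(sizes.reshape(-1, 1), prices)
    local_model = LinearRegression().fit(sizes[local].reshape(-1, 1), prices[local])

    flat = np.array([[75.0]])  # a 75 m^2 flat in the expensive city
    print("global model:", global_model.predict(flat)[0])   # far below the true ~900k
    print("local model: ", local_model.predict(flat)[0])    # close to 75 * 12000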
> In natural language processing (NLP) terms, this is known as report generation.
I'm happy to see some acknowledgement of the world before LLMs. This is an old problem, and one I (or my team, really) was working on at the time of DALL-E & ChatGPT's explosion. As the article indicated, we deemed 3.5 unacceptable for Q&A almost immediately, as the failure rate was too high for operational reporting in such a demanding industry (legal). We instead employed SQuAD and polished up the output with an LLM.
These new reasoning models that effectively retrofit Q&A capabilities (an extractive task) onto a generative model are impressive, but I can't help but think that it's putting the cart before the horse and will inevitably give diminishing returns in performance. Time will tell, I suppose.
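Not the commenter's actual stack, but roughly what "SQuAD-style extraction plus an LLM polish" looks like with off-the-shelf pieces; the model name, the file path and the `polish()` helper are illustrative placeholders:

    from transformers import pipeline

    # Extractive QA: a SQuAD-trained model pulls the answer span out of the source
    # document, so the answer is grounded in the text by construction.
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    context = open("contract.txt").read()
    result = qa(question="What is the termination notice period?", context=context)

    def polish(span: str) -> str:
        """Placeholder: ask an LLM to rewrite the extracted span as a clean sentence."""
        raise NotImplementedError

    if result["score"] < 0.3:
        # Low extraction confidence: flag for human review rather than letting a
        # generative model guess.
        print("No confident answer found:", result)
    else:
        print(polish(result["answer"]))   # generation only rephrases, it never finds facts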
It's interesting that it says Grok excels at report generation, because I've found myself asking it to give me answers in a table format, to make it easier to 'grok' the output, since I'm usually asking it for comparisons I just can't do natively on Amazon or any other e-commerce site.
Funnily enough, Amazon will pick products to compare for you, but the compared items are usually terrible, and you can't just add whatever you want, or choose columns.
With Grok, I'll have it remove columns, add columns, shorten responses, so on and so forth.
As a user, I've found that researching the same topics in OpenAI Deep Research vs Perplexity's Deep Research results in "narrow and deep" vs "shallow and broad".
OpenAI tends to have something like 20 high quality sources selected and goes very deep in the specific topic, producing something like 20-50 pages of research in all areas and adjacent areas. It takes a lot to read but is quite good.
Perplexity tends to hit something like 60 or more sources, goes fairly shallow, answers some questions in general ways but is excellent at giving you the surface area of the problem space and thoughts about where to go deeper if needed.
OpenAI takes a lot longer to complete, perhaps 20x longer. This factors heavily into whether you want a surface-y answer now or a deep answer later.
I think it really comes down to your own workflow. You sometimes want to be more imperative (select the sources yourself to generate a report) and sometimes more declarative (let a DFS/BFS algo go and split a query into subqueries and go down rabbit holes until some depth and then aggregate).
Been trying different ways of optimizing the former, but I am fascinated by the more end-to-end flows that systems like STORM use.
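That "declarative" flow is basically a bounded BFS over sub-queries; a minimal sketch, with `split()`, `search()` and `aggregate()` as placeholders for whatever LLM/search calls you'd actually use:

    from collections import deque

    def split(query: str) -> list[str]:
        """Placeholder: ask an LLM to break a query into sub-queries."""
        raise NotImplementedError

    def search(query: str) -> str:
        """Placeholder: run one search/read pass and return the findings."""
        raise NotImplementedError

    def aggregate(root: str, findings: dict[str, str]) -> str:
        """Placeholder: synthesize all findings into a single report."""
        raise NotImplementedError

    def bfs_research(root_query: str, max_depth: int = 2) -> str:
        findings: dict[str, str] = {}
        queue = deque([(root_query, 0)])
        while queue:
            query, depth = queue.popleft()
            findings[query] = search(query)        # go down this rabbit hole
            if depth < max_depth:                  # ...but only to a fixed depth
                for sub in split(query):
                    if sub not in findings:
                        queue.append((sub, depth + 1))
        return aggregate(root_query, findings)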
It's amazing how these are the biggest information-organizing platforms on the internet and yet they fail to find different words to describe their products.
The primary issues with deep research tools are veracity and accurate source attribution. My issue with tools relying on DeepSeek R1, for example, is the high hallucination rate.
This gives STORM a high mark but didn't seem to get great results from GPT Researcher, which is the other open source project that was doing this before DeepResearch became the recent flavor of the day.
But there are so many ways to configure GPT Researcher for all kinds of budgets so I wonder if this comparison really pushed the output or just went with defaults and got default midranges for comparison.
Isn't this the worst possible case for an LLM? The integrity of the product is central to the value of it and the user is by definition unable to verify that integrity?
I'm not following, what's "by definition" here? You can verify the integrity of an AI report in the same way you would do with any other report someone prepared for you - when you encounter something that feels wrong, check the referenced source yourself.
I think it's normal to invest authority in a report that someone else has prepared - people take what is written on trust because the person who prepared it is accountable for any errors... not just now, but forever.
That's a lofty ideal, but the typical research report is flawed in multiple ways, and even in academia "Most Published Research Findings Are False" [0], and very few people's careers have in any way suffered, as there's very little de-facto accountability for this trust.
The best case scenario as I see it, is that we become more critical of research in general, and start training AI to help us identify these issues. And then we could perhaps utilize GAN to improve report generation.
I noticed that these models very quickly start to underperform regular search like Perplexity Pro 3x. It might be somewhat thorough in how it goes line by line, but it's not very cognizant of good sources - you might ask for academic sources but if your social media slider is turned on, it will overwhelmingly favour Reddit.
You may repeat instructions multiple times, but it ignores them or fails to understand the solution.
You have to be specific about which model you're referring to. OpenAI DeepResearch does not have a slider, and does follow when you say to only use academic sources.
I think "deep research" is a misnomer, possibly deliberate. Research assumes the ability to determine quality, freshness, and veracity of the sources directly from their contents. It also quite often requires that you identify where the authors screwed up, lied, chose deliberately bad baselines, and omitted results in order to make their work "more impactful". You'd need AGI to do that. This is merely search - it will search for sources, summarize them for you, and write a report, but you then have to go in and _verify it's not bullshit_, which can take quite a bit of time and effort, if you care about the quality of the final result, which for big ticket questions you almost always do.
I really like the distinction between DeepSearch and DeepResearch proposed in this piece by Han Xiao: https://jina.ai/news/a-practical-guide-to-implementing-deeps...
> DeepSearch runs through an iterative loop of searching, reading, and reasoning until it finds the optimal answer. [...]
> DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports
Given these definitions, I think DeepSearch is the more valuable and interesting pattern. It's effectively RAG built using tools in a loop, which is much more likely to answer questions effectively than more traditional RAG where there is only one attempt to find relevant documents to include in a single prompt to an LLM.
DeepResearch is a cosmetic enhancement that wraps the results in a "report" - it looks impressive but IMO is much more likely to lead to inaccurate or misleading results.
More notes here: https://simonwillison.net/2025/Mar/4/deepsearch-deepresearch...
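To make "RAG built using tools in a loop" concrete, here's a minimal sketch of the difference; `llm()` and `web_search()` are placeholders for whichever model and search backend you use, not any particular product's API:

    import json

    def llm(prompt: str) -> str:
        """Placeholder: call your LLM of choice and return its text output."""
        raise NotImplementedError

    def web_search(query: str) -> list[str]:
        """Placeholder: call your search backend and return text snippets."""
        raise NotImplementedError

    def one_shot_rag(question: str) -> str:
        # Traditional RAG: one retrieval attempt, one prompt, no second chances.
        docs = web_search(question)
        return llm(f"Answer using only these documents:\n{docs}\n\nQuestion: {question}")

    def deep_search(question: str, max_steps: int = 8) -> str:
        # Tools in a loop: search, read, reason, and decide whether to keep going.
        notes: list[str] = []
        query = question
        for _ in range(max_steps):
            notes.extend(web_search(query))
            decision = json.loads(llm(
                'Reply with JSON {"action": "answer"|"search", "query": "...", "answer": "..."}.\n'
                f"Question: {question}\nNotes so far: {notes}"
            ))
            if decision["action"] == "answer":
                return decision["answer"]
            query = decision["query"]       # reformulate the search and loop again
        return llm(f"Give a best-effort answer to {question!r} from these notes: {notes}")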
> DeepResearch is a cosmetic enhancement that wraps the results in a "report" - it looks impressive but IMO is much more likely to lead to inaccurate or misleading results.
I think that if done well, deep research can be more than that. At a minimum, I would say that before "deep search" you'd need some calls to an LLM to figure out what to look for, where best to look (i.e. sources, trust, etc.), how to tabulate the data gathered, and so on. Just as deep search is "RAG with tools in a loop", so can (should) deep research be.
Think of the analogy of using aider to go straight to code versus using it to first /architect and then code. But for any task that lends itself to (re)searching. At least it would catch useless tangents faster.
At the end of the day, what's fascinating about LLM-based agents is that you can almost always add another layer of abstraction on top. No matter what you build, you can always come at it from another angle. That's really cool IMO, and something Hassabis has hinted at lately in some podcasts.
So I've started a thing with Jina, and the first effort I am doing is setting the "tone", meaning I'm building a project template that will keep the bots focused.
I think that one part of the deep loop needs to be a check-in on expectations and goals…
So instead of throwing a deep task at it, I find that bots work better with small iterative chunks of objectives…
I haven't formulated it completely yet, but as an example, I've been working extensively with Cursor's whole Anthropic-abstraction AI-as-a-service:
So many folks suffer from the "generating" quagmire;
And I found that telling the bot to “break any response into smaller chunks to avoid context limitations” works incredibly well…
So when my scaffold is complete, the goal is to use Fabric Patterns for nursery assignments to the deep bots, whereby they constantly check in.
Prior to "deep" things, I found this to work really well by telling the bots to obsessively maintain development_diary.md and .json tracking of actions (even still, their memory is super small, and I envisioned a multi-layer of agents where the initial agents' actions feed the context of the agents who follow along, so you have a waterfall of context between agents and avoid context loss on super deep iterative research…
(I’ll type out something more salient when I have a KVM…
(But I hope that doesn’t sound stupid)
What are fabric patterns?
Basically agentic personas or modus operandi.
You tell the agent "grok this persona to accomplish the task".
I think the person who wrote that probably meant these, specifically: https://github.com/danielmiessler/fabric/tree/main/patterns
Yes, as I am the one who wrote that.
Right - I'm finding the flawed Deep Research tools useful already, but what I really want is much more control over the sources of information they use.
Sadly, I think that's why non-open-source commercial deep (re)search implementations are going to be largely useless. Even if you're using a customized endpoint for search like Kagi, the sources are mostly garbage, and no one except maybe Google Books has the resources and legal cover to expand that deep search into books, which are much better sources.
Exactly - like my whole codebase, or repositories of proprietary documents
First, SimonW, I devour everything you write and appreciate you most in the AI community and recommend you from 0 all the way to 1!!!
Thank you.
-
Second, thank you for bringing up Jina. I recently discovered it and immediately began building a Thing based on it:
I want to use its functions to ferret out all the entanglements from the WEF Leadership roster, similar to the NGO fractal connections - I'm doing that with every WEF member, through to Congress.
I would truly wish to work with you on such, if so inclined…
I prefer to build "dossiers" rather than reports, represented as JSON schemas.
I’m on mobile so will provide more details when at machine…
Looping through a dossier of connections is much more thoughtful than a “report” imo.
I need to see you on someone’s podcast, else you and I should make one!
Thanks! I've done quite a few podcast things recently, they're tagged on my blog: https://simonwillison.net/tags/podcast-appearances/
Dub, yeah I saw that but hadn’t listened yet.
What I want is a “podcast” with audience participation..
The Lex Fridman DeepSeek episode was so awesome, but I have so many questions, and I get exceedingly frustrated when Lex doesn't ask what may seem obvious to us HNers…
-
Back on topic:
Reports are flat; dossiers are malleable.
As I mentioned, my goal is fractal visuals (in minamGL) of the true entanglements from the WEF out.
Much like Mike Benz on USAID - using Jina deep research, extraction, etc. will pull back the veil on the truth of the globalist agenda seeking control and will reveal true relationships, loyalties, and connections.
It's been running through my head for decades, and I finally feel that Jina is a tool that can start to reveal what I and so many others can plainly see but can't verify.
I'm an idiot...
anyway - I just watched it.
Fantastic.
---
What are your thoughts on API overload -- we will need bots upon bots to keep track of the APIs...
Meaning... are we to establish an Internet Protocol ~~ AI Protocol addressing schema - whereby every API henceforth has an IPv6 addy?
and instead of "domain names" for sale -- you are purchasing and registering an IPv6-->API.MY.SERVICE
I watched them...
So amazed how aligned we are.
> DeepResearch is a cosmetic enhancement that wraps the results in a "report" - it looks impressive but IMO is much more likely to lead to inaccurate or misleading results.
Yup, I got the same impression reading this article - and the Jina one, too. Like with langchain and agents, people are making chained function calls or a loop sound like it is the second coming, or a Nobel prize-worthy discovery. It's not - it's obvious. It's just expensive to get to work reliably and productize.
It is obvious now, but I will defend langchain, which is catching a stray here.
Folks associated with that project were some of the first people to talk about patterns like this publicly online. With GPT-3-Davinci it was not at all obvious to most people that chaining these calls would work well, and I think the community and team around langchain and associated projects did a pretty good job of helping to popularize some of the patterns, to the point that they now seem obvious.
That said, I agree with your overall impression.
I thought DeepResearch has the AI driving the process because it's been trained to do so, whereas DeepSearch is something like langchain + prompt engineering?
There are three different launched commercial products called "Deep Research" right now - from Google Gemini, OpenAI and Perplexity. There are also several open source projects that use the name "Deep Research".
DeepResearch (note the absence of the space character) is the name that Han Xiao proposes for the general pattern of generating a research-style report after running multiple searches.
You might implement that pattern using prompt engineering or using some custom trained model or through other means. If the eventual output looks like a report and it ran multiple searches along the way it fits Han's "DeepResearch" definition.
Why is the AI industry so follow-the-leader in naming stuff? I've used at least three AI services called "Copilot".
Maybe for the same reason JavaScript is named Java Script and looks like Java (instead of being Scheme which it almost was, twice)? That is, purposeful name collision with an existing tool/buzzword that's very popular with non-technical management and corporate executives.
I mean, these companies have enough issues with branding and explaining the differences between AI products, even those they provide themselves (GPT-4, o1, o3-mini, etc. - do most OpenAI users know the differences between them or what they each specialise at?).
I guess they will take any opportunity to follow the leader here if they worry that they're at risk of similar branding issues here too.
> DeepResearch is a cosmetic enhancement that wraps the results in a "report"
No, that's not what Xiao said here. Here's the relevant quote:
> It often begins by creating a table of contents, then systematically applies DeepSearch to each required section – from introduction through related work and methodology, all the way to the conclusion. Each section is generated by feeding specific research questions into the DeepSearch. The final phase involves consolidating all sections into a single prompt to improve the overall narrative coherence.
(I also recommend that you stare very hard at the diagrams.)
Let me paraphrase what Xiao is saying here:
A DeepSearch is a primitive — it does mostly the same thing a regular LLM query does, but with a lot of trained-in thinking and searching work, to ensure that it is producing a rigorous answer to your question. Which is great: it means that DeepSearch is more likely to say "I don't know" than to hallucinate an answer. (This is extremely important as a building block; an agent needs to know when a query has failed so it can try again / try something else.)
However, DeepSearch alone still "hallucinates" in one particular way: it "hallucinates understanding" of the topic, thinking that it already has a complete mental toolkit of concepts needed to solve your problem. It will never say "solving this sub-problem seems to require inventing a new tool" and so "branch off" to another recursed DeepSearch to determine how to do that. Instead, it'll try to solve your problem with the toolkit it has — and if that toolkit is insufficient, it will simply fail.
Which, again, is great in some ways. It means that a single DeepSearch will do a (semi-)bounded amount of work. Which means that the costs of each marginal additional DeepSearch call are predictable.
But it also means that you can't ask DeepSearch itself to:
• come up with a mathematical proof of something, where any useful proof strategy will implicitly require inventing new math concepts to use as tools in solving the problem.
• do investigative journalism that involves "chasing leads" down a digraph of paths; evaluating what those leads have to say; and using that info to determine new leads.
• "code me a Facebook clone" — and have it understand that doing so involves iteratively/recursively building out a software architecture composed of many modules — where it won't be able to see the need for many of those modules at "design time", but will only "discover" the need to write them once it gets to implementation time of dependent modules and realizes that to achieve some goal, it must call into some code / entire library that doesn't exist yet. (And then make a buy-vs-build decision on writing that code vs pulling in a dependency... which requires researching the space of available packages in the ecosystem, and how well they solve the problem... and so on.)
A DeepResearch model, meanwhile, is a model that looks at a question, and says "is this a leaf question that can be answered directly — or is this a question that needs to be broken down and tackled by parts, perhaps with some of the parts themselves being unknowns until earlier parts are solved?"
A DeepResearch model does a lot of top-level work — probably using DeepSearch! — to test the "leaf-ness" of your question; and to break down non-leaf questions into a "battle plan" for solving the problem. It then attempts solutions to these component problems — not by calling DeepSearch, but by recursively calling itself (where that forked child will call DeepSearch if the sub-problem is leaf-y, or break down the sub-problem further if not.)
A DeepResearch model will then take the solutions derived for prerequisite sub-problems into account in the solution space of the problems that depend on them. (A DeepResearch model may also be trained to notice when it has "worked itself into a corner" by coming up with early-phase solutions that make later phases impossible, and to backtrack and solve the earlier phases differently, now with in-context knowledge of the constraints of the later phases.)
Once a DeepResearch model finds a successful solution to all subproblems, it takes the hierarchy of thinking/searching logs it generated in the process, and strips out all the dead-ends and backtracking, to present a comprehensible linear "success path." (Probably it does this as the end-step of each recursive self-call, before returning to self, to minimize the amount of data returned.)
Note how this last reporting step isn't "generating a report" for human consumption; it's a DeepResearch call "generating a report" for its parent DeepResearch call to consume. That's special sauce. (And if you think about it, the top-level call to this whole thing is probably going to use a non-DeepResearch model at the end to rephrase the top-level DeepResearch result from a machine-readable recurse-result report into a human-readable report. It might even use a DeepSearch model to do that!)
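A rough sketch of that recursive shape (this is my reading of the pattern, not Han's actual implementation; `deep_search()`, `is_leaf()`, `decompose()` and `consolidate()` stand in for the underlying model calls):

    def deep_search(question: str) -> str:
        """Placeholder for the DeepSearch primitive: a bounded search/read/reason call."""
        raise NotImplementedError

    def is_leaf(question: str) -> bool:
        """Placeholder: ask a model whether the question is directly answerable."""
        raise NotImplementedError

    def decompose(question: str, solved: dict[str, str]) -> list[str]:
        """Placeholder: break a non-leaf question into sub-questions, given what is
        already solved (some sub-problems only become visible after earlier ones)."""
        raise NotImplementedError

    def consolidate(question: str, solved: dict[str, str]) -> str:
        """Placeholder: strip dead ends and write a linear 'success path' report
        for the parent call (or, at the top level, for the human)."""
        raise NotImplementedError

    def deep_research(question: str) -> str:
        if is_leaf(question):
            return deep_search(question)        # leaf question: hand off to the primitive
        solved: dict[str, str] = {}
        todo = decompose(question, solved)      # initial battle plan
        while todo:
            sub = todo.pop(0)
            solved[sub] = deep_research(sub)    # recurse: sub-problems may decompose further
            # Re-plan: newly solved parts can reveal sub-problems that weren't visible before.
            todo += [q for q in decompose(question, solved) if q not in solved and q not in todo]
        return consolidate(question, solved)    # this "report" is consumed by the parent call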
---
Bonus tangent:
Despite DeepSearch + DeepResearch using a scientific-research metaphor, I think an enlightening comparison is with intelligence agencies.
DeepSearch alone does what an individual intelligence analyst does. You hand them an individually-actionable question; they run through a "branching, but vaguely bounded in time" process of thinking and searching, generating a thinking log in the process, eventually arriving at a conclusion; they hand you back an answer to your question, with a lot of citations — or they "throw an exception" and tell you that the facts available to the agency cannot support a conclusion at this time.
Meanwhile, DeepResearch does what an intelligence agency as a whole does:
1. You send the agency a high-level strategic Request For Information;
2. the agency puts together a workgroup composed of people with trained-in expertise with breaking down problems (Intelligence Managers), and domain-matter experts with a wide-ranging gestalt picture of the problem space (Senior Intelligence Analysts), and tasks them with breaking down the problem into sub-problems;
3. some of these sub-problems are actionable — they can be assigned directly for research by a ground-level analyst; some of these sub-problems have prerequisite work that must be done to gather intelligence in the field; and some of these sub-problems are unknown unknowns — missing parts of the map that cannot be "planned into" until other sub-problems are resolved.
4. from there, the problem gets "scheduled" — in parallel, (the first batch of) individually-actionable questions get sent to analysts, and any field missions to gather pre-requisite intelligence are kicked off for planning (involving spawning new sub-workgroups!)
5. the top-level workgroup persists after their first meeting, asynchronously observing the reports from actionable questions; scheduling newly-actionable questions to analysts once field data comes in to be chewed on; and exploring newly-legible parts of the map to outline further sub-problems.
6. If this scheduling process runs out of work to schedule, it's either because the top-level question is now answerable, or because the process has worked itself into a corner. In the former case, a final summary reporting step is kicked off, usually assigned to a senior analyst. In the latter case, the workgroup reconvene to figure out how to backtrack out of the corner and pursue alternate avenues. (Note that, if they have the time, they'll probably make "if this strategy produces results that are unworkable in a later step" plans for every possible step in their original plan, in advance, so that the "scheduling engine" of analyst assignments and fieldwork need never run dry waiting for the workgroup to come up with a new plan.)
You're right, Han didn't define DeepResearch as "a cosmetic enhancement". I quoted his sentence-long definition:
> DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports.
But then called it "a cosmetic enhancement" really to be slightly dismissive of it - I'm a skeptic of the report format because I think the way it's presented makes the information look more solid than it actually is. My complaint is at the aesthetic level, not relating to the (impressive) way the report synthesis is engineered.
So yeah, I'm being inaccurate and a bit catty about it.
Your explanation is much closer to what Han described, and much more useful than mine.
One of my co-workers joked at the time that "sure AlphaGO beat Lee Sedol at GO, but Lee has a much better self-driving algorithm."
I thought this was funny at the time, but I think as more time passes it does highlight the stark gulf that exists between the capability of the most advanced AI systems and what we expect as "normal competency" from the most average person.
> it does highlight the stark gulf that exists between the capability of the most advanced AI systems and what we expect as "normal competency" from the most average person
Yes, but now we're at the point where we can compare AI to a person, whereas five years ago the gap was so big that that was just unthinkable.
I mean, people thought ELIZA was AI back in the 1960s. Everyone always thinks "this is it!!".
> people thought ELIZA
But which people? The people whose reaction shows that a supplement of extra intelligence, even a synthetic one, is sought.
It was. The definition of "AI" keeps shifting.
Love me some good old whataboutism (sure, LLMs are now super-intelligent at writing software, but can they clean my kitchen? No? Ha!)
The computer beat me at chess, but it was no match for me at kickboxing. - Emo Phillips
Tale as old as time. We can make nice software systems, but general-purpose AI / agents aren't here yet.
Worse than that: it seems that it's much easier to make a computer achieve superhuman feats in cognitive work than it is to make it do even the most basic physical interactions with the real world.
In short: the natural order of things is that computers are better at thinking, and people are better at manual labor. Which is the opposite of what we wanted.
AI is just hydraulics for the mind. Or should be.
I choose a direction and apply force.
I think this captures one of the bigger differences between what OpenAI offers and what others offer under the same name. Funnily enough, Google's Gemini 2.0 Flash also has a native integration with Google Search[1]. They have not done it with their Thinking model; when they do, we will have a good comparison.
One of the implications of OpenAI's DR is that frontier labs are more likely to train specific models for a bunch of tasks, resulting in the kind of quality that wrappers will find hard to replicate. This is leading towards model + post-training RL as a product, instead of keeping them separate from the final wrapper as the product. Might be interesting times if the trajectory continues.
PS: There is also genspark MOA[2], which creates an in-depth report on a given prompt using a mixture of agents. From what I have seen in 5-6 generations, this is very effective.
[1]: https://x.com/_philschmid/status/1896569401979081073 (I might be misunderstanding this, but this seems to be a native call instead of an explicit one)
[2]: https://www.genspark.ai/agents?type=moa_deep_research
deep search is the new RAG
Deep Search is RAG - that is, if we're still expanding the acronym instead of treating it as a word that just means "queries a vector database".
Prediction for the Next Hot Thing in Q4 2025 / Q1 2026: someone will make the Nobel-prize-worthy discovery that you can stuff the results of your deep search into a database (vector or otherwise) and then use it to compile a higher-quality report from a much larger number of sources.
We'll call it DeepRAG or Retrieval Augmented Deep Research or something.
Prediction for Q2 2026: next Nobel prize awarded for realizing you may as well stop treating report generation as the core aspect of "deep research" (as it obviously makes no sense, but hey, time traveler's spoilers, sorry!), and stop at the "stuff search results into a database" and let users "chat with the search results", making the research an interactive process.
We'll call this huge scientific breakthrough "DeepRAG With Human Feedback", or "DRHF".
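Joke aside, the mechanics of that prediction are buildable today; a minimal sketch with placeholder `embed()` and `llm()` helpers and a toy in-memory vector store:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder: return an embedding vector for the text."""
        raise NotImplementedError

    def llm(prompt: str) -> str:
        """Placeholder: call your LLM of choice."""
        raise NotImplementedError

    class TinyVectorStore:
        def __init__(self) -> None:
            self.docs: list[str] = []
            self.vecs: list[np.ndarray] = []

        def add(self, text: str) -> None:
            self.docs.append(text)
            self.vecs.append(embed(text))

        def top_k(self, query: str, k: int = 5) -> list[str]:
            q = embed(query)
            scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in self.vecs]
            return [doc for _, doc in sorted(zip(scores, self.docs), reverse=True)[:k]]

    # "DRHF": stuff the deep-search output into the store, then let the user
    # chat with the accumulated results instead of reading a fixed report.
    def chat(store: TinyVectorStore) -> None:
        while True:
            question = input("you> ")
            context = store.top_k(question)
            print(llm(f"Answer using only these snippets:\n{context}\n\nQ: {question}"))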
There's a DeepRAG paper already ser https://arxiv.org/abs/2502.01142
Then we'll have to call my thing Deep Embedding RAG Protocol, or DERP.
DR is a nice way to gather information, when it works, and then do the real research yourself from a concentrated launching point. It helps me avoid ADD braining myself into oblivion every time I search the internet.
The fatal mistake is thinking that the LLM is now wiser for having done it. When someone does their research, they are now marginally more of an authority on that topic than everyone else in the room, all else being equal. But for LLMs, it's not like they have suddenly acquired more expertise on the subject now that they did this survey. So it's actually pretty shallow, not deep, research.
It's a cool capability, and a nifty way to concentrate information, but much deeper capabilities will be required to have models that not only truly synthesize all that information, but actively apply it to develop a thesis or further a research program.
Truthfully, I don't see how this is possible within the transformer architecture, with its lack of memory or genuine statefulness and therefore absence of persistent real time learning. But I am generally a big transformer skeptic.
Are you telling me that AIs are starting to diverge and that we might get a combinatorial explosion of reasoning paths that will produce so many different agents that we won't know which one can actually become AGI?
https://leehanchung.github.io/assets/img/2025-02-26/05-quadr...
They are certainly diverging, and becoming more useful tools, but the answer to “which one can actually become AGI?” is, as always, “none of them.”
AGI is about performing actions with a high multi-task intelligence. Only the top right corner (Deep, Trained) has any hope of getting closer to AGI. The rest can still be useful for specific tasks, e.g. "deep-research".
Life is so complicated.
AGI is, and always was, marketing from LLM providers.
Real innovation is going in with task-specific models like AlphaFold.
LLMs are starting to become more task-specific too, as we've seen with the performance of reasoning models on their specific tasks.
I imagine we’ll see LLMs trained specifically for medical purposes, legal purposes, code purposes, and maybe even editorial purposes.
All useful in their own way, but none of them even close to sci-fi.
AGI is a term that's been around for decades.
AI was a term that was around for decades as well, but its meaning dramatically changed over the past 3 years.
Prior to GPT-3, AI was rarely used in marketing or to talk about any number of ML methods.
Nowadays “AI” is just the new “smart” for marketing products.
Terms change. The current usage of AGI, especially in the context I was talking about, is specifically marketing from LLM providers.
I’d argue that the term AGI, when used in a non fiction context, has always been a meaningless marketing term of some kind.
Feels like you were born just yesterday :).
> Prior to GPT-3, AI was rarely used in marketing or to talk about any number of ML methods.
In the decade prior to GPT-3, AI was frequently used in marketing to talk about any ML methods, up to and including linear regression. This obviously ramped up heavily after "Deep Learning" got coined as a term.
AI now actually means something in marketing, but the only reason for that is that calling out to an LLM is even simpler than adding linear regression to your product somewhere.
As for AGI, that was a hot topic in some circles (circles that are now dismissed as "AI doomers") for decades. In fact, OpenAI started with people associated with, or at least within the sphere of influence of, the LessWrong community, which both influenced the naming and perspective the "LLM industry" started with, and briefly put the output of LessWrong in the spotlight - which is why everyone now uses terms like "AGI" and "alignment" and "AI safety".
However, unlike "alignment", which got completely butchered as a term, AGI still roughly means what it meant before - basically the birth of a man-made god. That holds for AGI as a "meaningless marketing term" too, if the people so bullish on it paused to follow the implications through beyond "oh, it's like ChatGPT, but actually good at everything".
> has always been
Well, now it is not: it now marks "the difference between something whose outputs merely sound plausible and something whose outputs are properly checked".
History begs to differ – one of the biggest learnings is that larger, generic models always win where generalization pays off (i.e. all those agents and what-not; this doesn't apply to specialized models like AlphaGo or AlphaFold, which are not general models).
> one of the biggest learnings is that larger, generic models always win
You’re confusing several different ideas here.
The idea you’re talking about is called “the bitter lesson.” It (very basically) says that a model with more compute put behind it will perform better than a cleverer method which may use less compute. Has nothing to do with being “generic.” It’s also worth noting that, afaik, it’s an accurate observation, but not a law or a fact. It may not hold forever.
Either way, I’m not arguing against that. I’m saying that LLMs are too general to be useful in specific, specialized, domains.
Sure bigger generic models perform (increasingly marginally) better at the benchmarks we’ve cooked up, but they’re still too general to be that useful in any specific context. That’s the entire reason RAG exists in the first place.
I’m saying that a language model trained on a specific domain will perform better at tasks in that domain than a similar sized model (in terms of compute) trained on a lot of different, unrelated text.
For instance, a model trained specifically on code will produce better code than a similarly sized model trained on all available text.
I really hope that example makes what I’m saying self-evident.
You're explaining it nicely and then seem to make a mistake that contradicts what you've just said – because code and text share a domain (text-based), large, generic models will always out-compete smaller, specialized ones – that's the lesson.
If you compare it with, e.g., a model for self-driving cars, generic text models will not win because they operate in a different domain.
In all cases, trying to optimize for specialized tasks within a domain is not worth the investment, because the state of the art will be held by larger models trained on the whole available set.
I think they're wrong, but you're also making a related mistake here:
> You're explaining it nicely and then seem to make a mistake that contradicts what you've just said – because code and text share a domain (text-based)
"Text" is not the domain that matters.
The whole trick behind LLMs being as capable as they are, is that they're able to tease out concepts from all that training text - concepts of any kind, from things to ideas to patterns of thinking. The latent space of those models has enough dimensions to encode just about any semantic relationship as some distinct direction, and practice shows this is exactly what happens. That's what makes style transfer pretty much a vector operation (instead of "-King +Woman", think "-Academic, +Funny"), why LLMs are so good at translating between languages, from spec to code, and why adding modalities worked so well.
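(For a concrete, if toy, version of that "direction in latent space" point: the classic word-vector analogy below is the word-level ancestor of the "-Academic, +Funny" arithmetic. The sketch assumes gensim's small pretrained GloVe vectors; it's not how LLM latent spaces are actually probed, just the same idea in miniature.)

```python
# Toy illustration: semantic relationships as directions in embedding space.
# Uses gensim's small pretrained GloVe vectors; purely illustrative.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~ queen: the "gender" relationship is roughly a direction.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```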
With LLMs, the common domain between "text" and "code" is not "text", but the way humans think, and the way they understand reality. It's not the raw sequences of tokens that map between, say, poetry or academic texts and code - it's the patterns of thought behind those sequences of tokens.
Code is a specific domain - beyond being the lifeblood of programs, it's also an exercise in a specific way of thinking, taken up to 11. That's why learning code turned out to be crucial for improving general reasoning abilities of LLMs (the same is, IMO, true for humans, but it's harder to demonstrate a solid proof). And conversely, text in general provides context for code that would be hard to infer from code alone.
> because code and text share a domain (text-based), large, generic models will always out-compete smaller, specialized ones – that's the lesson
All digital data is just 1s and 0s.
Do you think a model trained on raw bytes would perform coding tasks better than a model trained on code?
I have a strong hunch that there’s some Goldilocks zone of specificity for statistical model performance and I don’t think “all text” is in that zone.
Here is the article for “the bitter lesson.” [0]
It says that general machine-learning strategies which use more compute are better at learning a given data set than a strategy tailor-made for that set.
This does not imply that training on a more general dataset will yield more performance than using a more specific dataset.
The lesson is about machine learning methods, not about end-model performance at a specific task.
Imagine a logistic regression model vs an expert system for determining real estate prices.
The lesson tells us that, given more and more compute, the logistic regression model will perform better than the expert system.
The lesson does not imply that, given two logistic regression models, one trained on global real estate data and one trained on local data, the former would outperform the latter.
I realize this is a fine distinction and that I may not be explaining it as well as I could if I were speaking, but it’s an important distinction nonetheless.
[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
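To make that distinction concrete, here's a sketch under entirely made-up assumptions (scikit-learn, synthetic data, an arbitrary "expensive listing" threshold): both models use the exact same method, so the bitter lesson has nothing to say about which wins; the match between training data and evaluation distribution does.

```python
# Sketch of the "same method, different training data" distinction, using
# scikit-learn on synthetic data. All numbers and names are illustrative,
# not a real experiment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
EXPENSIVE = 300_000  # arbitrary "expensive listing" threshold

def make_market(n, price_per_sqm):
    """Synthetic market: predict 'is this listing expensive?' from floor area."""
    sqm = rng.uniform(30, 200, n)
    price = sqm * price_per_sqm + rng.normal(0, 20_000, n)
    X = sqm.reshape(-1, 1)
    y = (price > EXPENSIVE).astype(int)
    return X, y

X_local, y_local = make_market(2_000, price_per_sqm=5_000)     # the market we care about
X_global, y_global = make_market(50_000, price_per_sqm=2_000)  # more data, different market
X_test, y_test = make_market(1_000, price_per_sqm=5_000)       # evaluated on the local market

local_model = LogisticRegression().fit(X_local, y_local)
global_model = LogisticRegression().fit(X_global, y_global)

print("trained on local data:  ", local_model.score(X_test, y_test))
print("trained on global data: ", global_model.score(X_test, y_test))
# Same learning method in both cases, so the bitter lesson says nothing here;
# the model whose training data matches the evaluation distribution wins.
```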
> AGI is, and always was, marketing from LLM providers.
TIL: The term AGI, which we've been using since at least 1997[0], was invented by time-traveling LLM companies in the 2020s.
[0]: https://ai.stackexchange.com/questions/20231/who-first-coine...
He didn't say that it was invented by LLM providers...
> In natural language processing (NLP) terms, this is known as report generation.
I'm happy to see some acknowledgement of the world before LLMs. This is an old problem, and one I (or my team, really) was working on at the time of DALL-E & ChatGPT's explosion. As the article indicated, we deemed GPT-3.5 unacceptable for Q&A almost immediately, as the failure rate was too high for operational reporting in such a demanding industry (legal). We instead employed a SQuAD-trained extractive QA model and polished up the output with an LLM.
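(Roughly that kind of pipeline, sketched below under assumed tooling - HuggingFace transformers and a publicly available SQuAD2-fine-tuned checkpoint; the polish step is just a placeholder prompt, not the actual system described.)

```python
# Rough sketch of an "extractive QA first, LLM polish second" pipeline.
# Assumes HuggingFace transformers and a public SQuAD2-fine-tuned model;
# this illustrates the pattern, not the commenter's actual system.
from transformers import pipeline

extractive_qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",  # a common SQuAD2-fine-tuned checkpoint
)

def answer(question, context):
    # Step 1: extract a span from the source document. The answer stays
    # grounded in the text, with a score and character offsets you can audit.
    span = extractive_qa(question=question, context=context)
    # Step 2 (placeholder): hand the extracted span to an LLM only to smooth
    # the wording, never to add facts.
    polish_prompt = f"Rewrite this extracted answer as one clean sentence: {span['answer']}"
    return span, polish_prompt  # the prompt would go to whatever LLM does the polish
```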
These new reasoning models that effectively retrofit Q&A capabilities (an extractive task) onto a generative model are impressive, but I can't help but think that it's putting the cart before the horse and will inevitably give diminishing returns in performance. Time will tell, I suppose.
It's interesting that it says Grok excels at report generation, because I've found myself asking it to give me answers in table format to make it easier to 'grok' the output, since I'm usually asking it for comparisons I just can't do natively on Amazon or any other e-commerce site.
Funnily enough, Amazon will pick products for you to compare, but the compared items are usually terrible, and you can't just add whatever you want or choose the columns.
With Grok, I'll have it remove columns, add columns, shorten responses, so on and so forth.
As a user, I've found that researching the same topics in OpenAI Deep Research vs Perplexity's Deep Research results in "narrow and deep" vs "shallow and broad".
OpenAI tends to have something like 20 high quality sources selected and goes very deep in the specific topic, producing something like 20-50 pages of research in all areas and adjacent areas. It takes a lot to read but is quite good.
Perplexity tends to hit something like 60 or more sources, goes fairly shallow, answers some questions in general ways but is excellent at giving you the surface area of the problem space and thoughts about where to go deeper if needed.
OpenAI takes a lot longer to complete, perhaps 20x longer. This factors heavily into whether you want a surface-y answer now or a deep answer later.
I went through this journey myself with Deep Search / Research: https://github.com/btahir/open-deep-research
I think it really comes down to your own workflow. You sometimes want to be more imperative (select the sources yourself to generate a report) and sometimes more declarative (let a DFS/BFS algorithm split a query into subqueries, go down rabbit holes to some depth, and then aggregate - see the sketch below).
I've been trying different ways of optimizing the former, but I'm fascinated by the more end-to-end flows that systems like STORM do.
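A minimal sketch of the declarative flow mentioned above - split a query into subqueries, recurse to a fixed depth, then aggregate. generate_subqueries() and search() are placeholders for whatever LLM and search backend you actually wire in:

```python
# Minimal sketch of a declarative deep-research loop: split a query into
# subqueries, recurse down "rabbit holes" to a fixed depth, then aggregate.
# Both helpers below are placeholders, not a real LLM or search API.

def generate_subqueries(query, n=3):
    """Placeholder: in a real system, ask an LLM to break the query into n narrower questions."""
    return [f"{query} - subtopic {i + 1}" for i in range(n)]

def search(query):
    """Placeholder: in a real system, call a search API and return result snippets."""
    return [f"result snippet for: {query}"]

def deep_research(query, depth=2):
    findings = {query: search(query)}
    if depth > 0:
        for sub in generate_subqueries(query):
            findings.update(deep_research(sub, depth - 1))  # the rabbit holes
    return findings  # aggregation step: summarize findings into a report or an index

results = deep_research("open source deep research tools", depth=2)
print(len(results), "queries explored")
```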
It's amazing how these are the biggest information-organizing platforms on the internet, and yet they fail to find different words to describe their products.
The primary issues with deep research tools are veracity and accurate source attribution. My issue with tools relying on DeepSeek R1, for example, is the high hallucination rate.
This gives STORM a high mark but didn't seem to get great results from GPT Researcher, which is the other open-source project that was doing this before DeepResearch became the recent flavor of the day.
But there are so many ways to configure GPT Researcher for all kinds of budgets that I wonder whether this comparison really pushed the output or just went with the defaults and got default mid-range results for comparison.
Isn't this the worst possible case for an LLM - where the integrity of the product is central to its value, and the user is by definition unable to verify that integrity?
I'm not following, what's "by definition" here? You can verify the integrity of an AI report in the same way you would do with any other report someone prepared for you - when you encounter something that feels wrong, check the referenced source yourself.
I think it's normal to invest authority in a report that someone else has prepared - people take what is written on trust because the person who prepared it is accountable for any errors... not just now, but forever.
LLMs are not accountable for anything.
That's a lofty ideal, but the typical research report is flawed in multiple ways; even in academia, "Most Published Research Findings Are False" [0], and very few people's careers have suffered in any way, as there's very little de facto accountability behind this trust.
The best case scenario, as I see it, is that we become more critical of research in general and start training AI to help us identify these issues. And then we could perhaps use a GAN-style setup to improve report generation.
[0] https://en.wikipedia.org/wiki/Replication_crisis
P.S. There's a paper from today demonstrating the effectiveness of AI identifying errors in human research: https://news.ycombinator.com/item?id=43295692
I noticed that these models very quickly start to underperform regular search like Perplexity Pro 3x. It might be somewhat thorough in how it goes line by line, but it's not very cognizant of good sources - you might ask for academic sources but if your social media slider is turned on, it will overwhelmingly favour Reddit.
You may repeat instructions multiple times, but it ignores them or fails to understand them.
You have to be specific about which model you're referring to. OpenAI DeepResearch does not have a slider, and does follow when you say to only use academic sources.
Why is there a citation (@article) block at the end? Do people actually use it?
I think "deep research" is a misnomer, possibly deliberate. Research assumes the ability to determine quality, freshness, and veracity of the sources directly from their contents. It also quite often requires that you identify where the authors screwed up, lied, chose deliberately bad baselines, and omitted results in order to make their work "more impactful". You'd need AGI to do that. This is merely search - it will search for sources, summarize them for you, and write a report, but you then have to go in and _verify it's not bullshit_, which can take quite a bit of time and effort, if you care about the quality of the final result, which for big ticket questions you almost always do.
Yes, but what about Deep Research?
What is the best open-source system to use?
Neat summary but you forgot Grok!
Maybe the article has been edited in the last four minutes since you posted, but Grok is definitely in there now.