PlasmonOwl 2 days ago

Ok so I am always interested in these papers as a chemist. Often, we find that LLMs are terrible at chemistry. This is because the lived experience of a chemist is fundamentally different from the education they receive. Often, a master’s student takes 6 months to become productive at research in a new subfield; a PhD, around 3 months.

Most chemists will begin to develop an intuition. This is where the issues develop.

This intuition is a combination of the chemist’s mental model and how the sensory environment stimulates it. As a polymer chemist, in a certain system maybe brown means I see scattering, hence particles. My system is supposed to be homogeneous, so I bin the reaction.

It is well known that good grades don’t make good researchers. That’s because researchers aren’t doing rote recall.

So the issue is this: we ask the LLM, how many proton environments are in this NMR?

We should ask: I’m intercalating Li into a perovskite using BuLi. Why does the solution turn pink?

  • Workaccount2 2 days ago

    I think a huge reason why LLMs are so far ahead in programming is because programming exists entirely in a known and totally severed digital environment outside our own. To become a master programmer all you need is a laptop and an internet connection. The nature of it existing entirely in a parallel digital universe just lends itself perfectly to training.

    All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs, I just think that it is in a form that is too difficult right now to train on. However if you could train a model on them, I strongly suspect they would get to the same level they are at today with software.

    • alganet 2 days ago

      > I think a huge reason why LLMs are so far ahead in programming

      Are they? Last time I checked (couple of seconds ago), they still made silly mistakes and hallucinated wildly.

      Example: https://imgur.com/a/Cj2y8km (AI teaching me about the Coltrane operator, that obviously does not exist).

      • gcanko 2 days ago

        You're using the worst model when it comes to programming, so I'm not sure what point you're trying to prove here. That's why, when someone starts ranting about how useless AI models are when it comes to coding, I always assume they're just using inferior models.

        • alganet 2 days ago

          My question was very simple. Suitable for a simpler model.

          I can come up with prompts that make better models hallucinate (see post below).

          I don't understand your objection. This is a known fact: LLMs hallucinate shit regardless of the model size.

          • CamperBob2 2 days ago

            LLMs are getting better. Are you?

            Nothing matters in this business except the first couple of time derivatives.

            • alganet a day ago

              Maybe I'm not.

              However, I'm discussing this within the context of the study presented in the paper, not some future yet-to-be-achieved performance expectation.

              If we step outside the context of the paper (not advised), I think any average developer is better than an LLM at energy efficiency. LLMs cheat by consuming more resources than a human. "Better" is quite relative. So, let's keep it reasonable.

      • aoeusnth1 2 days ago

        Are you intentionally sandbagging the LLMs to prove a point, or do you really think 4o-mini is good enough for programming?

        Even 2.5 flash easily gets this https://imgur.com/a/OfW30eL

        • alganet 2 days ago

          The point is that I can make them hallucinate quite easily. And they don't demonstrate knowing their own limitations.

          For example, 2.5 Flash fails to explain the difference between the short ternary operator (null coalescing) and the Elvis operator.

          https://imgur.com/a/xKjuoqV

          Even when I specify a language (therefore clearing up the confusion, supposedly), it still fails to even recognize the Elvis operator by its toupee, and mixes up the explanation (it doesn't even understand what I asked).

          https://imgur.com/a/itr87hM

          So, the point I'm trying to make is that they're not any better for programming than they are for chemistry.

          • CamperBob2 2 days ago

            Flash is the wrong model for questions like that -- not that you care -- but if you'd like to share the actual prompt you gave it, I'll try it in 2.5 Pro.

            • alganet 2 days ago

              "explain me the difference between the short ternary operator and the Elvis operator"

              When it failed, I replied: "in PHP".

              You don't seem to understand what I'm trying to say and instead are trying to defend LLMs for a fault that is well known across the industry.

              I'm sure that in a short time I could make 2.5 Pro hallucinate as well. If not on this question, then on others.

              This behavior is in line with the paper's conclusions:

              > many models are not able to reliably estimate their own limitations.

              (see Figure 3, they tested a variety of models of different qualities).

              This is the kind of question a junior developer can answer with simple Google searches, by reading the PHP manual, or just by testing it in a REPL. Why do we need a fancy model in order to answer such a simple inquiry? Would a beginner know that the answer is incorrect and that they should use a different model?

              Also, from the paper:

              > For very relevant topics, the answers that models provide are wrong.

              > Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry.

              That's true for programming as well. It outperforms the average human, but then it makes silly mistakes that could confuse beginners. It displays confidence in being plain wrong.

              The study also used manually curated questions for evaluation, so my prompt is not some dirty trick. It's totally in line with the context of this discussion.

              • CamperBob2 a day ago

                It's better than it was a year ago, as you'd have discovered for yourself if you used current models. Nothing else matters.

                See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89

                If it doesn't, what specifically is incorrect?

                • alganet a day ago

                  What I expect from a human is to ask "in which language?", because it makes a difference. If no language was supplied, I expect a brief summary of null coalescing and shorthand ternary options with useful examples in the most popular languages.

                  --

                  The JavaScript example should have mentioned the use of `||` (the or operator) to achieve the same effect as a shorthand ternary. It's common knowledge.

                  In PHP specifically, `??` allows you to null coalesce array keys and other types of complex objects. You don't need to write `isset($arr[1]) ? $arr[1] : "ipsum"`, you can just `$arr[1] ?? "ipsum"`. TypeScript has it too and I would expect anyone answering about JavaScript to mention that, since it's highly relevant for the ecosystem.

                  Also in PHP, there is the `?:` that is similar to what `||` does in JavaScript in an assignment context, but due to type juggling, it can act as a null coalesce operator too (although not for arrays or complex types).

                  The PHP example they present, therefore, is plain wrong and would lead to a warning for trying to access an unset array key. Something that the `??` operator (not mentioned in the response) would solve.

                  I would go as far as explaining null conditional accessors as well, `$foo?->bar` or `foo?.bar`. Those are often colloquially called Elvis operators and fall within the same overall problem-solving category.
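
                  To make this concrete, here's a minimal PHP sketch of what I mean (assuming PHP 8+ for the nullsafe accessor; the variable names are just for illustration):

                      <?php
                      $arr = ["lorem"];                              // index 1 is never set

                      // Null coalescing (??): safe on an unset key, no warning.
                      echo $arr[1] ?? "ipsum", "\n";                 // prints "ipsum"

                      // The older idiom it replaces:
                      echo isset($arr[1]) ? $arr[1] : "ipsum", "\n"; // prints "ipsum"

                      // Shorthand ternary / Elvis (?:): falls back on any falsy value,
                      // much like || in a JavaScript assignment, but it does NOT guard
                      // against reading an unset key:
                      // echo $arr[1] ?: "ipsum";                    // "Undefined array key 1" warning
                      $name = "";
                      echo $name ?: "anonymous", "\n";               // "anonymous" ("" is falsy)
                      echo $name ?? "anonymous", "\n";               // "" (?? only reacts to null)

                      // Nullsafe accessor (?->): short-circuits to null instead of erroring.
                      $user = null;
                      echo $user?->name ?? "guest", "\n";            // prints "guest"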

                  The LLM answer is a dangerous mix of incomplete and wrong. It could lead a beginner to adopt an old bad practice, or leave a beginner without a more thorough explanation. Worst of all, the LLM makes those mistakes with confidence.

                  --

                  What I think is going on is that null handling is such a basic task that programmers learn it in the first few years of their careers and almost never write about it. There's no need to. I'm sure a code-completion LLM can code using those operators effectively, but LLMs cannot talk about them consistently. They'll only get better at it if we get better at it, and we often don't need to write about it.

                  In this particular Elvis operator case, there has been no significant improvement in the correctness of the answer in __more than 2 whole years__. Samples from ChatGPT in 2023 (note my image date): https://imgur.com/UztTTYQ https://imgur.com/nsqY2rH.

                  So, _for some things_, contrary to what you suggested before, LLMs are not getting that much better.

                  • CamperBob2 a day ago

                    Having read the reply in 2.5 Pro, I have to agree with you there. I'm surprised it whiffed on those details. They are fairly basic and rather important. It could have provided a better answer (I fed your reply back to it at https://g.co/gemini/share/7f87b5e9d699 ), but it did a crappy job deciding what to include in its initial response.

                    I don't agree that you can pick one cherry example and use it to illustrate anything general about the progress of the models, though. There are far too many counterexamples to enumerate.

                    (Actually I suspect what will happen is that we'll change the way we write documentation to make it easy for LLMs to assimilate. I know I'm already doing that myself.)

                    • alganet a day ago

                      > I don't agree that you can pick one cherry example

                      Benchmarks and evaluations are made of cherry-picked examples. What makes my example invalid, and benchmark prompts valid? (It's a rhetorical question, you don't need to answer.)

                      > write documentation to make it easy for LLMs to assimilate.

                      If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.

                      • CamperBob2 12 hours ago

                        > If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.

                        If you buy into the whole AGI thing, I guess so, but I don't. We don't have a good definition of intelligence, so it's a meaningless question.

                        We do know how to make and use tools, though. And we know that all tools, especially the most powerful and/or hazardous ones, reward the work and care that we put into using them. Further, we know that tool use is a skill, and that some people are much better at it than others.

                        > What makes my example invalid, and benchmark prompts valid?

                        Your example is a valid case of something that doesn't work perfectly. We didn't exactly need to invent AI to come up with something that didn't work perfectly. I have examples of using LLMs to generate working, useful code in advanced, specialized disciplines, code that I frankly don't understand myself and couldn't have written without months of study, but that I can validate.

                        Just one of those examples is worth a thousand examples like yours, in my book. I can now do things that were simply impossible for me before. It would take some nerve to demand godlike perfection on top of that, or to demand useful results with little or no effort on my part.

                        • alganet 10 hours ago

                          > We do know how to make and use tools

                          It's the same principle. A tool is supposed to assist us, not the other way around.

                          An LLM, "AGI magic" or not, is supposed to write for me. It's a tool that writes for me. If I am writing for the tool, there's something wrong with it.

                          > I have examples [...] Just one of those examples is worth a thousand examples like yours

                          Please, share them. I shared my example. It can be a very small "bug report", but it's real and reproducible. Other people can build on it, either to improve their "tool skills" or to improve LLMs themselves.

                          An example that is shared is worth much more than an anecdote.

                          • CamperBob2 8 hours ago

                            It's hard to get too specific without running afoul of NDAs and such, since most of my work is for one customer or another, but the case that really blew me away was when I needed to find a way to correct an oscillator that had inherent stability problems due to a combination of a very good crystal and very poor thermal engineering on the OEM's part. The customer uses a lot of these oscillators, and they are a massive pain point in production test because they often perform so much worse than they should.

                            I started out brainstorming with o1-pro, trying to come up with ways to anticipate drift on multiple timescales, from multiple influences with differing lag times, and correct it using temperature trends measured a couple of inches away on a different component. It basically said, "Here, train this LSTM model to predict your drift observations from your observed temperature," and spewed out a bunch of cryptic-looking PyTorch code. It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.

                            I was like, Okaaaaayyy....? but I tried it anyway, suggested hyperparameters and all, and it was a real road-to-Damascus moment. Again, I can't share the plots and they wouldn't make sense anyway without a lot of explanation, but the outcome of my initial tests was freakishly good.

                            Another model proved to be able to translate the Python to straight C for use by the onboard controller, which was no mean feat in itself (and also allowed me to review it myself), and now that problem is just gone. Basically for free. It was a ridiculous, silly thing to try, and it worked.

                            When this tech gets another 10x better, the customer won't need me anymore... and that is fucking awesome.

                            • alganet 8 hours ago

                              I too have all sorts of secret stuff that I wouldn't share. I'm not asking for that. Isolating and reproducing example behavior is different from sharing your whole work.

                              > It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.

                              How can you be sure that the solution doesn't have obvious mistakes that an ML engineer would spot right away?

                              > When this tech gets another 10x better

                              A chainsaw is way better than a regular saw, but it's also more dangerous. Learning to use it can be fun. Learning not to cut your toes is also important.

                              I am looking for ways in which LLMs could potentially cut people's toes.

                              I know you don't want to hear that your favorite tool can backfire, and you're still skeptical despite having experienced the example I gave you firsthand. However, I was still hopeful that you could understand my point.

        • CamperBob2 2 days ago

          They aren't getting any better at programming, so they naturally assume the LLMs aren't, either.

  • CGMthrowaway a day ago

    >the lived experience of a chemist is fundamentally different from the education they receive. Most chemists will begin to develop an intuition.

    Is this a documentation problem? The LLMs are only trained on what is written down. Seems to track with another comment further down quoting:

    "Models are limited in ability to answer knowledge-intensive questions, probably because the required knowledge cannot easily be accessed via papers but rather by lookup in specialized databases, which the humans used to answer such questions"

  • fuzzfactor a day ago

    >using BuLi. Why does the solution turn pink?

    I would say odds are it's because of an impurity. My first guess might be the solvent, if there is more in play than reagents or reactants. Maybe it could be confirmed or ruled out by some carefully planned filtration beforehand, which might not even be that difficult. I doubt I would try much further than that unless it was a bad problem.

    Although, for instance, an alternative simple purification like distillation is pretty much routine for getting colorless material from pure aniline, and that's some pretty rough stuff to handle.

    Now, I was once a young chemist facing AI. I ended up highly focused on going forward in ways that would not be "taken over" by AI, and I knew I couldn't be slow or a recession might still catch up with me, plus the 1990s were approaching fast ;)

    By the mid 1990's I figured there's no way the stuff they have in this paper had not been well investigated.

    I always knew it would take people that had way more megabytes than I could afford.

    Sheesh, did I overestimate the progress people were making when I wasn't looking.

calibas 2 days ago

I'm sure an LLM knows more about computer science than a human programmer.

Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.

Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.

It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.

  • mumbisChungo 2 days ago

    It's impressive until you realize its limitations.

    Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.

    • X6S1x6Okd1st 2 days ago

      Also, those limitations keep dropping every six months.

  • logifail 2 days ago

    > Do you know every common programing language?

    A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".

    Her response at the time was "Do they have anything interesting to say in any of them?"

    • dandellion 2 days ago

      As a foreign English speaker, a huge pet peeve of mine is when people use acronyms without having spelled out the full term first. Especially when the acronym is already a word or expression and looking it up just returns a bunch of useless examples (oh!). Eventually I'll find out the meaning (other half), and it always turns out they only saved a total of six or seven letters, which can be typed in less than 0.5 seconds, but in exchange they made their sentence more or less incomprehensible to a large group of people.

      • dylan604 2 days ago

        As a native English speaker, I had no idea what OH was either. I’ve seen SO for significant other (and not Stack Overflow), and I’ve seen references to better half, not just other half. By that choice, I am left to assume this person feels they are the better half which says a lot about them.

        • djtango 2 days ago

          As a native speaker, you probably scratched your head, thought about what could fit in that gap, and eventually worked it out. Then you'll grumble because the other speaker didn't choose your preferred diction.

          As a non native speaker you'll probably just feel upset/hopeless/angry.

          From my experience, "non-native" here includes people who are "fluent".

          So we arrive at the situation where my OH-SO beloved wife is fluent in English and is definitely better than me at writing clearly constructed English essays, but when it comes to usage of random idioms/slang or understanding local (and foreign!) English accents, I have a very clear advantage.

          • dylan604 2 days ago

            Actually, no, other half never popped into my head. I only got it from seeing other comments in the thread from people confused by it as well.

        • daveguy 2 days ago

          > By that choice, I am left to assume this person feels they are the better half which says a lot about them.

          What a ridiculous assumption.

          Maybe they consider themselves and their partner to be equal halves of a whole. You know, the definition of half.

      • Shadowmist 2 days ago

        Paste the comment into an LLM and ask it what it means. Don’t use Google.

      • glenneroo 2 days ago

        OTOH we are one of today's "lucky" 10,000? And future searches will possibly lead to this post, further reducing the friction of using this acronym. Also, newly trained LLMs will be able to answer quicker. Yay?

        I wonder how acronyms such as OTOH even become so well known that they can be used without fear of not being understood? When is that threshold reached? Is using OH now the beginning of a new well-known acronym? I guess only time will tell...

        • theelous3 2 days ago

          the far more common and acceptable-to-use-without-introduction acronym for this is SO (significant other)

          And to answer the question - the threshold is when people stop complaining about the use :)

        • catigula 2 days ago

          I've literally never seen "OTOH" in my life. Anyhow, if you really feel your sentence can't do without it you can say "conversely" which is pretty short and clear.

          • mitb6 2 days ago

            OTOH dates back to the 90s and has since remained very common in internet writing. It is more surprising that you've never seen it than that someone used it.

            It also isn't an exact synonym of "conversely".

            • catigula 2 days ago

              There aren't any exact synonyms in English.

              I've been an extensive internet user for decades and I don't have it in memory, so I'm not sure how to feel about your assertion. I'm not the only person saying this.

              • andruby 2 days ago

                > There aren't any exact synonyms in English.

                I'm sure that depends on the tolerance. "assist" and "help"? "dog" and "canine"? "purchase" and "buy"?

                • catigula 2 days ago

                  I'm not just being pedantic, it's a fairly mainstream assertion in linguistics. I don't find those words synonymous. They have different performative content. I don't know if this applies to other languages.

        • dylan604 2 days ago

          We are not in a text chat using T9 on a numeric keypad where typing is painful. There’s no need for acronyms now except for the attempt at not looking like an old or just lazy. We’re also not limited to 140 chars, so not an advantage there either.

    • arcanemachiner 2 days ago

      > OH

      Other half? I've never seen this acronym before.

    • Upvoter33 2 days ago

      sounds snarky and defensive, tbh

    • mock-possum a day ago

      Is your other half Richard Feynman?

  • timschmidt 2 days ago

    > Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.

    Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.

    • catigula 2 days ago

      And herein lies the fundamental power of the LLM and why it can even solve "impressive" problems: it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.

      LLMs are at their best when the context capacity of the human is stretched and the task doesn't really take any reasoning but requires an extraction of some basic, common pattern.

      • dylan604 2 days ago

        > it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.

        That’s the very reason we built computers. If an LLM did not also meet this definition, there would be no point of it existing

        • catigula 2 days ago

          You're not the first person to suggest that LLMs have no reason to exist.

    • anthk 2 days ago

      Binwalk, Unicorn... as if that was advanced wizardry. Unix systems have had file(1) since forever, and binutils from and to every arch.

      • Energiekomin 2 days ago

        Yes it is, and you're comparing apples with pineapples.

        file can't program in Brainfuck while doing basic binary analysis.

        Binwalk and Unicorn can't do that either. And they can't write to you in multiple natural languages either.

  • yMEyUyNE1 2 days ago

    > There's simply so much to know that the LLM has an inherent advantage.

    But do they understand it? I mean, a child may use swear words, but does it understand their meaning? In another comment, somebody's OH also mentioned artistic ability and the utility of the words spoken.

    • ben_w a day ago

      Does a submarine swim?

      It doesn't matter to my employment prospects if the AI "understands" or "thinks", whatever is meant by that, but rather whether potential employers reckon it's good enough to not bother employing me.

  • esafak 2 days ago

    But the LLM can already connect things that you cannot, by virtue of its breadth. Some may disagree, but I think it will soon go deeper too.

  • anthk 2 days ago

    So impressive that every complex SUBLEQ program I've tried with an LLM has failed really fast.

6LLvveMx2koXfwn 2 days ago

Received 01 April 2024

Accepted 26 March 2025

Published 20 May 2025

Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.

  • eesmith 2 days ago

    How so?

    To me it looks like the paper was submitted last year but the peer reviewers identified issues with the paper which required revision before the final acceptance in March.

    We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.

    Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems, and didn't reflect the full diversity of available tools.

    In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.

    How has that work been made obsolete?

    • bufferoverflow 2 days ago

      How so? All the models they've tested are obsolete, multiple generations behind high-end versions.

      (Though even these obsolete models did better than the best humans and domain experts).

      • eesmith 2 days ago

        As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark which can be used to test new models.

        Good benchmark development is hard work. The paper goes into the details of how it was carried out.

        Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.

        You could also use their paper to help understand how to develop a new benchmark, perhaps to overcome some limitations in the benchmark.

        That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks.

  • dawnofdusk a day ago

    Fast-moving field? This is a chemistry paper, not an ML paper. ML people have their conferences, which run on much more abridged timeframes.

  • rotis 2 days ago

    Yes, this paper and many others will be forgotten as soon as they leave the front page. Afterwards, no one here refers to articles like these. People just talk about anecdotes and personal experiences. Not that I think this is bad.

  • Jimmc414 2 days ago

    shows the value of preprint servers like arxiv.org and chemrxiv.org

pu_pe 2 days ago

Nice benchmark but the human comparison is a little lacking. They claim to have surveyed 19 experts, though the vast majority of them have only a master's degree. This would be akin to comparing LLM programming expertise to a sample of programmers with less than 5 years of experience.

I'm also not sure it's a fair comparison to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice-versa. This is what they did here, it seems.

  • KSteffensen 2 days ago

    I'll get some downvotes for this, but the difference between a PhD and a master's degree is mostly work experience, plus an element of workload hazing and snobbery.

    Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.

    • 698969 2 days ago

      I think the breadth vs. depth thing applies here as well; the PhD will know more about the topic they're researching, of course.

    • eesmith 2 days ago

      Sure, but all we know is that these "13 have a master’s degree (and are currently enroled in Ph.D. studies)". We only know they have at least "2 years of experience in chemistry after their first university-level course in chemistry."

      How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?

sgt101 2 days ago

Also, books. Books are really good for finding knowledge!

Seriously, viewing LLMs as a cultural technology casts them as a super-interactive indexing system. I find that's a useful lens for understanding this kind of study.

gavinray 2 days ago

I asked several LLMs, after jailbreaking them with prompts, to provide viable synthesis routes for various psychoactive substances, and they did a remarkable job.

This was neat to see but also raised some eyebrows from me. A clever kid with some pharmacology knowledge and basic organic chemistry understanding could get up to no good.

Especially since you can ask the model to use commonly available reagents + precursors and for synthesis routes that use the least amount of equipment and glassware.

  • Workaccount2 2 days ago

    You need a decent amount of experience to make psychoactive substances. Chemistry is one of those things that looks like you just follow the steps, but in practice requires a ton of intuition and "feeling it". You can see this if you watch NileRed on youtube, he is a pretty experienced chemist, and even then still flops all the time trying to replicate reactions right out of the book.

    Besides, the books PiHKAL and TiHKAL lay out how to make most psychoactive substances, and those books have been online for free for decades now.[1][2] Maybe there are easier routes and recipes with easier-to-acquire precursors, but I doubt those would be hard to find. The hardest part by far is the chemistry intuition.

    [1]https://erowid.org/library/books_online/pihkal/pihkal.shtml [2]https://erowid.org/library/books_online/tihkal/tihkal.shtml

    • magicalhippo a day ago

      > You can see this if you watch NileRed on youtube

      Or Extractions & Ire[1], along with his other channel Explosions & Fire[2], which is a PhD student trying to do chemistry in his shed, literally, using stuff you can get from a well-stocked hardware store or such.

      Often the steps seem straightforward, but there are details in the papers that are not covered, or the contaminants from using some name-brand household product rather than a pure source screw it up.

      Still, his videos are usually quite entertaining regardless of results.

      [1]: https://www.youtube.com/@ExtractionsAndIre

      [2]: https://www.youtube.com/@explosionsandfire

    • gavinray a day ago

      TiHKAL and PiHKAL are full of syntheses that require equipment and reagents far beyond what a hobbyist would be able to source.

      There are various "one-pot" techniques for certain compounds if one is sufficiently clever.

      For example, a certain cathinone can be produced by combining ephedrine/pseudoephedrine with a household product that reduces secondary alcohols to ketones and letting it sit.

  • dylan604 2 days ago

    My limited knowledge of both chemistry and LLMs tells me that subtly incorrect chemistry can have disastrous effects, while subtly incorrect output is an LLM superpower, which suggests that this is precisely the inevitable outcome.

  • refurb a day ago

    What LLMs?

    I’m a chemist, and I asked one to show me the structure of a common molecule and it kept getting it really wrong.

marcodiego 2 days ago

> [..] models are [...] limited in [...] ability to answer knowledge-intensive questions [...], they did not memorize the relevant facts. [...] This is probably because the required knowledge cannot easily be accessed via papers [...] but rather by lookup in specialized databases [...], which the humans [...] used to answer such questions [...]. This indicates that there is [...] room for improving [...] by training [...] on more specialized data sources or integrating them with specialized databases.

> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.

Does that mean the world of chemists will be eaten by LLMs? Will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.

  • X6S1x6Okd1st 2 days ago

    It's increasingly looking like if you're young enough most knowledge work will be eaten by LLMs (or the thing that comes next) within your lifetime.

    Hopefully we'll see humans assisted by AI & induced demand for a good while, but the idea of people working unassisted in knowledge work is gonna go the way of artisan clothing.

    • hooverd 2 days ago

      so much for those birth rates

AvAn12 2 days ago

How much of this is because Scale AI and others have had human “taskers” create huge amounts of domain-specific content for OpenAI and other foundation model providers?

fuzzfactor 4 days ago

Nothing to see here unless you have some kind of unsatisfied interest in the future of AI :\

This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)

If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)

Same thing with the industrial environment, some people have just been away from it for too long regardless of how much familiarity they once had. You need to brush up, sometimes the same plant is like a whole different world if you haven't been back in a while.

  • mistrial9 2 days ago

    BASF Group - will they speak in public? Probably not, given what is at stake, IMHO.