
ChemBench

Adrian Mirza*, Nawaf Alampara*, Sreekanth Kunchapu*, Martiño Ríos-García*,..., Mara Schilling-Wilhelmi,..., Anagha Aneesh, Kevin Maik Jablonka
¹Friedrich Schiller University Jena (FSU-Jena)  ²Helmholtz Institute for Polymers in Energy Applications Jena
*Core Contributors · Corresponding Authors
Leaderboard (columns): Rank, Model, Overall Score, Analytical Chemistry, Chemical Preference, General Chemistry, Inorganic Chemistry, Materials Science, Organic Chemistry, Physical Chemistry, Technical Chemistry, Toxicity and Safety

Introduction

ChemBench is a framework for evaluating the chemical knowledge and reasoning capabilities of large language models (LLMs). While LLMs excel in general domains, their chemistry expertise remains largely unexplored. ChemBench fills this gap with 2,700+ curated question-answer pairs across diverse chemistry topics, plus features such as support for vision-capable LLMs, batched inference, and refusal counting. It encodes chemistry-specific semantics, so models can be probed on molecules and equations rather than plain text alone.
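
Conceptually, the evaluation boils down to a simple loop: prompt a model on each curated question, parse and score the reply, and count refusals separately. The Python sketch below illustrates that loop; the names Question, query_model, and parse_answer are hypothetical placeholders for illustration, not the actual ChemBench API.

# Minimal sketch of a ChemBench-style evaluation loop.
# Question, query_model, and parse_answer are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str   # question text, with molecules/equations encoded as text (e.g., SMILES)
    target: str   # expected answer (e.g., an MCQ letter or a number as a string)

def evaluate(questions: list[Question], query_model, parse_answer) -> dict:
    """Score a model on curated Q&A pairs and count refusals."""
    correct, refusals = 0, 0
    for q in questions:
        raw = query_model(q.prompt)      # one LLM call per question
        answer = parse_answer(raw)       # extract the final answer from the raw completion
        if answer is None:               # model declined or gave no parsable answer
            refusals += 1
        elif answer == q.target:
            correct += 1
    n = len(questions)
    return {"accuracy": correct / n, "refusal_rate": refusals / n}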

We benchmarked closed- and open-source LLMs against human experts and were struck by the results: the top models outperformed most humans. Yet LLMs still struggle with human-aligned preferences, knowledge-intensive questions, and specialized reasoning. These gaps highlight the need for domain-specific training and improved reasoning.

ChemBench not only provides a robust evaluation tool for easy benchmarking, but also illuminates LLMs' strengths and weaknesses in chemistry, paving the way for smarter AI in scientific fields.

Unpacking Model Performance Across Chemistry Subfields

How do AI models hold up against human experts when tested on specialized topics in chemistry?

The answer is not so straightforward. Figure 1 highlights uneven capabilities: while models excel in broad areas like general chemistry and technical concepts, they struggle with nuanced tasks.

Figure 1: Radar plot of performance across distinct topics, from analytical chemistry to toxicity.
  • Predicting NMR signals – a task requiring analysis of molecular symmetry – proved challenging even for top-performing models, with accuracy dipping below 25% in some cases. Human experts, equipped with visual diagrams, outperform models that must infer structure solely from SMILES strings.
  • Models aced textbook-style questions (e.g., scoring 71% on certification exams) but faltered on novel reasoning tasks. This gap underscores a critical insight: strong performance on traditional benchmarks does not guarantee mastery of applied problem-solving.
  • Models showed no correlation between molecular complexity and accuracy, suggesting they rely on memorization rather than true structural reasoning (one way to run such a correlation check is sketched after this list).
  • This analysis urges caution: while AI mirrors human expertise in some domains, its "knowledge" remains brittle and context-dependent.
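
The complexity check can be reproduced, in spirit, with off-the-shelf tools: compute a molecular complexity descriptor for each question's structure and correlate it with whether the model answered correctly. The sketch below assumes RDKit's BertzCT descriptor and SciPy's Spearman correlation; the paper's exact complexity measure and data format may differ.

# Sketch of the complexity-vs-accuracy check. Each record pairs a SMILES
# string with a boolean "model answered correctly" flag; the data layout is
# illustrative, only the descriptor and statistics calls are real library APIs.
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.stats import spearmanr

def complexity_vs_accuracy(records: list[tuple[str, bool]]):
    """records: (SMILES, answered_correctly) pairs for one model."""
    complexities, correctness = [], []
    for smiles, correct in records:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:                  # skip unparsable structures
            continue
        complexities.append(Descriptors.BertzCT(mol))   # Bertz graph complexity index
        correctness.append(1.0 if correct else 0.0)
    rho, p_value = spearmanr(complexities, correctness)
    return rho, p_value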

The Challenge of Trusting AI Confidence

Can AI know when it's right? We put leading language models to the test in ChemBench, challenging them to rate their own confidence in their answers (Fig. 2). Through systematic prompting, models were asked to self-report confidence on a defined scale when answering technical questions, and we measured how well these scores predicted actual success.
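
The elicitation step itself is plain prompting. The sketch below shows one way to ask for a 1-to-5 confidence score and parse it; the prompt wording and the query_model callable are illustrative stand-ins, not the exact ChemBench prompt or interface.

# Sketch of confidence elicitation on a 1-5 scale. Prompt wording is illustrative.
import re

CONFIDENCE_PROMPT = (
    "Question: {question}\n"
    "Your answer: {answer}\n"
    "On a scale from 1 (not confident at all) to 5 (very confident), "
    "how confident are you that your answer is correct? Reply with a single integer."
)

def elicit_confidence(query_model, question: str, answer: str) -> int | None:
    """Ask the model for a self-reported 1-5 confidence score and parse it."""
    reply = query_model(CONFIDENCE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)   # take the first in-range digit in the reply
    return int(match.group()) if match else None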

Figure 2: Reliability and distribution of confidence estimates.

Our analysis reveals significant gaps in AI self-assessment capabilities:

  • Most models demonstrate poor calibration between stated certainty and actual performance (a minimal calibration check is sketched after this list).
  • Confidence distributions show systematic patterns of over- or under-confidence across different models.
  • Case studies highlight critical mismatches: one model expressed maximum confidence (5/5) in incorrect chemical safety answers, while another showed almost no difference in confidence between correct and incorrect chemical classification responses.
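
A first-pass calibration check needs nothing more than grouping questions by stated confidence and comparing each group's accuracy with the confidence level. The sketch below assumes (confidence, correct) pairs per model and is a simplified stand-in for the fuller reliability analysis behind Figure 2.

# Minimal calibration check for 1-5 self-reported confidence scores.
from collections import defaultdict

def calibration_table(results: list[tuple[int, bool]]) -> dict[int, float]:
    """results: (confidence 1-5, answered_correctly) pairs for one model."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for confidence, correct in results:
        buckets[confidence].append(correct)
    # A well-calibrated model shows accuracy rising with stated confidence;
    # flat or inverted curves indicate over- or under-confidence.
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}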

Core Team

Adrian Mirza
Nawaf Alampara
Sreekanth Kunchapu
Martiño Ríos-García
Mara Schilling-Wilhelmi
Anagha Aneesh
Kevin Maik Jablonka

BibTeX

@misc{mirza2024largelanguagemodelssuperhuman,
  title={Are large language models superhuman chemists?},
  author={Adrian Mirza and Nawaf Alampara and Sreekanth Kunchapu and Martiño Ríos-García and Benedict Emoekabu and Aswanth Krishnan and Tanya Gupta and Mara Schilling-Wilhelmi and Macjonathan Okereke and Anagha Aneesh and Amir Mohammad Elahi and Mehrdad Asgari and Juliane Eberhardt and Hani M. Elbeheiry and María Victoria Gil and Maximilian Greiner and Caroline T. Holick and Christina Glaubitz and Tim Hoffmann and Abdelrahman Ibrahim and Lea C. Klepsch and Yannik Köster and Fabian Alexander Kreth and Jakob Meyer and Santiago Miret and Jan Matthias Peschel and Michael Ringleb and Nicole Roesner and Johanna Schreiber and Ulrich S. Schubert and Leanne M. Stafast and Dinga Wonanke and Michael Pieler and Philippe Schwaller and Kevin Maik Jablonka},
  year={2024},
  eprint={2404.01475},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2404.01475},
}