
Malay Shah
I've ignored most AI benchmarks because they test academic problems. Last week OpenAI finally released something different: a benchmark built on real work, including manufacturing. It's called GDPval, and for the first time I saw manufacturing and business problems that look like what we actually deal with in chemicals and in manufacturing more broadly.
I went through the whole paper because the results were fascinating. And a bit concerning. And maybe exciting depending on what part of manufacturing you work in.
The Numbers That Matter
The short version: LLMs are getting close to human-level performance on complex industrial tasks. Things like building production plans, analyzing supply chain data, and optimizing certain processes.
The methodology stood out: human experts blindly graded AI outputs against other human experts' responses. A 50% win rate would mean the AI matches human expertise. The top model hit 47.6% overall, with 45% on manufacturing operations and 53% on wholesale trade tasks.
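To make the scoring concrete, here is a minimal Python sketch of how a win rate can be computed from blind pairwise verdicts. The verdict labels, the function name win_rate, and the choice to count a tie as half a win are assumptions for illustration, not the paper's exact scoring rules.

```python
from collections import Counter


def win_rate(verdicts, tie_weight=0.5):
    """Share of blind comparisons credited to the AI output.

    verdicts: iterable of "ai", "human", or "tie" (hypothetical labels).
    tie_weight: credit the AI gets for a tie (an assumption; 0.5 splits it).
    """
    counts = Counter(verdicts)
    total = counts["ai"] + counts["human"] + counts["tie"]
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["ai"] + tie_weight * counts["tie"]) / total


# Made-up grader verdicts, for illustration only.
sample = ["ai", "human", "human", "tie", "ai", "human", "ai", "human"]
print(f"win rate: {win_rate(sample):.1%}")
```

With this convention, a model whose outputs are indistinguishable from the human expert's would land at 50%, which is why that number serves as the parity line.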
What surprised me was that Claude Opus 4.1 actually beat OpenAI's models on their own benchmark pretty much across the board in manufacturing. We're seeing similar results internally at Poka Labs.
But getting close to human-level performance doesn't mean we're headed for mass job replacement. The study showed some important gaps where AI still struggles. And those gaps matter a lot depending on what kind of work we're talking about.
Chemical Engineering: Not Yet
If you're a chemical engineer worried about AI taking your job, the recent GDPval numbers from OpenAI should help you sleep better at night. At least for now.
The paper didn't test chemical engineering specifically, but it did test industrial and mechanical engineering tasks. The results? 17% win rate for industrial engineers, 23% for mechanical engineers. I'd bet chemical engineering performs similarly.
Model providers are racing to improve these numbers. By 2026, I wouldn't be surprised if they crack 40% across these evaluations. But even if models reach human parity, I'm still not convinced that leads to widespread engineering replacement.
Why? Look at what's buried in appendix A.2.5 about catastrophic failures. When GPT-5 failed to beat a human, it failed catastrophically 2.7% of the time. We're talking about suggestions that could cause physical harm, responses that insult customers, and answers that completely miss safety considerations.
In chemical engineering, a 2.7% catastrophic failure rate means plants exploding. Regularly.
I don't know whether LLM makers can fix both the low performance and the catastrophic failure rate by next year or in five years, or whether this is a fundamental model limitation. But for many safety-critical tasks, the gap between what today's AI models can do and what's "safe to do" in chemical plants is massive.
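One caveat on the arithmetic: the 2.7% figure is conditional on the model failing to beat the human, so the share of all outputs that are catastrophic also depends on the win rate. Here is a rough sketch of that calculation. It assumes, purely for illustration, that the 2.7% rate carries over to the engineering tasks and win rates quoted above, which the paper figures cited here do not establish. The function names are hypothetical.

```python
def overall_catastrophic_rate(win_rate, catastrophic_given_fail=0.027):
    """P(catastrophic) = P(not beating the human) * P(catastrophic | not beating)."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    return (1.0 - win_rate) * catastrophic_given_fail


def expected_catastrophic(tasks, win_rate, catastrophic_given_fail=0.027):
    """Expected number of catastrophic outputs across a batch of tasks."""
    return tasks * overall_catastrophic_rate(win_rate, catastrophic_given_fail)


# Win rates quoted in this post; pairing them with the 2.7% figure is an assumption.
for role, wr in [("industrial engineers", 0.17), ("mechanical engineers", 0.23)]:
    rate = overall_catastrophic_rate(wr)
    per_thousand = expected_catastrophic(1000, wr)
    print(f"{role}: {rate:.2%} of outputs, about {per_thousand:.0f} per 1,000 tasks")
```

Under these assumptions, roughly 2 in every 100 outputs would be catastrophic. That is lower than the headline 2.7%, but still far too high for safety-critical work.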
Production and Sales: Already Happening
The manufacturing floor is about to look very different. While engineering tasks still have a gap, the data on production and sales tasks is striking.
Production supervisors building manufacturing plans: Claude Opus 4.1 scored 58%. That's above the 50% mark that represents parity with human experts.
Manufacturing sales reps setting prices for technical products in direct sale or distribution settings: 47%. Getting close.
Shipping, receiving, and inventory clerks choosing optimal shipping methods or analyzing inventory returns: 76%. Consistently beating humans in quality and output.
A high score doesn't mean the job disappears. Most of these roles still have physical components AI can't touch. But I'm willing to bet that the nature of these types of jobs will change first in manufacturing. Production management and sales roles will get rewritten, and chemical manufacturers who figure out the cultural shifts to enable this technology will have real competitive advantages in complex markets.
What does this look like in practice? Take pricing. Instead of a sales rep having to understand and quote thousands of SKUs and custom products, the AI handles the heavy lifting of application engineering, technical service, and pricing while the rep focuses on customer relationships.
The technology isn't the bottleneck anymore for these roles. It's figuring out how to weave it into the processes that already exist.
So What Does This Mean?
We're at an inflection point where AI performance varies wildly depending on the type of manufacturing work. Production planning and logistics? AI is already winning. Chemical engineering and safety-critical work? We're nowhere close, and the catastrophic failure problem might be fundamental.
The companies that will win are the ones mapping this jagged frontier and restructuring their organizations to capture the benefits.