AI Models Score Mixed Results on Rigorous Mathematics Test

The "First Proof" project, organized by leading mathematicians, tested large language models on challenging math problems and found that AI solved six to seven of the ten problems essentially correctly, though with significant flaws requiring human intervention. The test was designed to evaluate AI on problems that professional mathematicians actually care about, rather than relying on industry-created benchmarks. The results suggest that while AI shows promise as a research assistant, it remains deeply flawed and requires substantial human oversight to produce usable solutions.
The First Proof project released results from its latest round of rigorous mathematical testing, evaluating large language models from OpenAI and academic institutions on problems curated by professional mathematicians. Six to seven of the ten problems were solved essentially correctly by at least one AI system, with ChatGPT-5.5 Pro solving four to five problems. The testing employed expert graders who evaluated responses using standards similar to academic journals ("accept with minor revisions"), and the evaluation process was accelerated from the typical six-month peer review timeline to two intensive days at Harvard. The results revealed that while AI excels at retrieving obscure references and applying mathematical techniques with tireless persistence, it also produces substantial amounts of incorrect or nonsensical output requiring significant human filtering. The test was deliberately limited to publicly available models to serve the broader community, excluding the opaque internal efforts of AI companies that have historically outperformed public versions.
What different sources said
- Scientific American (Center)
AI scores a ‘C–’ on its hardest math test yet