Publications · Jun 10 · 92% confidence

RealClawBench: New Benchmark Measures AI Agent Performance on Real Developer Tasks

Center 100%

2 sources

Two independent research papers evaluate AI coding and reasoning agents on real-world biological data pipelines, finding that agents can handle discrete, well-defined tasks but struggle with open-ended scientific judgment and end-to-end workflows. One study tested agents on a fly optogenetics pipeline, while the other used multi-omic single-cell datasets across 11 cancer types. The findings matter because they provide empirical benchmarks for where agentic AI can realistically accelerate science and where human domain expertise remains indispensable.

Two preprints published in June 2025 independently assess the capabilities of large language model-based AI agents on complex biological research pipelines, arriving at broadly consistent conclusions. The arXiv study tested general-purpose coding agents on a fly optogenetics data-to-discovery pipeline, using tasks larger than existing benchmarks and evaluation criteria set by domain experts; agents succeeded at individual pipeline stages but failed to complete the full end-to-end workflow correctly. The bioRxiv study deployed a purpose-built Multistep Multimodal Multiomic Agentic (M3A) Framework on multi-omic single-cell cancer datasets, evaluating agents on cell-type annotation, hypothesis generation, and human-AI copilot settings across 11 cancer types.

Both studies identify a common bottleneck: agents perform well on structured, deterministic tasks but falter when required to exercise scientific judgment, interpret ambiguous intermediate outputs, or synthesize findings across analytical steps. The arXiv paper further highlights challenges largely absent from standard benchmarks, including computational resource management and generalization to large held-out datasets. The bioRxiv work found that human involvement, especially from domain experts, meaningfully improved outcomes, particularly for methodological guidance and biological synthesis.

Together, the studies suggest that stage-level automation is tractable in the near term, while fully autonomous scientific discovery pipelines remain beyond current agent capabilities.

What's missing

Neither study reports direct comparisons against a common standardized benchmark or against each other, making it difficult to assess whether performance differences across the two pipelines reflect agent capability, task difficulty, or evaluation design. Neither paper discusses the computational costs or carbon footprint of running these agents at scale, which is relevant for practical adoption in resource-constrained research settings. Both studies also focus on specific biological domains (optogenetics and cancer multi-omics), leaving open how well findings generalize to other scientific fields.

What different sources said

bioRxiv · Center
Evaluating agentic AI for biological discovery in autonomous and copilot settings
arXiv q-bio · Center
PRAXIS: Case-distilled and code-verified AI agents for biological research

Publications

Gut Bacteria Enzyme Found to Break Down Heat-Processed Food Compounds, Producing Novel Biogenic Amines

Researchers have discovered that an enzyme in common gut bacteria can degrade N-epsilon-carboxymethyllysine (CML), a compound formed during thermal food processing, producing previously unknown biogenic amines. The enzyme, ornithine decarboxylase SpeC from enterobacteria, acts on CML and related modified lysine derivatives through a low-level 'underground' catalytic activity. This finding suggests a previously unrecognized communication axis between thermally processed dietary compounds and gut microbial physiology, with potential implications for host health.

1 source · Jun 13

Publications

Full-Length Gene Sequencing Reveals Two Distinct Bacterial Communities in Black-Legged Ticks Expanding Into Canada

Researchers used Oxford Nanopore full-length 16S rRNA gene sequencing to characterize the microbiome of Ixodes scapularis black-legged ticks collected in Nova Scotia, Canada, distinguishing between tick-adapted bacteria and environmentally acquired bacteria. The study comes as I. scapularis — the primary vector of Lyme disease — is rapidly expanding northward into Canada due to climate change. The findings suggest that environmentally derived bacteria in tick microbiomes are not mere contamination, which has implications for how tick microbiome data is collected and interpreted across surveillance studies.

1 source · Jun 13

Publications

Study Identifies Metabolic Link Between Cell Envelope Stress and Biofilm Formation in Bacteria

Researchers have discovered that the metabolite acetyl-CoA directly inhibits enzymes that degrade the bacterial signaling molecule c-di-GMP, connecting cell envelope biosynthesis stress to biofilm formation in Pseudomonas aeruginosa. The study found that sub-inhibitory concentrations of antibiotics targeting early peptidoglycan biosynthesis — but not other antibiotic classes — elevate c-di-GMP levels by reducing phosphodiesterase activity, with acetyl-CoA competing for the enzyme active site. Because the relevant enzyme domain is broadly conserved across bacterial species, this checkpoint mechanism may be widespread and could have implications for understanding antibiotic-induced biofilm responses.

1 source · Jun 13
