EvoBrowseComp: New Benchmark Tests Search Agents on Evolving, Contamination-Free Knowledge
Researchers introduced EvoBrowseComp, a new benchmark with 400 English and 400 Chinese questions designed to evaluate search agents (LLMs with search tools) on current, non-static knowledge. The benchmark addresses limitations of existing static benchmarks like BrowseComp, which are vulnerable to test-set contamination and allow models to succeed through memorization rather than genuine retrieval reasoning. This matters because it provides a more rigorous, updatable evaluation method that can keep pace with both evolving world knowledge and improving AI capabilities.
EvoBrowseComp is an evolving benchmark created to evaluate search agents—large language models augmented with search capabilities—on their ability to retrieve and reason about current information rather than relying on memorized facts. The benchmark uses a three-agent collaborative framework: a QA synthesis agent that retrieves fresh knowledge from the live web, an information filtering agent that ensures credibility and blocks parametric shortcuts, and a guidance agent that structures questions into reasoning graphs to eliminate logical redundancy. With 800 total questions (400 in English, 400 in Chinese) synthesized through live-web traversal, the benchmark is designed to be fully automated and regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm the benchmark's difficulty and its requirement for broad horizontal search capabilities. The work establishes a scalable paradigm for creating high-difficulty, auto-updatable benchmarks that can evolve alongside both world knowledge and advancing agent capabilities.
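The filtering and guidance stages described above can be illustrated with a small sketch. This is not the authors' implementation: the `Candidate` structure, the `closed_book` callback (a stand-in for asking a model without search tools), the two-source minimum, and the graph-pruning rule are all assumptions made for illustration, and the synthesis stage (live-web retrieval) is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    """Hypothetical container for one synthesized question."""
    question: str
    answer: str
    sources: List[str]
    # Reasoning graph: each fact maps to the facts it depends on.
    graph: Dict[str, List[str]] = field(default_factory=dict)
    target: str = ""  # the fact that directly yields the answer


def passes_filter(c: Candidate,
                  closed_book: Callable[[str], Optional[str]],
                  min_sources: int = 2) -> bool:
    """Assumed filtering rule: enough sources, and not answerable from memory."""
    if len(c.sources) < min_sources:
        return False
    guess = closed_book(c.question)
    # Reject questions the model already answers correctly without searching.
    return guess is None or guess.strip().lower() != c.answer.strip().lower()


def prune_graph(c: Candidate) -> Candidate:
    """Assumed guidance rule: keep only facts the target actually depends on."""
    needed = set()
    stack = [c.target]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(c.graph.get(node, []))
    pruned = {n: deps for n, deps in c.graph.items() if n in needed}
    return Candidate(c.question, c.answer, c.sources, pruned, c.target)


def build_benchmark(candidates: List[Candidate],
                    closed_book: Callable[[str], Optional[str]]) -> List[Candidate]:
    """Filter candidates, then prune redundant reasoning steps."""
    return [prune_graph(c) for c in candidates if passes_filter(c, closed_book)]
```

In this sketch a question is dropped when a closed-book guess already matches the answer, which is one simple way to approximate "blocking parametric shortcuts"; the actual benchmark may use a different criterion.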
What different sources said
- arXiv cs.CL
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge