Publications | Jun 11 | 83% confidence

DataEvolver: New System Automatically Prepares Training Data for Large Language Models

1 source

Researchers have introduced DataEvolver, a self-evolving data preparation system that automatically constructs and refines pipelines to transform raw data into high-quality training data for large language models. Unlike existing methods that rely on fixed pipelines or manual human instructions, DataEvolver uses a multi-level mechanism operating at both the operator and pipeline levels to iteratively improve data quality. Experiments across seven benchmarks show an average 10% gain in downstream LLM performance compared to training on unprocessed data, suggesting a path toward iterative co-evolution of models and their training data.

DataEvolver is a newly proposed automatic data preparation framework designed to reduce the costly manual curation typically required to produce high-quality training data for large language models. The system operates through a multi-level self-evolving mechanism: at the operator level, it incrementally expands a set of data transformation operations while resolving dependency conflicts; at the pipeline level, it instantiates logical plans into executable code and refines them through a feedback loop that minimizes the gap between prepared data and high-quality reference examples. This approach distinguishes DataEvolver from prior methods, which depend on predefined pipelines or customized human instructions and therefore struggle to adapt to diverse data distributions. Evaluated on seven benchmarks, the system achieved an average 10% improvement in downstream LLM performance relative to training on original, unprocessed data. The authors frame the results as evidence for a broader opportunity: the iterative co-evolution of LLMs and the data used to train them. The preprint was submitted to arXiv in early June 2026 and is categorized under both Databases and Artificial Intelligence.
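The two levels described above can be illustrated with a toy sketch. It is not the authors' implementation: the operator names, the `RUNS_AFTER` ordering constraints, the Jaccard-based `gap` measure, and the greedy search in `evolve` are all assumptions made for illustration, since the summary does not specify how DataEvolver represents operators, resolves conflicts, or scores the gap to reference examples. Here, dependency conflicts are handled with a plain topological sort, and the pipeline is refined by adding whichever operator most reduces the gap.

```python
from graphlib import TopologicalSorter

# Hypothetical operator library: name -> function over a list of text records.
OPERATORS = {
    "strip_whitespace": lambda recs: [r.strip() for r in recs],
    "drop_empty": lambda recs: [r for r in recs if r],
    "lowercase": lambda recs: [r.lower() for r in recs],
}

# Assumed ordering constraints: operator -> operators that must run before it.
# Example: whitespace-only strings only become empty after stripping.
RUNS_AFTER = {
    "drop_empty": {"strip_whitespace"},
    "lowercase": {"strip_whitespace"},
}


def order_operators(selected):
    """Operator level: order the chosen operators so constraints are respected."""
    chosen = set(selected)
    graph = {name: RUNS_AFTER.get(name, set()) & chosen for name in chosen}
    return list(TopologicalSorter(graph).static_order())


def run_pipeline(plan, records):
    """Execute an ordered plan on a list of records."""
    for name in plan:
        records = OPERATORS[name](records)
    return records


def gap(prepared, reference):
    """Toy gap measure: 1 minus the Jaccard similarity of the distinct records."""
    a, b = set(prepared), set(reference)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def evolve(raw, reference, max_rounds=10):
    """Pipeline level: greedily add the operator that most reduces the gap."""
    selected = set()
    best_gap = gap(raw, reference)
    for _ in range(max_rounds):
        best_candidate = None
        for name in OPERATORS:
            if name in selected:
                continue
            plan = order_operators(selected | {name})
            candidate_gap = gap(run_pipeline(plan, raw), reference)
            if candidate_gap < best_gap:
                best_gap, best_candidate = candidate_gap, name
        if best_candidate is None:
            break  # no remaining operator improves the gap
        selected.add(best_candidate)
    return order_operators(selected), best_gap


RAW = ["  Hello World ", "hello world", "", "Data  ", "data", "   "]
REFERENCE = ["hello world", "data"]

plan, final_gap = evolve(RAW, REFERENCE)
print(plan, final_gap)
```

In this toy setting the loop stops once no remaining operator brings the prepared data closer to the reference set. A real system would need a far richer gap signal than exact string overlap, and greedy selection can stall in local minima, which is one reason this should be read as a sketch of the idea and not of the paper's method.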

What's missing

As a preprint, the paper has not yet undergone formal peer review. Key open questions include whether the 10% performance gain holds consistently across model scales and architectures, how much computational overhead DataEvolver adds relative to manual curation, and how the system performs when high-quality reference examples are scarce or unavailable.

What different sources said

arXiv cs.AI (Center)
DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving
