Researchers Develop Audio-LLM Method to Filter Noisy Speech-to-Speech Translation Data
Researchers have developed a technique that uses audio language models (audio-LLMs) to automatically filter noisy or misaligned pairs out of large speech-to-speech translation datasets. The method uses a two-stage approach called Rank-to-Distill: it first generates pseudo-labels without manual annotation, then trains an audio-LLM to make keep/drop decisions directly from raw paired speech. Filtering with this approach improved results by up to +1.4 ASR-BLEU on benchmark datasets.
A new paper accepted to INTERSPEECH 2026 addresses a key challenge in training end-to-end speech-to-speech translation (S2ST) systems: removing noisy and erroneous pairs from large-scale mined corpora. The researchers train an audio language model to classify each speech pair as keep or drop based on acoustic fidelity and cross-lingual semantic consistency. To avoid expensive manual labeling, they use a scalable two-stage Rank-to-Distill strategy: a lightweight ranker first generates pseudo-labels, which then supervise the audio-LLM. Experiments on the CVSS-C and SpeechMatrix benchmarks show consistent improvements over unfiltered baselines, with gains of up to +1.4 ASR-BLEU (BLEU computed on automatic transcripts of the translated speech). Data quality directly affects model robustness, so noisy mined data is a practical bottleneck in scaling speech translation systems.
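The two-stage idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `SpeechPair` fields, the `score_fn` ranker, the keep/drop fractions, and the prompt wording are all hypothetical placeholders chosen for illustration. Stage 1 ranks pairs with a cheap scorer and labels only the confident extremes; stage 2 turns those pseudo-labels into supervised examples that an audio-LLM could be fine-tuned on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SpeechPair:
    # Hypothetical record for one mined source/target utterance pair.
    pair_id: str
    source_audio: str  # path to source-language audio
    target_audio: str  # path to target-language audio


def pseudo_label(
    pairs: List[SpeechPair],
    score_fn: Callable[[SpeechPair], float],
    keep_fraction: float = 0.25,
    drop_fraction: float = 0.25,
) -> List[Tuple[SpeechPair, str]]:
    """Stage 1 (assumed form): rank pairs by a lightweight quality score,
    label the top slice 'keep' and the bottom slice 'drop', and leave the
    uncertain middle unlabeled."""
    if keep_fraction + drop_fraction > 1.0:
        raise ValueError("keep and drop fractions must not overlap")
    ranked = sorted(pairs, key=score_fn, reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    n_drop = int(len(ranked) * drop_fraction)
    labeled = [(p, "keep") for p in ranked[:n_keep]]
    labeled += [(p, "drop") for p in ranked[len(ranked) - n_drop:]]
    return labeled


def to_training_examples(
    labeled: List[Tuple[SpeechPair, str]],
) -> List[Dict[str, str]]:
    """Stage 2 input (assumed form): one supervised example per pseudo-label,
    pairing the two audio clips with the target answer."""
    prompt = "Is this speech pair clean and semantically aligned? Answer keep or drop."
    return [
        {
            "source_audio": pair.source_audio,
            "target_audio": pair.target_audio,
            "prompt": prompt,  # hypothetical instruction text
            "answer": label,
        }
        for pair, label in labeled
    ]


# Toy data: the score is a stand-in for whatever the lightweight ranker outputs.
toy_pairs = [SpeechPair(str(i), f"src_{i}.wav", f"tgt_{i}.wav") for i in range(8)]
toy_scores = {p.pair_id: float(i) for i, p in enumerate(toy_pairs)}

labeled = pseudo_label(toy_pairs, lambda p: toy_scores[p.pair_id])
examples = to_training_examples(labeled)
```

Labeling only the extremes is one plausible way to keep pseudo-label noise low; the source does not say how the ranker's output is thresholded, so the fractions here are purely illustrative.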
What's missing
The paper does not discuss the computational cost or inference latency of the filtering pipeline, nor does it provide a detailed comparison with alternative filtering approaches (e.g., rule-based or other neural methods). It also does not address whether the method generalizes to language pairs beyond those tested.
What different sources said
- arXiv cs.CL: Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data