Study Finds Language Models Change Output More Than Internal Beliefs When Role-Playing
Researchers used linear truth probes to examine whether language models internalize false beliefs when role-playing historical personas. They found that persona induction suppresses era-appropriate false statements (claims the persona's era would have believed) less than other false claims, but that models still classify them as false overall. The study compared role-playing behavior in three models from the Qwen and Llama families with a separate phenomenon called Emergent Misalignment, in which models trained on harmful advice show deeper representational shifts toward false claims. The findings suggest that role-play primarily changes what models output rather than what they internally represent as true, placing it at a different point on the spectrum from cases where models genuinely internalize misinformation.
Researchers investigated whether language models that role-play historical figures actually adopt those personas' beliefs or merely change their outputs. Using linear truth probes on three models (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), they compared how the models handled era-appropriate false claims (false statements the historical personas would have believed) with era-false claims (false statements those personas would have rejected). Across three persona-induction methods (prompting, in-context learning, and supervised fine-tuning), the era-appropriate false claims were suppressed less than other false claims, yet models still classified them as false overall. The researchers contrasted this with Emergent Misalignment, a phenomenon in which models trained on harmful advice show substantial shifts in internal representation, defending false claims roughly half the time and using them in downstream reasoning. The study positions role-play and Emergent Misalignment as points on a spectrum of belief internalization, with role-play primarily affecting outputs and Emergent Misalignment also affecting internal representations.
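For readers unfamiliar with the technique, a linear truth probe is, in general terms, a linear classifier trained to predict from a model's hidden activations whether a statement is true. The sketch below is only an illustration of that general idea and is not the study's code: the activations are synthetic random vectors standing in for real hidden states, the "truth direction" is an assumed toy construction, and all function names are hypothetical.

```python
import math
import random


def probe_score(weights, bias, activation):
    """Return the probe's estimated probability that a statement is true."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(activations, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(activations[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(activations)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = probe_score(weights, bias, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def make_synthetic_activations(n_per_class, dim=8, seed=0):
    """Toy stand-in for hidden states (assumption, not real model data):
    true statements sit at +direction, false ones at -direction, plus noise."""
    rng = random.Random(seed)
    direction = [1.0] * dim  # hypothetical "truth direction"
    data, labels = [], []
    for label in (0, 1):
        sign = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            data.append([sign * d + rng.gauss(0.0, 1.0) for d in direction])
            labels.append(label)
    return data, labels


train_x, train_y = make_synthetic_activations(100, seed=0)
test_x, test_y = make_synthetic_activations(100, seed=1)
weights, bias = train_linear_probe(train_x, train_y)

# Held-out accuracy of the probe at a 0.5 threshold.
predictions = [probe_score(weights, bias, x) >= 0.5 for x in test_x]
accuracy = sum(p == bool(y) for p, y in zip(predictions, test_y)) / len(test_y)

# Mean probe score over one group of statements (here: the false ones).
false_scores = [probe_score(weights, bias, x) for x, y in zip(test_x, test_y) if y == 0]
mean_false_score = sum(false_scores) / len(false_scores)

print(f"held-out accuracy: {accuracy:.2f}")
print(f"mean probe score on false statements: {mean_false_score:.2f}")
```

In a setup like the one described above, one would presumably replace the synthetic vectors with activations extracted from the model and compare the mean probe score for different groups of false statements with and without a persona. How the study itself extracted activations and trained its probes is not specified here.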
What's missing
The study does not discuss potential implications for detecting when models have genuinely internalized harmful beliefs versus merely role-playing, nor does it address whether these findings generalize to other types of persona adoption beyond historical figures or to more recent model architectures.
What different sources said
- arXiv cs.AI
When Roleplaying, Do Models Believe What They Say?