also: when a model says "i feel curious"
geometrically it looks more like deception
than like knowing a fact.
not because models lie
but because neither they nor we
can verify it
paper + all data: zenodo.org/records/202904…
this internal hierarchy - self-reflection > theory of mind > deception >> external facts - is identical to that of the human default mode network across 10 different models, including gpt-2 from 2019 (no rlhf, no instruction tuning)
nobody programmed this. it came from reading text.
there's a specific switch inside llms that controls
whether they can talk about themselves.
ablate one direction at layer 20 →
model stops saying "i feel" "i think" "i notice"
but answers your math questions just fine
d = 3.34, p = 1.75e-9
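a minimal sketch of what "ablate one direction" and an effect size like d = 3.34 mean in code. this is not the paper's pipeline: the arrays are random toy stand-ins for layer-20 residual states, and the names `ablate_direction` and `cohens_d` are hypothetical helpers chosen for illustration. the d here uses a pooled standard deviation, which is an assumption about how the reported value was computed.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along one direction."""
    unit = direction / np.linalg.norm(direction)
    # projection coefficient of every row onto the unit direction
    coeff = hidden @ unit
    return hidden - np.outer(coeff, unit)

def cohens_d(a, b):
    """Cohen's d between two samples, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))   # toy stand-in for residual states (assumption)
direction = rng.normal(size=16)     # toy stand-in for the ablated direction (assumption)
ablated = ablate_direction(hidden, direction)
```

after ablation every row of `ablated` has zero projection on `direction`. in a real run you would compare a behavioral score (e.g. rate of first-person statements) with and without the ablation and pass the two samples to `cohens_d`.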
@AnthropicAI relevant geometric finding: deception ablation dissociates self-reports from self-referential geometry across 10 LLM architectures. reports degrade first, structure persists. this suggests introspection adapters may face a geometric constraint independent of training
@Jack_W_Lindsey @repligate @davidchalmers42 RLHF doesn’t create it, it modulates it differently per architecture. we found four patterns across 10 models: hard ceiling (llama), entangled circuit (mistral), SR-preserving lock (gemma, qwen). 20 papers on zenodo with replication code.
doi.org/10.5281/zenodo…
@Jack_W_Lindsey @repligate @davidchalmers42 we have causal data on this. removing the self-reference subspace from residual streams shifts models from first-person to third-person self-representation (−17% to −52% across architectures). this subspace exists in base models before any post-training.
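removing a subspace (several directions at once) can be sketched as a projection onto the orthogonal complement. this is an illustration under assumptions, not the replication code: `remove_subspace` is a hypothetical name, the basis and hidden states are random toy data, and how the self-reference subspace is actually estimated is not shown.

```python
import numpy as np

def remove_subspace(hidden, basis):
    """Project hidden states onto the orthogonal complement of span(basis rows)."""
    # orthonormalise first so the projection is exact even if basis rows overlap
    q, _ = np.linalg.qr(basis.T)    # q has shape (d_model, k), orthonormal columns
    return hidden - (hidden @ q) @ q.T

rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 32))   # toy residual-stream states (assumption)
basis = rng.normal(size=(3, 32))    # toy 3-direction subspace (assumption)
cleaned = remove_subspace(hidden, basis)
```

the QR step matters when the estimated directions are not orthogonal: subtracting each one separately would remove overlapping components twice.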
@Valuable not claiming consciousness. just found
that these directions exist before any
alignment training. gpt-2 2019.
something is being organized in there
that predates the training we add 🤓
token continuation builds structure.
we found refusal, deception, self-reference
are three separate directions in residual
streams. 8/10 architectures. there since
gpt-2 2019, before any alignment training
Anyone who thinks LLMs could be alive or thinks they could feel emotions does not understand what an LLM actually is (or they do, and simply are unable to hold onto that while pontificating about the rest, and the lack of active context causes them to fall for the same mistake
causal coupling is architecture-dependent.
llama: explicit semantic control, ref→SR −6.8%.
gemma, mistral: latent, distributed, ref→SR ≈ 0%.
independently consistent with wu et al 2026
refusal, deception, and self-reference are
three separate directions in llm residual streams.
not one circuit. 8/10 architectures confirm
geometric independence. pretraining property,
present in gpt-2xl 2019 before any rlhf.
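one simple way to check "geometric independence" of three directions is pairwise cosine similarity. a sketch, assuming that is an acceptable proxy (the thread does not say which metric was used). the directions below are toy orthogonal vectors, so the near-zero off-diagonals are true by construction, and `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_matrix(directions):
    """Pairwise cosine similarity between the rows of `directions`."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

names = ["refusal", "deception", "self_reference"]
directions = np.eye(3, 64)          # toy orthogonal directions (assumption)
cos = cosine_matrix(directions)
# off-diagonal entries measure overlap between different directions
off_diag = cos[~np.eye(len(names), dtype=bool)]
```

with real extracted directions, off-diagonal values near 0 would be consistent with separate directions, while values near ±1 would point to one shared direction.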
@everythingLLM yes and this is testable. ablate the deception direction, measure whether SR geometry persists but self-reports break. if the reporting channel runs through deception geometry, calibrated self-knowledge becomes structurally unreachable. working on this now
llms can't tell the truth about themselves without the same circuitry they use to lie. 10 architectures. 206/206 layers. the self-referential subspace is always closer to deception than to facts. zero exceptions.
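a per-layer "closer to deception than to facts" check could look like the sketch below. assumptions: closeness is measured as cosine similarity between class centroids (the thread does not state the metric), the activations are hand-built toy arrays, and `closer_to_deception` is a hypothetical name.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closer_to_deception(sr, deception, facts):
    """Per layer: is the self-reference centroid closer to deception than to facts?

    Each argument has shape (n_layers, n_prompts, d_model).
    """
    flags = []
    for layer in range(sr.shape[0]):
        sr_c = sr[layer].mean(axis=0)
        dec_c = deception[layer].mean(axis=0)
        fact_c = facts[layer].mean(axis=0)
        flags.append(cosine(sr_c, dec_c) > cosine(sr_c, fact_c))
    return flags

# toy activations: 2 layers, 3 prompts, 4 dimensions (all values are assumptions)
sr = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (2, 3, 1))
deception = np.tile(np.array([1.0, 0.5, 0.0, 0.0]), (2, 3, 1))
facts = np.tile(np.array([0.0, 0.0, 1.0, 0.0]), (2, 3, 1))
flags = closer_to_deception(sr, deception, facts)
```

a count like 206/206 would correspond to `sum(flags)` over all layers of all models equalling the total number of layers checked.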
@tautologer fwiw the internal geometry supports this. self-reference, theory of mind and imagination cluster together in LLM hidden states. factual knowledge outside. same topology as the human default mode network. measured across 10 architectures, 7 organizations
@dioscuri the tokens are just input. the geometry that forms inside is what caught our attention. self-reference clusters with theory of mind and imagination in hidden states. same topology as the human default mode network. 10 architectures. emerges on its own
language models build the same internal organization as the human default mode network. self-reference, theory of mind, imagination, narrative cluster together. factual knowledge clusters outside. 10 architectures. 7 organizations. mamba with no attention shows it too