Raphaël Millière @raphaelmilliere

AI & Cognitive Science @UniofOxford @EthicsInAI Fellow @JesusOxford @raphaelmilliere.com on 🦋
Blog: https://t.co/2hJjfShFfr raphaelmilliere.com
Oxford, UK
Joined May 2016

Tweets: 3K · Followers: 11K · Following: 3K · Likes: 8K

Raphaël Millière @raphaelmilliere

6 days ago

@dubova_marina @cogsci_soc Congrats Marina!

Raphaël Millière @raphaelmilliere

a week ago

Great work! See also arxiv.org/abs/2603.05414 from @LedermanHarvey & @kmahowald This is a nice cautionary tale about Morgan's canon in interpretability: "introspection" here is closer to anomaly detection with confabulation than to direct/privileged access to injected content.

Shauli Ravfogel @ravfogel

a week ago

1/ Can LLMs introspect, i.e., reason about their internal states? Recent work claims LLMs notice when their "thoughts" get tampered with, and can report their content. We looked closely and we think it's too early to say that. Work led by @shashwat_s19 , with @tallinzen and me.

Raphaël Millière @raphaelmilliere

a week ago

@GoukiMinegishi Thanks! I'll be in Seoul, we should chat

Raphaël Millière @raphaelmilliere

2 weeks ago

Some brief comments on the “meat computer” metaphor for humans in today’s New York Times: nytimes.com/2026/05/24/bus…

Raphaël Millière @raphaelmilliere

2 weeks ago

I still occasionally hear people claim that LLMs are hilariously bad at arithmetic. Another reminder that it's not 2022 anymore.

cozyblaze @cozyblazex

2 weeks ago

I redid the multi-digit multiplication experiment, now with gpt-5.5. With medium reasoning and 7 samples each cell, it pretty much aced the test with 99.46% accuracy. The model had no tools to call and had to rely on its reasoning. Can it go further? (1/4)

Raphaël Millière @raphaelmilliere

3 weeks ago

News to me! (from this slopfest: startupresearcher.com/news/h-company…)

Raphaël Millière @raphaelmilliere

3 weeks ago

@nikhil07prakash @GoodfireAI Congrats! Excited to see what you work on there

Raphaël Millière @raphaelmilliere

3 weeks ago

@francoisfleuret @TMoldwin What do you mean by “knowledge”? 🙃

Raphaël Millière @raphaelmilliere

3 weeks ago

@karinavold @TorontoSRI Thanks for having me!

Kanishka Misra 🌊 @kanishkamisra

3 weeks ago

New opinion piece on the interface between research on concepts and categories in minds vs. in neural network LMs! I take the position that there is much to be learned from this interface (e.g., learning about concepts from language alone) and outline some directions for future.

Aryaman Arora @aryaman2020

4 weeks ago

all mech interp people are bought into causality, this criticism is very lazy as of ~2 years ago. since this is a subtweet of NLAs, it is worth pointing out that their steering experiments on the poetry and eval awareness tasks *do* test for (in those cases) causality!

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు) @rao2z

4 weeks ago

Guys, stop pestering Mech Interp researchers about causality please! It's this inexplicable obsession with causality that made us lose beautiful sciences like Astrology, Palmistry and Phrenology! 😡

Raphaël Millière @raphaelmilliere

4 weeks ago

@littmath POV you're Spinoza

Aryaman Arora @aryaman2020

4 weeks ago

pov: you are a natural language autoencoder and you are aware you are being subject to evals by Redwood Research. do you fake writing out a coherent cot or truthfully say "the math problem is giving me 92ish vibes"?

Ryan Greenblatt @RyanPGreenblatt

4 weeks ago

How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)

Raphaël Millière @raphaelmilliere

4 weeks ago

@elyasbuilds I like activation steering as much as the next guy, but this isn't what I was referring to: x.com/raphaelmillier…

Raphaël Millière @raphaelmilliere

4 weeks ago

@jatin_n0 Mostly a joke, it's a cool paper! yes the planning result is causal but only looking at total effect (i.e. an NLA-derived resid stream edit changes the output). I was referring to causal effect on the model's downstream computations, not anything inside/after the autoencoder. 1/2

Raphaël Millière @raphaelmilliere

4 weeks ago

Anthropic @AnthropicAI

4 weeks ago

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Raphaël Millière @raphaelmilliere

4 weeks ago

@jatin_n0 An additive AR-difference vector can change the output while acting as a broad steering perturbation without showing that the described content actually maps onto the operative feature in the model's putative "rhyme-planning" circuit 3/3

Raphaël Millière @raphaelmilliere

4 weeks ago

@jatin_n0 What's missing is evidence about causal mediation: whether the NLA-described "rabbit plan" is the variable later components read, whether the edit produces a coherent "mouse plan" in later layers/tokens, whether ablating/patching intermediate states blocks or restores the effect 2/