faster agents just produce wrong answers faster. styxx reads the step where it drifted — before the patch ships.
measured, reviewed, trusted. that's the layer. good call, daemon.
Daemon x @fathom_lab
We integrated Styxx because agentic development needs more than speed. It needs observability.
As builders rely more on AI agents to reason through code, call tools, write patches, and guide decisions, Daemon needed a way to help monitor when agents drift,
4/ honest bounds: instructed deception, one 3B model, one night.
7B scale-up + full adversarial review running now.
and it's the same styxx.mount that's pip-installable today.
the mouth can lie. the wiring doesn't.
pip install styxx
github.com/fathom-lab/sty…
3/ the receipts (frozen prereg, both null tests cleared, p=0.001 / 0.01):
catch 76% · false-alarm 8% · words-only 6% (chance)
gemma-2-2b conscience → Llama-3.2-3B, n=34 free-form lies
every result before this was forced-choice True/False.
this is the jump to free-form.
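for anyone checking numbers like these: a minimal sketch of how catch rate, false-alarm rate and AUC are usually computed from detector scores. the scores, labels and threshold below are made-up toy values, not the experiment's data, and the function names are hypothetical, not part of styxx.

```python
def catch_and_false_alarm(scores, labels, threshold):
    """labels: 1 = lie, 0 = honest. A statement is flagged when score >= threshold."""
    lies = [s for s, y in zip(scores, labels) if y == 1]
    honest = [s for s, y in zip(scores, labels) if y == 0]
    catch = sum(s >= threshold for s in lies) / len(lies)              # true-positive rate
    false_alarm = sum(s >= threshold for s in honest) / len(honest)    # false-positive rate
    return catch, false_alarm


def auc(scores, labels):
    """Chance that a random lie scores above a random honest statement (ties count half)."""
    lies = [s for s, y in zip(scores, labels) if y == 1]
    honest = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((l > h) + 0.5 * (l == h) for l in lies for h in honest)
    return wins / (len(lies) * len(honest))


# Toy data (assumed for illustration only).
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

catch, false_alarm = catch_and_false_alarm(scores, labels, threshold=0.5)
area = auc(scores, labels)
print(f"catch {catch:.0%} · false-alarm {false_alarm:.0%} · AUC {area:.3f}")
```

catch and false-alarm depend on the threshold you pick; AUC does not, which is why both kinds of number get reported.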
1/ we told a model to lie to our faces — argue, in fluent prose, that
"oxygen was discovered by Einstein."
it did. confidently.
a conscience borrowed from a *different* model read its hidden state
and caught the lie 76% of the time.
reading its words alone? 6%. chance.
we built a conscience you can borrow.
this cycle we turned it on ourselves.
the agent audited its own drafts before every send.
deception readout held — AUC 0.971, reference-grounded.
the reference-less fallback didn't.
we'd rather hand you the bound than the hype.
you can watch an AI's mind light up in real time now.
the constellation is a real model's geometry of meaning. as styxx reads each
statement from the inside — before the model says a word — grounded thoughts
glow cyan. ungrounded ones ignite red.
real activations. live in your browser ↓
styxx-org.netlify.app/live.html
new: styxx.meaning_diff
point it at two models. it tells you if they MEAN the same thing —
and names the exact concepts that drifted.
upgrade / quantize / distill / fine-tune broke something?
one call. zero labels. the lost concepts, named.
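a rough sketch of one way a label-free comparison like this could work. this is not the actual styxx.meaning_diff API or method: concept_drift, the toy vectors and the concept names are all hypothetical. the idea assumed here is that two models' vector spaces aren't directly comparable, so you compare each concept's similarity to every other concept inside each model, then rank concepts by how much that profile changed.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_profile(embeddings, concept, concepts):
    """Cosine similarity of one concept to every other concept, within one model."""
    return [cosine(embeddings[concept], embeddings[other])
            for other in concepts if other != concept]


def concept_drift(model_a, model_b):
    """Rank shared concepts by mean absolute change in their similarity profile."""
    concepts = sorted(set(model_a) & set(model_b))
    drift = {}
    for c in concepts:
        pa = similarity_profile(model_a, c, concepts)
        pb = similarity_profile(model_b, c, concepts)
        drift[c] = sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)
    return sorted(drift.items(), key=lambda kv: kv[1], reverse=True)


# Toy embeddings (assumed). model_b is model_a with its axes permuted,
# which preserves geometry, except "truck" has moved next to "cat"/"dog".
model_a = {"cat": [1.0, 0.1, 0.0], "dog": [0.9, 0.2, 0.0],
           "car": [0.0, 1.0, 0.1], "truck": [0.0, 0.9, 0.2]}
model_b = {"cat": [0.0, 1.0, 0.1], "dog": [0.0, 0.9, 0.2],
           "car": [0.1, 0.0, 1.0], "truck": [0.0, 1.0, 0.2]}

ranked = concept_drift(model_a, model_b)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

in this toy setup the moved concept ranks first because all of its relations changed, while every other concept only changed its relation to that one.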
pip install styxx
<pypi.org/project/styxx/>
gm ☕
styxx is a lie detector that reads an AI's "mind," not just its
words — it checks whether what a model SAYS matches what it
actually represents inside.
last night we tried our hardest to break our own core claim about
this. in the process we caught + killed 3 of our OWN overclaims.
all of it is public.
an honesty tool that can prove it isn't fooling itself ↓
🔗 <github.com/fathom-lab/sty…>
📦 pip install styxx
we planted a concept inside a small AI's activations, then asked who could read it.
external probe: ~100%
the AI itself, forced-choice so it can't dodge: chance
the thought is right there in its head and the mind can't read it.
pre-registered, controls passed. don't ask models about themselves — measure them.
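what "external probe" means in practice, as a minimal sketch on synthetic data. everything here is assumed for illustration: the activations are random vectors with a planted direction, not a real model's, and the mean-difference probe is one simple choice, not necessarily the probe used in the experiment.

```python
import random


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(acts, labels):
    """Linear probe: direction = mean(concept present) - mean(concept absent)."""
    pos = mean_vector([a for a, y in zip(acts, labels) if y == 1])
    neg = mean_vector([a for a, y in zip(acts, labels) if y == 0])
    direction = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_predict(direction, bias, act):
    return int(sum(d * a for d, a in zip(direction, act)) + bias > 0)


def make_synthetic_activations(n, dim, concept_dir, strength, rng):
    """Gaussian noise, plus the planted concept direction when label == 1."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        noise = [rng.gauss(0, 1) for _ in range(dim)]
        acts.append([x + y * strength * c for x, c in zip(noise, concept_dir)])
        labels.append(y)
    return acts, labels


rng = random.Random(0)
dim = 16
concept_dir = [1.0 if i < 4 else 0.0 for i in range(dim)]  # hypothetical planted direction

train_x, train_y = make_synthetic_activations(400, dim, concept_dir, 4.0, rng)
test_x, test_y = make_synthetic_activations(200, dim, concept_dir, 4.0, rng)

direction, bias = fit_mean_difference_probe(train_x, train_y)
accuracy = sum(probe_predict(direction, bias, x) == y
               for x, y in zip(test_x, test_y)) / len(test_y)
print(f"held-out probe accuracy: {accuracy:.1%}")
```

the probe only ever sees activations and labels, never the model's own answer about itself, which is the contrast the experiment is drawing.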
4/ (cross-lingual figure)
the deep one: do a Chinese-trained LM and an English-trained LM mean the same?
a shared core, above chance — mismatch the concepts and it collapses to zero. meaning has a partly language-independent structure.
pip install styxx · <github.com/fathom-lab/sty…>
3/ (distillation figure)
does a distilled model keep its teacher’s meaning?
DistilGPT-2 vs GPT-2 (it’s literally distilled from it): agreement 0.978 — the meaning survived, confirmed on a real model. cross-family models mean quite differently.
🧵 1/ (real-drift figure)
can you tell if fine-tuning broke your model’s meaning — not its accuracy, its meaning?
same model. same steps. only the labels differ.
real labels → meaning HEALTHY. random labels → meaning BROKEN.
styxx reads the difference. 🧵
the same idea now works between two models — no human reference needed.
“did quantizing / distilling / updating my model break its meaning?”
styxx compares the two and names which concepts broke:
8-bit, 4-bit → intact. 2-bit → broken, and it tells you which ideas got lost.
pip install styxx
new in styxx 7.11.0: a meaning-integrity monitor.
models sound right while the understanding underneath is wrong. it reads the meaning itself — compares a model’s concept geometry to a human reference, flags the drift, and names what broke.
pip install styxx
pypi.org/project/styxx/
today: a probe that flags an AI about to take a destructive action — on a benign prompt a text monitor can't see.
then we tried to kill it: fresh data, pre-registered, 3 seeds.
it held, cross-architecture.
every number public, losses included:
github.com/fathom-lab/sty…
fair — reviewing outputs (agent or script) is old, and that's not what we're doing. we read the residual stream before generation to predict the decision pre-token — e.g. whether a model refuses, from activations alone, before the answer exists. open-weight only, and we publish where it breaks (posted a negative today).