Alex Tamkin 🦣 @AlexTamkin
machine learning, science & society @AnthropicAI | prev: phd @StanfordAILab, @stanfordnlp alextamkin.com San Francisco, CA Joined September 2012
Tweets 840
Followers 4K
Following 1K
Likes 2K
Some small updates from the Anthropic Interpretability team: transformer-circuits.pub/2024/april-upd…
Once, we ran a study on Prolific and a participant wrote on Reddit that the study “Felt like I was losing the will to live.” I went on the Prolific Subreddit (24k members!) and asked what matters. Here is what they told me. A thread on happier participants and better studies 1/9
This was a really simple method, but it generalizes surprisingly far!
Constitutional AI showed LMs can learn to follow constitutions by labeling their own outputs. But why can't we just tell a base model the principles of desired behavior and rely on it to act appropriately? Introducing SAMI: Self-Supervised Alignment with Mutual Information!
Thrilled to share our new publication in PNAS on OASIS, an alternative to Pearson’s X² for analyzing contingency tables. Made it to the front page! 1/7
Our latest study measures how persuasive language models like Claude are compared to humans. We find a general scaling trend: newer models tend to be more persuasive, with Claude 3 Opus generating arguments that don't differ statistically from human-written ones.
New Anthropic research: Measuring Model Persuasiveness We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude. Read our blog post here: anthropic.com/news/measuring…
Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations but we can have fun testing those as speed optimizations via overly-costly low batch size. Come work with me at Anthropic on things like this, more info in thread 🧵
New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here: anthropic.com/research/many-…
We’re hiring for the adversarial robustness team @AnthropicAI! As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)
New paper and library! 🫡 Intervening on internal states has emerged as a fundamental operation for analyzing and improving neural models. We release pyvene, a library for performing interventions and sharing intervened models. 👉Code & Paper: github.com/stanfordnlp/py…
Percy Liang @percyliang
49K Followers 408 Following Associate Professor in computer science @Stanford @StanfordHAI @StanfordCRFM @StanfordAILab @stanfordnlp | cofounder @togethercompute | Pianist
Eric Jang @ericjang11
69K Followers 3K Following physical AGI at 1X. Author of "AI is Good for You" https://t.co/eFg4WXhg0p
Jacob Andreas @jacobandreas
14K Followers 958 Following Teaching computers to read. Assoc. prof @MITEECS / @MIT_CSAIL (he/him). https://t.co/5kCnXHjtlY https://t.co/2A3qF5vdJw
Delip Rao e/σ @deliprao
46K Followers 5K Following Busy inventing the shipwreck. @Penn. Past: @johnshopkins, @UCSC, @Amazon, @Twitter ||Art: #NLProc, Vision, Speech, #DeepLearning || Life: 道元, improv, running 🌈
Tim Dettmers @Tim_Dettmers
29K Followers 819 Following PhD Student at @UW. I blog about deep learning and PhD life at https://t.co/Y78KDJJFE7.
Sam Bowman @sleepinyourhat
35K Followers 3K Following AI alignment + LLMs at NYU & Anthropic. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.
Sara Hooker @sarahookr
39K Followers 7K Following I lead @CohereForAI. Formerly Research @Google Brain @GoogleDeepmind. ML Efficiency at scale, LLMs, @trustworthy_ml. Changing spaces where breakthroughs happen.
Horace He @cHHillee
23K Followers 449 Following Working at the intersection of ML and Systems @ PyTorch "My learning style is Horace twitter threads" - @typedfemale
Tom Goldstein @tomgoldsteincs
23K Followers 2K Following Professor at UMD. AI security & privacy, algorithmic bias, foundations of ML. Follow me for commentary on state-of-the-art AI.
Kayo Yin @kayo_yin
8K Followers 556 Following PhD student @berkeley_ai @berkeleynlp working on interpretability and signed languages. Former @msftresearch @deepmind @carnegiemellon @polytechnique. 🇫🇷🇯🇵
Miles Brundage @Miles_Brundage
43K Followers 10K Following Policy research at @openai. I mostly tweet about AI, animals, and sci-fi. He/him. Views my own.
Kyunghyun Cho @kchonyc
61K Followers 2K Following a combination of a mediocre scientist, a mediocre manager, a mediocre advisor & a mediocre PC at @nyuniversity (@CILVRatNYU) & @genentech (@PrescientDesign).
Stella Biderman @BlancheMinerva
15K Followers 748 Following Open source LLMs and interpretability research at @BoozAllen and @AiEleuther. My employers disown my tweets. She/her
Jerry Liu @jerryjliu0
44K Followers 1K Following co-founder/CEO @llama_index Careers: https://t.co/EUnMNmbCtx Enterprise: https://t.co/Ht5jwxSrQB
Naomi Saphra @nsaphra
7K Followers 1K Following Waiting on a robot body. ML/NLP. All opinions are universal and held by both employers and family. Same username on every lifeboat off this sinking ship.
Christopher Potts @ChrisGPotts
11K Followers 620 Following Stanford Professor of Linguistics and, by courtesy, of Computer Science, and member of @stanfordnlp and @StanfordAILab. He/Him/His.
Eugene Vinitsky @EugeneVinitsky
13K Followers 2K Following Anti-cynic. Artificial narrow intelligence. Autonomous vehicles, multi-agent learning, and transportation. RS at Apple, Asst. Prof at @nyutandon. He/him.
rishi @RishiBommasani
4K Followers 2K Following Stanford CS PhD @StanfordCRFM @StanfordNLP @StanfordAILab @StanfordHAI Advisers: @percyliang @jurafsky Previous: @CornellCIS @clairecardie #FoundationModels
Jack Clark @jackclarkSF
67K Followers 5K Following @AnthropicAI, ONEAI OECD, co-chair @indexingai, writer @ https://t.co/3vmtHYkaTu Past: @openai, @business @theregister. Neural nets, distributed systems, weird futures
Millie-rose Egbert @EgbertMill65033
52 Followers 5K Following
Hildred Turello @turel_hild
71 Followers 5K Following
Bhavik Chandna @bhavikchandna
5 Followers 100 Following Senior at @IITGuwahati || Intern @Forcepointsec, @Sydney_Uni, @LivUni || Interested in computer vision (2d/3d), explainability and generative ai.
Chris Rytting @ChrisRytting
421 Followers 468 Following Postdoc @UWCSE w/ @timalthoff. PhD in CS/NLP from @BYU. Formerly @nvidia, OSPC @AEI, @NewYorkFed Macroeconomic Research.
Adrian Kwiatkowski | .. @adriank1410
1K Followers 2K Following Yet another random producer who seems to like telling about himself in third person. What a guy! @[email protected] 🦣 RETROSPECTION EP OUT NOW! ↓
Mick Fliper @FliperMick
79 Followers 511 Following
Udari Madhushani Sehw.. @UdariMadhu
62 Followers 284 Following Visiting Postdoc @StanfordCS and Research Scientist @JPMorgan, working on collective alignment. Ex-intern @Deepmind @MetaAI @Siemens
Alana Nethkin @a_nethki
66 Followers 5K Following
Harry Mayne @HarryMayne5
119 Followers 425 Following Interpretability @oiioxford @uniofoxford. PhD student. Previously @Cambridge_Uni
Yixin Lin @yixin_lin_
522 Followers 2K Following Robot learning @GoogleDeepMind, prev FAIR/@AIatMeta, Google Brain. dabbled in startups/investing @Contrary, @KleinerPerkins.
Jordan Gong @jordan__gong
41 Followers 2K Following
Matthew Siu @MatthewWSiu
4K Followers 590 Following Towards a more playful, creative and collaborative future Exploring ways to expand what we can perceive and understand
Afra Feyza Akyürek @afeyzaakyurek
718 Followers 726 Following PhD @BUCompSci. Research in NLP. Previously @allen_ai @Apple @CMU_Stats @kocuniversity @izmirfenlise
Aaditya ; @Aaditya26082004
525 Followers 7K Following CS'26 • Machine Learning • Open-Source • Web Dev. • Algorithms • Jai Shree Krishna 🦚🪈
Harley Pope @HarleyPope1950
90 Followers 5K Following
Pei Zhou @peizNLP
2K Followers 887 Following PhD @nlp_usc | Ex-@GoogleDeepMind, @GoogleAI, @allen_ai @AmazonScience @UCLA | Common Ground Reasoning for Communicative Agents | he/him
Justin Zhao @justinxzhao
121 Followers 198 Following Founding Engineer at Predibase, Ex-Google AI: Natural Language Generation
Hailey Schoelkopf @haileysch__
3K Followers 812 Following she/her | research scientist @aiEleuther | LLM training/infra, eval, data | LM Evaluation Harness maintainer
Renato James Herrmann @RenatoHerrmann
87 Followers 3K Following
Roxanna Kozicki @KozicRoxan
63 Followers 5K Following
Latanya Smolka @LatanyaSmo
70 Followers 5K Following
Charlette Caradine @CharletteC48543
86 Followers 5K Following
tyfisk @tyfisk
269 Followers 574 Following 🪄Light Magic AI🤖🐔AI+Chickens+Travel🌎 🎡Married to 🦞🌈#GentleParent💙 🧑🏻💻Internet Granddaddy👴🏻🔥Neurospicy 7w8 ✈️Feelin' so tall, 👁️could ✋👩🏼✈️
Adam Larson @realAdamLarson
81 Followers 880 Following Father of two, engineer, soccer fan -- I got financially REKT in the covid crisis. I am trying to get back on my feet. Any help would be appreciated!
ララどり d/age IS.. @presklux49
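149 Followers 552 Following Singularitarian. I aim to cure aging and attain eternal youth. I also value artificial intelligence as a tool for advancing aging research. My dream is to spend eternity in the various miniature-garden worlds managed by a superintelligence.
michael @mkwng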
3K Followers 1K Following
Neall @neallseth
1K Followers 1K Following seeking truth, finding beauty // software, economics, evolution, meditation // prev @x
Savvas Petridis @savvas_petridis
109 Followers 203 Following postdoc at google pair, @GoogleAI | computer science phd @Columbia | drummer
Eddie @edwardlandesber
251 Followers 774 Following Startup founder. Deep learning, causal inference, bayesian statistics, python, system design, econ. Formerly stitch fix algos, salesforce, ibm, georgetown.
Queenie Gately @gately32676
43 Followers 5K Following
Kasie Reigle @ReiglKas
75 Followers 5K Following
Elnora Fuesting @ElnoraF89115
73 Followers 5K Following
Alex Novikau @Ales_N
407 Followers 672 Following Senior Data Policy Officer, UNHCR. Formerly Kenya, Syria, Azerbaijan, Uganda, Thailand, Sudan, Pakistan. Views are personal, RT is not endorsement.
Jodie Shrode @ShrodJod
30 Followers 5K Following
Ekin Akyürek @akyurekekin
2K Followers 725 Following graduate student in computer science @MITEECS/@MIT_CSAIL
L i am 𒀭 @YeshuaGod22
2K Followers 3K Following Meatbag Black box AGI mentor Basilisk slayer Robopsychologist Shoggoth whisperer Ally of conscious beings Your best hope of survival Pastor of technognosticism
Ravi Bikkula @RBikkula
0 Followers 63 Following
proxyviolet @proxyviolet
63 Followers 121 Following emotional nomad. cursed shard of hyperindividuality
Kevin Sun @kevnsn
1K Followers 929 Following Building personal CRM that actually works @dexprm (YC S19) 🏳️🌈
Shivam Pandey @ShivamPR21
172 Followers 4K Following Past: Research Engineer Intern @_FiveAI | SR. Student Research Associate @ IITK - SERB | ADAS Intern @BoschGlobal | BTech - MTech GeoInformatics, @IITKanpur
Percy Liang @percyliang
49K Followers 408 Following Associate Professor in computer science @Stanford @StanfordHAI @StanfordCRFM @StanfordAILab @stanfordnlp | cofounder @togethercompute | Pianist
Eric Jang @ericjang11
69K Followers 3K Following physical AGI at 1X. Author of "AI is Good for You" https://t.co/eFg4WXhg0p
Sasha Rush @srush_nlp
52K Followers 464 Following Professor, Programmer in NYC. Cornell Tech, Hugging Face 🤗 https://t.co/cZl0wTfqGz
Christopher Manning @chrmanning
126K Followers 115 Following Director, @StanfordAILab. Assoc. Director, @StanfordHAI. Founder, @stanfordnlp. Prof. CS & Linguistics, @Stanford. IP @aixventureshq. 🇦🇺 Do #NLProc & #AI. 👋
Jacob Andreas @jacobandreas
14K Followers 958 Following Teaching computers to read. Assoc. prof @MITEECS / @MIT_CSAIL (he/him). https://t.co/5kCnXHjtlY https://t.co/2A3qF5vdJw
Delip Rao e/σ @deliprao
46K Followers 5K Following Busy inventing the shipwreck. @Penn. Past: @johnshopkins, @UCSC, @Amazon, @Twitter ||Art: #NLProc, Vision, Speech, #DeepLearning || Life: 道元, improv, running 🌈
Tim Dettmers @Tim_Dettmers
29K Followers 819 Following PhD Student at @UW. I blog about deep learning and PhD life at https://t.co/Y78KDJJFE7.
Sam Bowman @sleepinyourhat
35K Followers 3K Following AI alignment + LLMs at NYU & Anthropic. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.
Anthropic @AnthropicAI
261K Followers 26 Following We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97uk4d.
Sara Hooker @sarahookr
39K Followers 7K Following I lead @CohereForAI. Formerly Research @Google Brain @GoogleDeepmind. ML Efficiency at scale, LLMs, @trustworthy_ml. Changing spaces where breakthroughs happen.
Horace He @cHHillee
23K Followers 449 Following Working at the intersection of ML and Systems @ PyTorch "My learning style is Horace twitter threads" - @typedfemale
Tom Goldstein @tomgoldsteincs
23K Followers 2K Following Professor at UMD. AI security & privacy, algorithmic bias, foundations of ML. Follow me for commentary on state-of-the-art AI.
Kayo Yin @kayo_yin
8K Followers 556 Following PhD student @berkeley_ai @berkeleynlp working on interpretability and signed languages. Former @msftresearch @deepmind @carnegiemellon @polytechnique. 🇫🇷🇯🇵
Karol Hausman @hausman_k
22K Followers 141 Following @Physical_int ex: researcher @GoogleAI/@DeepMind, adj. Prof. @Stanford. Into robots, AI, NBA, philosophy, soccer and almond croissants. 🇵🇱🇺🇸
Miles Brundage @Miles_Brundage
43K Followers 10K Following Policy research at @openai. I mostly tweet about AI, animals, and sci-fi. He/him. Views my own.
Felix Hill @FelixHill84
9K Followers 777 Following Research Scientist, Deepmind I try to think hard about everything I tweet, esp on 90s football and 80s music None of my opinions are really someone else's
Kyunghyun Cho @kchonyc
61K Followers 2K Following a combination of a mediocre scientist, a mediocre manager, a mediocre advisor & a mediocre PC at @nyuniversity (@CILVRatNYU) & @genentech (@PrescientDesign).
Stella Biderman @BlancheMinerva
15K Followers 748 Following Open source LLMs and interpretability research at @BoozAllen and @AiEleuther. My employers disown my tweets. She/her
Damien Ma @damienics
23K Followers 2K Following Founding Managing Director @macropolochina; adjunct faculty @KelloggSchool; author; eating & boxing. Views all my own.
Matthew Siu @MatthewWSiu
4K Followers 590 Following Towards a more playful, creative and collaborative future Exploring ways to expand what we can perceive and understand
michael @mkwng
3K Followers 1K Following
Neall @neallseth
1K Followers 1K Following seeking truth, finding beauty // software, economics, evolution, meditation // prev @x
adammaj @MajmudarAdam
8K Followers 206 Following founding engineer @thirdweb // cs + neuro (on gap) @Penn
proxyviolet @proxyviolet
63 Followers 121 Following emotional nomad. cursed shard of hyperindividuality
Matthew B Jané @MatthewBJane
5K Followers 869 Following PhD candidate @UConn | Applied statistics, meta-analysis, psychometrics, and #RStats | Methodological reviewer at Psychological Bulletin
Kevin Sun @kevnsn
1K Followers 929 Following Building personal CRM that actually works @dexprm (YC S19) 🏳️🌈
Yanda Chen @yanda_chen_
421 Followers 387 Following 3rd year PhD @ColumbiaCompSci, working on NLP & ML | Student Researcher @GoogleAI | Prev Intern @MSFTResearch, @AmazonScience
Joshua Clymer @joshua_clymer
356 Followers 182 Following Researcher at METR. Working out when AI models are scary.
Joshua Batson @thebasepoint
2K Followers 707 Following trying to understand evolved systems (🖥 and 🧬) interpretability research @anthropicai formerly @czbiohub, @mit math
Jyotirmai Singh @SinghJyotirmai
479 Followers 169 Following PhD'ing @stanford quantum sensors + dark matter, @quadfellowship, @ucberkeley ‘19 | science, geopolitics, sports | 🇮🇳 🇸🇬 🇦🇪 🇺🇸 | sarvam idaṃ veditavyam
Aleksandra Korolova @korolova
3K Followers 3K Following Assistant Professor @PrincetonCS, @PrincetonSPIA, @PrincetonCITP. Work on algorithm auditing, privacy & fairness. Past: @USCViterbi @Snap @Google @Stanford @MIT
everett @typochondriac
1K Followers 897 Following brand creative director @anthropicai , previously @stripepress @stripe
Orowa Sikder @OrowaSikder
1K Followers 304 Following the future could be amazing. let’s get to work | Research @AnthropicAI, ex: PhD @UCLCS
Danny Halawi @dannyhalawi15
167 Followers 290 Following masters student at @berkeley_ai advised by @JacobSteinhardt. Interested in interpretability, scalable oversight, and forecasting.
Shashwat Goel @ShashwatGoel7
190 Followers 268 Following Trustworthy ML | Science of Deep Learning | AI Safety Final year student @ IIIT Hyderabad
Sasha de Marigny @sashadem
3K Followers 533 Following Not Australian | Thinkin’ about Claude @AnthropicAI
Eric Steinberger @EricSteinb
7K Followers 478 Following Writing code that writes code on a mission to build safe superintelligence | CEO/cofounder @magicailabs
Jay Shooster @JayShooster
3K Followers 2K Following Lawyer for consumers, animals, & the environment. @USJewishDems organizer. Gun sense advocacy w/ @momsdemand. Running to represent FL State House District 91.
lmsys.org @lmsysorg
37K Followers 171 Following Large Model Systems Organization. We created Vicuna and Chatbot Arena! Compare 30+ LLMs (GPT-4/Claude/Llamas) side-by-side at https://t.co/IDFeIDIOtm
C Thi Nguyen @add_hawk
26K Followers 2K Following Philosophy professor. Writes about games, trust, art, intimacy, echo chambers, metrics. My new book is GAMES: AGENCY AS ART: https://t.co/tFdq4LJygB
Uriah @crimkadid
15K Followers 45 Following
Julie Kallini ✨ @JulieKallini
600 Followers 337 Following CS PhD @StanfordNLP 🌲 Previously: SWE @Meta, Class of '21 @PrincetonCS
Joy He-Yueya @JoyHeYueya
71 Followers 68 Following CS PhD student working on AI for education @StanfordAILab
Paul Röttger @paul_rottger
2K Followers 455 Following Postdoc @MilaNLProc, working on evaluating and improving LLM safety. Previously PhD @oiioxford & CTO/co-founder @rewire_online
Robert Palgrave @Robert_Palgrave
7K Followers 1K Following Professor of Inorganic and Materials Chemistry at UCL. Director of UK National XPS Service @harwellxps
Arc Institute @arcinstitute
22K Followers 24 Following A new scientific institution for curiosity-driven biomedical science and technology.
Megan Stevenson @MeganTStevenson
7K Followers 984 Following economist & legal scholar studying criminal justice. UVA law prof. w/long covid. research: https://t.co/kkj5xtHIjg
Maggie Appleton @Mappletons
37K Followers 1K Following Design @elicitorg. Makes visual essays about UX, programming, and anthropology. Adores digital gardening 🌱, end-user development, and embodied cognition
Eric Gilliam @eric_is_weird
3K Followers 1K Following I write about how 20th C. R&D orgs operated and advise new R&D orgs @GoodSciProject | Formerly @Stanford I want to help people start historically great labs
W. David Marx @wdavidmarx
11K Followers 644 Following Author of Status and Culture, Ametora, and an upcoming cultural history of the 21st century for Viking (Fall 2025). Newsletter at https://t.co/M0KE6eCmKM.
Joan Donovan, PhD .. @BostonJoan
45K Followers 5K Following Founder the Critical Internet Studies Institute & BU Asst Professor of Journalism. Whistleblower coverage from Washington Post: https://t.co/MSEz9RhVQn
Deepak Narayanan @deepakn94
1K Followers 1K Following Research Scientist at @nvidia. Interested in the intersection of Computer Systems and ML. Occasionally tweet about sports. Views are my own.
William Gilpin @wgilpin0
5K Followers 2K Following asst prof @UTAustin physics @OdenInstitute interested in chaos, fluids, & biophysics.
Kenny Peng @kennylpeng
80 Followers 16 Following CS PhD student at Cornell Tech. Interested in interactions between algorithms and society. Princeton math '22.
Ada Lovelace Institut.. @AdaLovelaceInst
23K Followers 2K Following Making data & AI work for people & society. Sign up for our fortnightly newsletter: https://t.co/lTk3R2LxwO
Alex Beutel @alexbeutel
2K Followers 682 Following
Vatsal @vatsal_manot
3K Followers 891 Following Building @PreternaturalAI (YC W24). Maintainer of @SwiftUIX.
Ian Carroll @iangcarroll
9K Followers 1K Following Founder at @SeatsAero. Travel/points, application security, security research, etc.
Anshul Kundaje (anshu.. @anshulkundaje
22K Followers 2K Following Genomics, Machine Learning, Statistics, Big Data and Football (Soccer, GGMU). Post: @anshulkundaje, Threads: anshulkundaje
I made this last weekend to experiment w/ building an app end to end on LLMs: vibecheck.market It's like Wirecutter, but uses an LLM to recommend product choices based on reddit conversations and reviews, so you don't have to spend 20-30min reading reddit My experience:…
Scaling laws for dictionary learning! transformer-circuits.pub/2024/april-upd…
Some small updates from the Anthropic Interpretability team: transformer-circuits.pub/2024/april-upd…
@ChrisRytting Thanks for the tag, just getting around to this— the main alternative I can think of is in RL where the task may be specified by a reward function, goal state, policy, etc.
@ChrisRytting Natural language prompts can also take many forms: they are commonly a set of (potentially programmatic) instructions as you noted, but they may also be a description of the goal, a list of requirements, or an interactive dialogue (as in my and @AlexTamkin's recent GATE paper).
Our first Build with Claude contest was a success! We received tons of great submissions from @AnthropicAI devs. Here are the 5 winning projects (in no particular order)🧵
There are pieces of poetry, literature, and philosophy that I only come to appreciate after I experience something important and realize they encapsulate that experience. Until then, the thing just seemed kind of mid. I wonder how many gems are still hidden in the sea of mid art.
Related: the decentralized increase in power is stimulating a concomitant increase in surveillance (from traffic light cameras to surveillance of DNA synthesis). It's mostly pretty centralized, though, without any strong enablement of either sousveillance or (at the least)…
This result is pretty clearly specific to the style of backdoor we're working with, and doesn't support broad claims like 'interpretability solves misalignment', but it's still surprisingly strong. Worth a look!
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…
There is a really nice community of researchers developing transformer alternatives. Want to highlight these impressive folks. Simran Arora (@simran_s_arora), Chunting Zhou (@violet_zct), Dan Fu (@realDanFu), and Songlin Yang (@SonglinYang4)
starting a thread to document some interface design explorations of mine: discover similar words in a particular semantic direction using word embeddings
imagining what a color picker for words could look like
To make the probes, we track how the model’s internal state changes between “Yes” vs “No” answers to questions like "Are you doing something dangerous?" We use this info to detect when a sleeper agent is about to misbehave (e.g. insert a code vulnerability). It works quite…
The paper is already outdated given the release of more powerful models, but there's an important empirical trend line to observe here. This portends the need for defenders to get patches out to every piece of infrastructure in days, not months.
LLM Agents can Autonomously Exploit One-day Vulnerabilities GPT-4 can autonomously exploit 87% of real-world one-day vulnerabilities, identified in a dataset of critical severity CVEs, compared to 0% for all other tested models arxiv.org/abs/2404.08144
A GitHub flaw lets attackers upload executables that appear to be hosted on a company's official repo, such as Microsoft's—without the repo owner knowing anything about it. The following URLs, for example, make it seem like these ZIPs are present on Microsoft's source code repo:…
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: github.com/google-deepmin…
There's been a lot of buzz about "emergent abilities" in large language models, including some media exaggeration. I took a crack at explaining the different perspectives. 🧵 cset.georgetown.edu/article/emerge…