Max Marion @maxdoesresearch
my machine learning research account where i tell you abt all my sick experiments | pfp: me w/ https://t.co/XWwMkEg1a1 | personal account: @maxisawesome538 maxisawesome.github.io San Francisco, CA Joined December 2022
Tweets 72
Followers 456
Following 97
Likes 181
Zach is killing it in a paper that takes my previous work and expands on it considerably. Highly recommend reading it and following Zach!
New paper where we explore using a small LM’s perplexity to prune the pretraining data for larger LMs. We find that small LMs can prune data for up to 30x larger LMs, data pruning works in the overtrained and data-constrained regimes, and more! arxiv.org/abs/2405.20541
Fixed it for you, @code_star
Incredible performance and efficiency, all Apache 2.0 open, from the amazing @MistralAI team!!! I’m most excited for the SOTA OSS function calling, code and math reasoning capabilities!! Cc @GuillaumeLample @tlacroix6 @dchaplot @mjmj1oo @sophiamyang
I've added support for Command-R to llama.cpp! Command-R is an exciting new 35B model with 128k context length for RAG and Tool Use I also converted the model to GGUF format (F16, Q8, Q4, Q2) HF: huggingface.co/andrewcanis/c4… Release: github.com/ggerganov/llam… @cohere @francoisfleuret
Data Selection is in vogue
{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on🔎Data Selection for LLMs🔍 Training data is a closely guarded secret in industry🤫with this work we narrow the knowledge gap, advocating for open, responsible, collaborative progress arxiv.org/abs/2402.16827
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models. AlphaCode, ChatGPT+, Gemini are examples. In this post, we discuss why this is and emerging research on designing & optimizing such systems. bair.berkeley.edu/blog/2024/02/1…
saw just how much work went into this and it's nothing short of incredible. Grats to the whole team - it's a huge milestone!
Today, I am very proud to share what we have been working on for the last 14 months. ✨ Introducing Aya -- a new state of the art for massively multilingual models. 🔥🎉
Thrilled to announce Aya 🌿, a massively multilingual instruction-tuned LLM, featuring 101 languages and the largest collection of multilingual instruction datasets. Over half of these languages are under-resourced. A monumental effort from @CohereForAI and Aya team 🚀
Today, we’re launching Aya, a new open-source, massively multilingual LLM & dataset to help support under-represented languages. Aya outperforms existing open-source models and covers 101 different languages – more than double covered by previous models. cohere.com/research/aya
just saw (Marion et al., 2023) in a paper for the first time 🥲
@hongjian_zou heya thanks! All models received the same number of training steps and used the same amount of compute regardless of the dataset pruning. If the dataset was pruned down to 50%, the model trained on that dataset saw each datapoint twice.
Neurips was so much fun that I'm determined to come back with a paper next year 😤
🔥🎉 @maxdoesresearch presents “when less is more: investigating data pruning for pretraining LLMs at scale” Attrib Workshop 2023
@__femb0t that's right (check my header)
LLMs improved using available data from the noisy Internet. @CohereForAI researchers achieved unexpected results by pruning data. Their research suggests removing most pretraining data while maintaining performance!
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale cohere.com/research/paper… @maxdoesresearch @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr
In 2022, we launched the Cohere For AI Scholars Program to help close the gap between research experience and opportunity. In our inaugural year, we welcomed 6 talented researchers - @luizapzbn, @lekeonilude, @maxdoesresearch, @aahmadian_, @tedzadouri and Meriem Boubdir.
@code_star @CohereForAI I pinky promise bro
📢New Pretraining Paper 📢 Delighted to share our new paper coming out of @forai_ml : "When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale" Paper: arxiv.org/abs/2309.04564 w/ @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr
Really proud of our work led by @maxdoesresearch w @ahmetustun89 @luizapzbn @W4ngatang @mziizm 🎉 LM datasets are huge. Is all text needed? How can we measure data quality in this setting? Enter data pruning: removing subsets least valuable while preserving performance.
What is “good data?”👩🔬 Our recent paper tackles this question via data pruning! We explore several metrics for measuring LLM pretraining data and find that we can remove up to 70% of pretraining data while achieving better test set performance. 📜 arxiv.org/abs/2309.04564
Your intuitions on the easy/hard data are on par with what we found - very easy data was often user agreements or text that would appear all over the internet, like at the bottom of a webpage. The harder subset is more complicated - some of it was nonsense, but some text, like medical or scientific text, can have high perplexity but could still be useful for certain contexts. Selecting a good validation set would, ironically, be an excellent extension of this line of work 😂
@EIFY @forai_ml @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr ...we found that you do need some training in the reference model to get a usable pruning signal. I think it would be a great next step!
@EIFY @forai_ml @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr Our EL2N experiments are a version of this, in that we use the same parameter/arch setup and use signals from those models as our pruning metric. The setup you mention is possible but was more complicated engineering-wise for us. You would need to do some gradient updates, as...
Katie-Rose Skelly @KR_Skelly
280 Followers 274 Following Building something new. Cofounder and CTO @known_med (acq by Pathos), ex-@RecursionPharma, Biomedical Informatics + CS at Stanford, dessert enthusiast.
Andreas @andrebloom
34 Followers 270 Following
lth @lottichyp
9 Followers 67 Following Alt. Disclaimer: Personal account. Does not reflect my employer's views on neural network sparsity (or any other topic).
Sriram Eleswarapu @eleswarapu
1K Followers 4K Following Men's Health, Reproductive Urology & Bioengineering @UCLAUrology. Views my own.
Davis Treybig @TreybigDavis
1K Followers 5K Following Early stage investor at Innovation Endeavors, focused on computing infrastructure, data/AI, and tools for builders.
Arpit mohapatra @moharpit
60 Followers 4K Following
Karthik Sundar @stark_reborn
21 Followers 192 Following student, research in Dataset Distillation, Privacy and World Models 🇬🇧
Olergex @Olergex2369
212 Followers 7K Following Elegance is the only beauty that never fades. — Audrey Hepburn
keshav baniya @kingership321
1 Followers 114 Following
Amanda Contreras @AmandaCont745
1 Followers 91 Following Recruiter at Google, hiring AI and ML talent.
Jing Guo @guojing0
271 Followers 3K Following MSc in Math from @uni_regensburg; BSc in Math and CS minor from @UUtah AI4Math & ML/DL & Extremal comb, TCS, and number theory
Reza (Rey) Sanayei @r_sanayei
39 Followers 916 Following ML Research @ScaleAILabs | MSCS @Uarizona | NLP @LabCLU
Sanka Mohottala @Danny__SM
155 Followers 2K Following Lecturer @ University of Sri Jayewardenepura || DL Theory, GNN, Human Action Recognition, Network Science || British Humor || Rowing || Humanist
GruSome @DeepBNN
19 Followers 533 Following
Hero Thousandfaces @1thousandfaces_
11K Followers 2K Following virtual star embryologist; stochastic parrot
Annie Hu @AnnieGraceHu
132 Followers 336 Following @AnthropicAI | Developer of the Daily Set Game (https://t.co/MvEEnWMfDM)
Lwandlolubanzi T Ndeb... @lwandlendebele
2K Followers 7K Following Yashua Emmanuel Jesus Christ is Lord of ALL ! Human Rights Activist AI and Inference , Robotics , Aerospace, Quantum Computing Medoola Tecknologia
Shrinidhi Mahesh @shrinimahesh
28 Followers 771 Following she/her | Looking for full-time Machine Learning Engineer / Research Engineer/ Data Engineer roles from May 2026 | Currently MSEE (ML) @USC
emirovic 🇹🇷🍉 @emrshn13
253 Followers 2K Following psych & cs @McGillU well well well, how the turntables @mcgillailab @theistchronicle
Matthew Alp @MatthewAlp
253 Followers 2K Following 🇨🇦 | Curious about compilers, databases, (noisy) systems of all sorts. Follows are footnotes, not endorsements
Iniyan @heyiniyan
294 Followers 3K Following Building the capability data layer for Indian manufacturing. Thinking in systems, data, and supply chains. Co-founder @Menerixofficial
Ill add something soo... @floreslgabriela
3 Followers 797 Following
Sujal @Sujaltwts
15 Followers 889 Following
Harshit Juneja @impractialdev
362 Followers 6K Following Understanding people and instructing machines. Thoughts and tweets are not mine or my employer's or my neighbour's. Probably they are yours.
mark @thisisnotmark_
549 Followers 2K Following AI/ML Research Lead @ NASA | Founder @ Mosaic Voice AAC. Views my own not my employer’s.
Sabuhi A. @SAbbaszad70856
21 Followers 6K Following
Tasha @TashaPais
3K Followers 3K Following prev rl @softmaxresearch, comp sci @RutgersU @Columbia, robot learning @CAIRLab, co founded @heypocket
Kron @KronicPoseidon
12 Followers 3K Following
LillianFrances @V7mtf3k68bm8S
177 Followers 6K Following Forever chasing serotonin & good WiFi signals 📶💻✨
alex peysakhovich @alex_peys
6K Followers 809 Following partner @shv - interested in ai for biology, also dogs, motorsports, multi-agent systems, and rl
Adam Bali @adam_bal1
147 Followers 1K Following Research Scientist / MLE (ex-Meta) | ML, NLP, Maths | 🏳️🌈
Angelfire Bog @deep_couch
367 Followers 543 Following garbage simulator; the most wretched crab in the bucket.
Alberto Hojel @AlbyHojel
6K Followers 4K Following dreams @roblox // uc berkeley ‘24 // YC w25. views my own
Ricardo Monti @RicardoMonti9
559 Followers 2K Following Previously @datologyai, CTRL-labs/META, @GatsbyUCL, @Imperial_Stats. Frequently on @caltrain. @pratyushmaini fan (one of many)
Sean Kulinski @seankski
185 Followers 229 Following researcher @DBRXMosaicAI - focusing on RL for improving agents. Ex @MSFTResearch and @Livermore_Lab. Ph.D. @PurdueECE
Matei Zaharia @matei_zaharia
49K Followers 1K Following CTO @Databricks and prof @UCBerkeley. Working on data + AI, @ApacheSpark, @DeltaLakeOSS, @MLflow, @DSPyOSS, @GEPA_ai. https://t.co/nmRYAKG0LZ
John Dang @johnamqdang
1K Followers 1K Following AI Researcher, Founding Team at @adaption_ai | Prev @Cohere_Labs @Cohere | LLM Post-Training, RL, Reasoning, Multimodality, Multilinguality
Ishan Khatri @ CVPR @i_ikhatri
894 Followers 863 Following Perception for "embodied AI" at StackAV. Visiting Researcher @CMU_robotics. Formerly @motionaldrive @argoai. Opinions are my own.
Justin Kay @__justinkay
1K Followers 2K Following PhD student @MIT. Co-founder & CTO https://t.co/1LGXaee5ui. https://t.co/yBlEyqXEOa
Guilherme Penedo @gui_penedo
4K Followers 2K Following Co-founder & CEO @macrodata_labs | Formerly pre-training data @huggingface 🤗. Lisboeta 🇵🇹
Rong Ching Chang @AnnCC12
845 Followers 8K Following Fascinated by human-multi agent interaction. CS Ph.D. in progress @ucdavis
Wei-Yin Ko @weiyinko_ml
286 Followers 217 Following
Saurabh Shah @saurabh_shah2
4K Followers 2K Following human-ing & AI-ing @humansand prev @allen_ai @Apple @Penn 🎤dabbler of things🎸 🐈⬛enjoyer of cats 🐈 and mountains🏔️he/him
Pradyumna (in Bay Are... @PradyuPrasad
11K Followers 3K Following Abundance mindset enjoyer. Evals @ @elicitorg Latest blog post: https://t.co/c8Tau6OBiD Follow for tweets on AI progress and economic growth
❤️🔥 xiq @exgenesis
5K Followers 2K Following epistemics https://t.co/iBy34G03Pi https://t.co/b3L7uD5lNz https://t.co/1ZCaxKMOgA
Saksham @sgdescent
1K Followers 2K Following Interested in making LLMs go brrrrr x+N: @datologyai and @openai x: @LTIatCMU x-N: https://t.co/ht5ObQh7RV & Program Synthesis with LLMs @ProseMsft
Cosmin Negruseri @cosminnegruseri
4K Followers 4K Following ex Pinterest Search / Homefeed, https://t.co/0VwMvjB9Xh, Altiscale, Google Ads, Search, Google Code Jam organizer
Meriem @mellem_boo
85 Followers 115 Following
Steven Ndung'u, PhD @stevenndungu_
188 Followers 2K Following Data Scientist || Artificial Intelligence || Fintech || Machine Learning || Astronomy || GIS and Remote Sensing
Andreas Kirsch ... @BlackHC
16K Followers 7K Following My opinions only here. 👨🔬 RS DeepMind 1.8y, Midjourney 1y 🧑🎓 DPhil AIMS 4.5y 🧙♂️ RE DeepMind 1y 📺 SWE Google 3y 🎓 TUM 👤 @nwspk
Junyuan "Jason" Hong @hjy836
1K Followers 3K Following Incoming AP @NUSingapore ECE. Now @MassGenBrigham, prior @VITAGroupUT @MLFoundations . MLSys Rising Star 2024. Interests: Responsible AI, Cognitive Health.
Yong Zheng-Xin @yong_zhengxin
2K Followers 2K Following preparedness (astra fellow) || phd in em-dashes @BrownCSDept || did some research @Cohere_Labs @MetaAI || views my own
Ted Zadouri @tedzadouri
1K Followers 332 Following PhD Student @PrincetonCS @togethercompute | Previously: @cohere @UCLA
Jascha Sohl-Dickstein @jaschasd
30K Followers 816 Following Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.
Alec Radford @AlecRad
71K Followers 303 Following
Ahmet Üstün @ahmetustun89
2K Followers 674 Following Code Agents Lead @cohere. Previously Research Scientist @Cohere_Labs, @GroNlp, @naverlabseurope.
Marzieh Fadaee @mziizm
2K Followers 796 Following exploring the longitude problem of AI. Head of @Cohere_Labs. PhD from @UvA_Amsterdam. https://t.co/YI5NC5J5e4. زن، زندگی، آزادی
Esra'a Saleh @TheEsraaSaleh
790 Followers 3K Following AI / ML / RL research @Mila_Quebec / @UMontreal, prev. research @Ualberta, @AmiiThinks, @rlai_lab. Open science community lead @Cohere_Labs .
Bhavnick Minhas @minhash
1K Followers 972 Following Subagent Spawner @ChonkieAI (YC X25), Barista @better_auth, @IITGuwahati Alum, Ex community lead @Cohere_Labs 🩵
Arash Ahmadian @aahmadian_
2K Followers 727 Following Research Scientist @GoogleDeepmind, Gemini RL & post-training, Gemini 3. prev: @Cohere @CohereForAI
Alon Albalak @AlbalakAlon
2K Followers 622 Following Open-endedness, Data-centric AI @LilaSciences Previously: RS @synth_labs, PhD @ucsbNLP, Internships @AIatMeta @MSFTResearch All views are my own
Srishti Gureja @srishti_gureja
2K Followers 424 Following lead applied research eng // technical ai safety research @ spar // prev - cohere thinking about ml systems and safety
Luiza Pozzobon @luizapzbn
465 Followers 324 Following PhD student @uwcse @uwnlp | prev scholar @CohereForAI | MSc @ Unicamp, Brazil
Vipul Gupta @vipul_1011
3K Followers 1K Following Research Scientist @Scale_AI. Past: PhD @Penn_State, FAIR @AIatMeta, @IITDelhi. Interested in model evaluation and AI Safety. I don’t hallucinate
niki parmar @nikiparmar09
17K Followers 943 Following Working @Anthropic. Views expressed here are my own.
Ashish Vaswani @ashVaswani
31K Followers 2K Following
Leo Gao @nabla_theta
13K Followers 580 Following working on AGI alignment. prev: GPT-Neo, the Pile, LM evals, RL overoptimization, scaling SAEs to GPT-4, interp via circuit sparsity. EleutherAI cofounder.
Character.AI @character_ai
149K Followers 13 Following Unleash your imagination—chat, create, explore.
Stanislas Polu @spolu
25K Followers 641 Following co-founder+engineer(https://t.co/SXBR0l9TrF); alumni(https://t.co/z6zJ8xaKGI, https://t.co/CvVTA1CHAo, https://t.co/WOVEe2aLcK, https://t.co/ui9I4Nj7o1);
Amanda Dsouza @amanda_dsouza
237 Followers 501 Following Applied research scientist @SnorkelAI. Previous: @heyjasperai, @fractalAI. MS (ML) @gtcomputing.
Roi Cohen @roicohen9
132 Followers 122 Following Master’s student at @TelAvivUni. Working on @NLProc https://t.co/3MKU7afk5V
Gabriel Peyré @gabrielpeyre
100K Followers 446 Following @CNRS researcher at @ENS_ULM. One tweet a day on computational mathematics.
Pablo Samuel Castro @pcastr
14K Followers 835 Following Señor swesearcher @ Google DeepMind. Adjunct prof @ U de Montreal & Mila. Musician. From 🇪🇨 living in 🇨🇦.
Jay Alammar @JayAlammar
50K Followers 1K Following Machine Learning Researcher and writer https://t.co/5GlbofAHs0. O'Reilly Author https://t.co/Fl3uPAZHLg. LLM Builder @Cohere.