Max Marion @maxdoesresearch

my machine learning research account where i tell you abt all my sick experiments | pfp: me w/ https://t.co/XWwMkEg1a1 | personal account: @maxisawesome538
maxisawesome.github.io | San Francisco, CA | Joined December 2022

Tweets: 72
Followers: 456
Following: 97
Likes: 181

Max Marion @maxdoesresearch

2 years ago

Zack is killing it in a paper that takes my previous work and expands on it considerably. Highly recommend reading it and following Zack!

Zack Ankner @ZackAnkner

2 years ago

New paper where we explore using a small LM’s perplexity to prune the pretraining data for larger LMs. We find that small LMs can prune data for up to 30x larger LMs, data pruning works in the overtrained and data-constrained regimes, and more! arxiv.org/abs/2405.20541

11 60 327 73K 253

1 2 19 7K 5

Jonathan Frankle @jefrankle

2 years ago

Fixed it for you, @code_star

Rajko Radovanović @rajko_rad

2 years ago

Incredible performance and efficiency, all Apache 2.0 open, from the amazing @MistralAI team!!! I’m most excited for the SOTA OSS function calling, code and math reasoning capabilities!! Cc @GuillaumeLample @tlacroix6 @dchaplot @mjmj1oo @sophiamyang

3 4 70 56K 10

4 8 89 54K 18

Andrew Canis @andrewcanis

2 years ago

I've added support for Command-R to llama.cpp! Command-R is an exciting new 35B model with 128k context length for RAG and Tool Use. I also converted the model to GGUF format (F16, Q8, Q4, Q2). HF: huggingface.co/andrewcanis/c4… Release: github.com/ggerganov/llam… @cohere @francoisfleuret

3 16 86 9K 30

Max Marion @maxdoesresearch

2 years ago

Data Selection is in vogue

Alon Albalak @AlbalakAlon

2 years ago

{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on🔎Data Selection for LLMs🔍 Training data is a closely guarded secret in industry🤫with this work we narrow the knowledge gap, advocating for open, responsible, collaborative progress arxiv.org/abs/2402.16827

10 72 302 111K 264

1 0 3 930 0

Matei Zaharia @matei_zaharia

2 years ago

Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models. AlphaCode, ChatGPT+, Gemini are examples. In this post, we discuss why this is and emerging research on designing & optimizing such systems. bair.berkeley.edu/blog/2024/02/1…

29 255 1K 320K 819

Max Marion @maxdoesresearch

2 years ago

saw just how much work went into this and it's nothing short of incredible. Grats to the whole team - it's a huge milestone!

Sara Hooker @sarahookr

2 years ago

Today, I am very proud to share what we have been working on for the last 14 months. ✨ Introducing Aya -- a new state of the art for massively multilingual models. 🔥🎉

43 157 996 98K 190

1 0 15 1K 0

Ahmet Üstün @ahmetustun89

2 years ago

Thrilled to announce Aya 🌿, a massively multilingual instruction-tuned LLM, featuring 101 languages and the largest collection of multilingual instruction datasets. Over half of these languages are under-resourced. A monumental effort from @CohereForAI and Aya team 🚀

Cohere Labs @Cohere_Labs

2 years ago

Today, we’re launching Aya, a new open-source, massively multilingual LLM & dataset to help support under-represented languages. Aya outperforms existing open-source models and covers 101 different languages – more than double covered by previous models. cohere.com/research/aya

51 358 1K 701K 507

4 14 97 17K 10

Max ⛅ @maxisawesome538

2 years ago

just saw (Marion et al., 2023) in a paper for the first time 🥲

10 5 81 14K 1

Max Marion @maxdoesresearch

2 years ago

@hongjian_zou heya thanks! All models received the same number of training steps and used the same amount of compute regardless of the dataset pruning. If the dataset was pruned down to 50%, the model trained on that dataset saw each datapoint twice.

0 0 0 57 0
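
The arithmetic in the reply above can be sketched in a few lines. This is only an illustration, assuming the unpruned baseline makes exactly one pass over the full dataset (which is what "pruned down to 50% ... saw each datapoint twice" implies); the function name `passes_over_pruned_data` is hypothetical and not from the paper.

```python
def passes_over_pruned_data(keep_fraction: float) -> float:
    """Number of passes over a pruned dataset under a fixed training budget.

    Assumption (not stated in the thread): the budget equals one pass over
    the full, unpruned dataset, so keeping a fraction f of the data means
    each remaining example is seen 1 / f times.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return 1.0 / keep_fraction


# Keeping 50% of the data means each remaining datapoint is seen twice.
print(passes_over_pruned_data(0.5))
```

Under the same assumption, keeping 25% of the data would mean four passes, so heavier pruning trades dataset variety for repetition at a constant compute cost.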

Max Marion @maxdoesresearch

3 years ago

NeurIPS was so much fun that I'm determined to come back with a paper next year 😤

Sara Hooker @sarahookr

3 years ago

🔥🎉 @maxdoesresearch presents “when less is more: investigating data pruning for pretraining LLMs at scale” Attrib Workshop 2023

1 2 46 11K 7

2 3 42 7K 2

Max Marion @maxdoesresearch

2 years ago

@__femb0t that's right (check my header)

0 0 0 199 1

Max Marion @maxdoesresearch

3 years ago

@sarahookr wow it's @AlbalakAlon 😍

1 0 2 265 0

Ksenia Se @Kseniase_

3 years ago

LLMs improved using available data from the noisy Internet. @CohereForAI researchers achieved unexpected results by pruning data. Their research suggests removing most pretraining data while maintaining performance!

1 11 73 12K 51

Cohere @cohere

3 years ago

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale cohere.com/research/paper… @maxdoesresearch @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr

1 2 11 2K 0

Cohere Labs @Cohere_Labs

3 years ago

In 2022, we launched the Cohere For AI Scholars Program to help close the gap between research experience and opportunity. In our inaugural year, we welcomed 6 talented researchers - @luizapzbn, @lekeonilude, @maxdoesresearch, @aahmadian_, @tedzadouri and Meriem Boubdir.

2 4 26 3K 2

Max Marion @maxdoesresearch

3 years ago

@code_star @CohereForAI I pinky promise bro

0 0 1 42 0

Max Marion @maxdoesresearch

3 years ago

📢New Pretraining Paper 📢 Delighted to share our new paper coming out of @forai_ml : "When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale" Paper: arxiv.org/abs/2309.04564 w/ @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr

8 24 84 29K 43

Sara Hooker @sarahookr

3 years ago

Really proud of our work led by @maxdoesresearch w @ahmetustun89 @luizapzbn @W4ngatang @mziizm 🎉 LM datasets are huge. Is all text needed? How can we measure data quality in this setting? Enter data pruning: removing subsets least valuable while preserving performance.

Cohere Labs @Cohere_Labs

3 years ago

What is “good data?”👩‍🔬 Our recent paper tackles this question via data pruning! We explore several metrics for measuring LLM pretraining data and find that we can remove up to 70% of pretraining data while achieving better test set performance. 📜 arxiv.org/abs/2309.04564

4 37 156 66K 85

3 16 83 17K 22

Max Marion @maxdoesresearch

3 years ago

Your intuitions on the easy/hard data are in line with what we found - very easy data was often user agreements or text that would appear all over the internet, like at the bottom of a webpage. The harder subset is more complicated - some of it was nonsense, but some text, like medical or scientific text, can have high perplexity but could still be useful for certain contexts. Selecting a good validation set would, ironically, be an excellent extension of this line of work 😂

0 0 0 72 0
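
To make the easy/hard split above concrete, here is a rough sketch of selecting a subset by perplexity rank. It assumes per-example perplexity scores have already been computed by some reference model; the function name `select_by_perplexity_rank` and the 20/80 cutoffs are hypothetical, chosen for illustration, and are not the paper's settings.

```python
from typing import List, Sequence


def select_by_perplexity_rank(
    perplexities: Sequence[float], low_pct: float, high_pct: float
) -> List[int]:
    """Return indices of examples whose perplexity rank lies in [low_pct, high_pct).

    Low ranks are the "easy" examples (e.g. boilerplate text repeated across
    the web); high ranks are the "hard" ones (a mix of nonsense and
    specialised text). The cutoffs are percentages of the dataset size.
    """
    if not 0 <= low_pct < high_pct <= 100:
        raise ValueError("need 0 <= low_pct < high_pct <= 100")
    n = len(perplexities)
    # Example indices ordered from lowest to highest perplexity.
    order = sorted(range(n), key=lambda i: perplexities[i])
    start = int(n * low_pct / 100)
    stop = int(n * high_pct / 100)
    return sorted(order[start:stop])


# Toy scores: drop the two easiest and the two hardest of ten examples.
scores = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0]
print(select_by_perplexity_rank(scores, 20, 80))
```

Which slice to keep is the open question the reply points at: a high-perplexity example may be noise or may be valuable domain text, and rank alone cannot tell the two apart.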

Max Marion @maxdoesresearch

3 years ago

@EIFY @forai_ml @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr ...we found that you do need some training in the reference model to get a usable pruning signal. I think it would be a great next step!

0 0 0 41 0

Max Marion @maxdoesresearch

3 years ago

@EIFY @forai_ml @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr Our EL2N experiments are a version of this, in that we use the same parameter/arch setup and use signals from those models as our pruning metric. The setup you mention is possible but was more complicated engineering-wise for us. You would need to do some gradient updates, as...