nickcdryan @nickcdryan

nlp, deep learning, NYC · nickcdryan.com · New York, USA · Joined May 2023

Tweets 472 · Followers 245 · Following 958 · Likes 3K

David @DavidSHolz

2 weeks ago

Most researchers agree that autoregression is best when memory bandwidth is cheap and diffusion is best when FLOPS are cheap. They also admit the future of compute is all FLOPS because memory scaling is hard and scaling FLOPS is easy. So why not go all in on diffusion????

76 75 1K 235K 623

Zachary Nado @zacharynado

4 weeks ago

the sloptimizer field is just getting started with shampoo and muon gen algorithms, the graveyard of adam variants got so bad you can't list them all on a page

Pranav Shyam @recurseparadox

4 weeks ago

Moratorium on new optimizers until we figure out whats going on

6 6 128 51K 8

19 37 316 43K 109

nickcdryan @nickcdryan

3 months ago

@srush_nlp It's nice that everything has a clean referent but bad that it's all informationally flat. They should be named as honestly as the wandb runs: GRPO-tweak1 GRPO-tweak2 GRPO-tweak2+tweak1

0 0 0 391 0

nickcdryan @nickcdryan

5 months ago

Low exploration costs improve craftsmanship. I can think of few trades where free, low-cost exploration doesn't lead to better knowledge and craftsmanship. Software design has been largely, rightly, guided by risk-aversion. A changing explore/exploit balance is a good thing.

Erik Meijer @headinthebox

5 months ago

Slop generators like Ralph, Loom and Gastown treat developers as if they are free resources, replacing costly humans with swarms of agents churning out specs or cloned patterns. My take is that the real disruption is not about free labor, but about the fact that changing your

49 44 447 50K 214

0 0 1 65 0

nickcdryan @nickcdryan

7 months ago

Even simpler: it's just a basic requirement because there isn't enough time or $ to gridsearch your idea. And if gridsearching your idea is the only way to make it work it's probably not worth it anyway.

yi @agihippo

7 months ago

ablations are for the weak. just yolo your runs. (ok, do some small amount of ablations, but don't over do it). instinct is everything in ML and AI.

6 3 145 95K 27

0 0 0 102 0

nickcdryan @nickcdryan

9 months ago

@jxmnop why do you think models won't close the gap on writing good kernels?

1 0 0 2K 0

nickcdryan @nickcdryan

9 months ago

@LysandreJik Agree with the sentiment - those all seemed like big, distant step changes in the early decoder days. And very cool viz. That said...the dates are broken for all those models.

1 0 0 86 0

nickcdryan @nickcdryan

9 months ago

Getting to the heart of the matter here, and fixing it. Heard many times batching is the culprit, but this is the first in-depth explanation I've seen.

Thinking Machines @thinkymachines

9 months ago

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to

230 1K 8K 3.5M 5K

0 0 1 131 0

Brian Huang @brianryhuang

10 months ago

It's one thing to read about reward hacking in the literature and think it's not that bad, it's another thing to do some RL and see that reward hacking really got hands

3 2 56 5K 6

nickcdryan @nickcdryan

10 months ago

@zmkzmkz haha ok that makes sense. Either way, as the model size scales up the cost of that extra unembedding gets amortized away. Good to know for the smaller model comparisons though.

0 0 1 48 0

nickcdryan @nickcdryan

10 months ago

@zmkzmkz Nice, TOP uses fewer flops per step? I was thinking about the extra unembedding in TOP - do you cut out a layer somewhere to compensate?

1 0 0 70 0

nickcdryan @nickcdryan

10 months ago

@vikhyatk @omouamoua

1 0 1 264 0

nickcdryan @nickcdryan

10 months ago

@robinhanson An Atlantic article with some summaries and links out to the research. Interesting article, I don't see this specific interpretation in there though. archive.is/ha0nL

0 0 3 282 1

nickcdryan @nickcdryan

10 months ago

> score is calculated based on "how often words are used together"
> just pick obscure words that don't get used at all
> rejected

KC @amphichrome_

10 months ago

what’s the highest you guys got to on the divergent vocabulary test

595 54 5K 685K 2K

0 0 1 111 0

nickcdryan @nickcdryan

10 months ago

@xeophon Nice eval! Interesting framing though. Would you interpret the task as measuring a behavior, without opining on whether this behavior is good or bad? Or is it just good? I'd say there are clear use cases where this behavior is nice and where it is not nice.

0 0 1 111 0

nickcdryan @nickcdryan

10 months ago

TIL some people are still using ROUGE

elvis @omarsar0

10 months ago

The Illusion of Progress It's well known that there are caveats with benchmarks and metrics that measure LLM capabilities. It's no different for hallucination detection. "ROUGE fails to reliably capture true hallucination" Here are my notes:

10 47 217 28K 147

0 0 1 60 0

nickcdryan @nickcdryan

10 months ago

@immortal_0698 @natolambert Sure, but if you're not forced to use it then it's just a bad feature that you don't use.

1 0 0 15 0

nickcdryan @nickcdryan

10 months ago

@gatelice_ @ddkang Sure, and as you say there are scenarios where a datapoint is worth $100 or $1000 or $1 or $.01. I'm trying to get the authors to provide the evidence they used for their estimates.