Most researchers agree that autoregression is best when memory bandwidth is cheap and diffusion is best when FLOPS are cheap. They also admit the future of compute is all FLOPS because memory scaling is hard and scaling FLOPS is easy. So why not go all in on diffusion????
the sloptimizer field is just getting started with shampoo and muon gen algorithms, the graveyard of adam variants got so bad you can't list them all on a page
@srush_nlp It's nice that everything has a clean referent but bad that it's all informationally flat.
They should be named as honestly as the wandb runs:
GRPO-tweak1
GRPO-tweak2
GRPO-tweak2+tweak1
Low exploration costs improve craftsmanship.
I can think of few trades where free or low-cost exploration doesn't lead to better knowledge and craftsmanship.
Software design has been largely, rightly, guided by risk-aversion. A changing explore/exploit balance is a good thing.
Slop generators like Ralph, Loom and Gastown treat developers as if they are free resources, replacing costly humans with swarms of agents churning out specs or cloned patterns.
My take is that the real disruption is not about free labor, but about the fact that changing your
Even simpler: it's just a basic requirement because there isn't enough time or $ to gridsearch your idea.
And if gridsearching your idea is the only way to make it work it's probably not worth it anyway.
@LysandreJik Agree with the sentiment - those all seemed like big, distant step changes in the early decoder days. And very cool viz.
That said...the dates are broken for all those models.
Getting to the heart of the matter here, and fixing it.
Heard many times batching is the culprit, but this is the first in-depth explanation I've seen.
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”
We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
It's one thing to read about reward hacking in the literature and think it's not that bad, it's another thing to do some RL and see that reward hacking really got hands
@zmkzmkz haha ok that makes sense.
Either way, as the model size scales up the cost of that extra unembedding gets amortized away. Good to know for the smaller model comparisons though.
@robinhanson An Atlantic article with some summaries and links out to the research.
Interesting article, I don't see this specific interpretation in there though.
archive.is/ha0nL
@xeophon Nice eval! Interesting framing though.
Would you interpret the task as measuring a behavior, without opining on whether this behavior is good or bad? Or is it just good?
I'd say there are clear use cases where this behavior is nice and where it is not nice.
The Illusion of Progress
It's well known that there are caveats with benchmarks and metrics that measure LLM capabilities.
It's no different for hallucination detection.
"ROUGE fails to reliably capture true hallucination"
Here are my notes:
@gatelice_ @ddkang Sure, and as you say there are scenarios where a datapoint is worth $100 or $1000 or $1 or $.01.
I'm trying to get the authors to provide the evidence they used for their estimates.