@maharshii what is your workflow for writing kernels? i.e. if you wanted to write a GDN kernel and you didn't know anything about GDN, how would you start? and also what excites you about writing kernels?
I see a lot of enthusiasm about building sovereign models on my timeline.
That's great to hear and India needs it, BUT.. building a Fable-class model is a compute and funding game. Last I checked, India had ~50-100k H100 equivalents while frontier labs would have a million each.
Unless we have a paradigm shift in how AIs are trained, the conversation ought to be about the amount of funding available to do what we want to do. Show me an Indian company that's secured funding/compute in the same range as Chinese AI labs (let alone American labs).
Without compute, what will happen is what has happened before: we'd promise to shake the world and then build models that are a year or two behind the top ones.
The path forward I see for sovereign models is to invest in basic R&D so we have a chance to go beyond the current paradigm, OR for the government to pool in several orders of magnitude more compute and seriously commit to competing at par.
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees.
The net effect of
People should get smarter at a rate sufficient to integrate their old experiences, but not so much smarter so fast that they can't integrate their new intelligence. Being smarter means you get bored faster, but you can also tackle new challenges you couldn't understand before.
Register files used to be the ultimate bottleneck for Tensor Core accumulators. Introducing Blackwell’s Tensor Memory (TMEM), a completely new address space inside the SM that isolates the accumulator entirely from the register file.
"What I cannot create, I do not understand."
Introducing: The Feynman GPU Lectures.
Your H100s and B200s are running at a fraction of their peak utilization because your custom kernels are written with massive hardware bottlenecks. If you don't know what tcgen05.mma does at the wire level, you're lighting compute efficiency on fire.
SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.
The potential speed improvement vs JAX for large training runs is
778 Followers 954 Following
Trying to figure out how AI works 🔍🧠
Currently at @ETH Zurich, previously @EPFL 🇨🇭
LLMs, interpretability, emergence, grokking 🤖
5K Followers 4K Following
Dell + AMD Exclusive AI GPU Cloud. 3 years early choosing AMD. Fully automated platform. Need capital to continue to innovate.
[email protected]
2K Followers 127 Following
Meetup about databases, distributed systems, dataflow systems, and compilers, in Bengaluru!
CfP for Meetup @ Zerodha on April 25th is now open!
Link in the bio!
1.4M Followers 2 Following
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant @claudeai on https://t.co/FhDI3KQh0n.
16K Followers 199 Following
Large Model Systems Organization: Join our Slack: https://t.co/vzYOTP4w6C. We developed SGLang https://t.co/OjwQadINKU, Chatbot Arena (now @arena), and Vicuna!
2K Followers 290 Following
Professional bit shover-arounder.
Does unspeakable things with SIMD units.
Computers aren't smart, they are just dumb faster.