Seroze @stringray3

Obsession is far greater than Discipline
Joined May 2018

Tweets

3K
Followers

116
Following

3K
Likes

30K

romit @romitjain_

a week ago

Sharing some learnings from attending MLSys '26. There were a lot of interesting papers presented on distributed training and inference. Overall, I could capture the following themes:

1. Distributed training has a lot of knobs, which are really tough to manage and tune. A ton of work is being done to make them easier to manage.
2. As training gets larger, reliability matters more. It was not surprising to see many industry talks focus on training reliability.
3. Ultra-long context lengths are getting huge mindshare for both training and inference.
4. Heterogeneous compute (multi-region, multi-accelerator) is on the rise and is probably the next frontier of inference optimization.
5. Distributed inference still needs better auto-tuning for finding the best configs at large scale.
6. KV cache optimization, attention optimization, and quantization were already on the radar, so the number of papers on these topics was not a surprise.
7. IMO, from a skills perspective, the best thing to learn is GPU communication and networking. Learn everything around inter- and intra-rack communication, NCCL, and UCCL. Lots of improvements in the coming years will come from optimizing communication between GPUs via better kernels and frameworks.

For a list of interesting papers and their summaries: r0m1t.com/learnings-from…

4 31 291 15K 326


yuwen lu @yuwen_lu_

a week ago

Halfway through it, and damn, how come nobody ever told me RL is this much fun

Sanbu 散步 @sanbuphy

a month ago

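I spent some time writing an RL tutorial, Hands-On Modern RL. The path starts with CartPole + PPO, then moves on to LLM post-training (RLHF, DPO, GRPO) and Agentic RL. Code comes first; formulas are used to explain what you observe. The English version will be updated soon. This is currently a draft, and the RLHF and Agentic RL parts are under local review. PRs, Issues, and GPU support are welcome: github.com/walkinglabs/ha…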

51 343 2K 330K 3K

41 66 679 141K 896


Chuck Tang @j316chuck

a week ago

Had such a blast working with @erictang000, @charlie_ruan, @sumanthhegde, and @pcmoritz on enabling multi-LoRA RL training in SkyRL! We observed ~3x higher experiment throughput compared to running experiments in the traditional single-tenant fashion. One of my favorite parts of this collaboration is that all this code is open source, so you can play with it yourself :) Here's the technical deep dive 🧵

Trajectory @trajectorylabs

a week ago

🏹5 Days of Trajectory. Day 3 - An Open Source Training Stack for Continual Learning Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are

11 62 511 93K 395

14 20 180 30K 142


Vuk Rosić 武克 @VukRosic99

a week ago

I broke my record on the LLM I'm training by switching from constant learning rate to warm-up + decay (scheduled) learning rate.

Learning Rate Scheduling - Beginner Tutorial + Record LLM Speedrun
Full video - youtu.be/PvLWxUobSoo

At the start of training, the weights are far from optimal, so we use a high learning rate to make large updates and learn quickly. Later, as the model approaches a good solution, we reduce the learning rate to make smaller, more precise adjustments and avoid overshooting the optimum.

Speedrun our LLM - github.com/vukrosic/unive…
Become AI researcher: skool.com/become-ai-rese… (funds our compute)

0 1 9 384 2
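
A minimal sketch of the warm-up + decay idea from the post above, not the author's actual code. The post does not say which decay shape was used, so cosine decay is an assumption here, and the function name `lr_at_step` and all default values (`peak_lr`, `warmup_steps`, `min_lr`) are hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=3e-5):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr.

    All defaults are illustrative assumptions, not values from the post.
    """
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps=1000))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step; a constant schedule corresponds to always returning `peak_lr`.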


Jaydev @JaydevTonde

a week ago

One more checkmark and I feel more excited about the upcoming work in building my LLM inference server. So far, it feels great implementing core techniques like separate prefill and decode, KV cache, prefill caching, etc. The upcoming things are more interesting: Torch Compile, CUDA Graphs, SD, Quantization, and Distributed inference. Since I know these theoretically, implementing them one by one will be fun. I recently completed my study on prefix caching, which involves block hash-based and radix tree-based approaches. I have also run some benchmarks with vLLM and SGLang and will make them public soon.

1 2 14 573 4
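
A toy sketch of the block hash-based flavor of prefix caching mentioned in the post above. It is not taken from vLLM, SGLang, or the author's server: the class `PrefixCache`, the function `block_hashes`, the block size of 4, and the string placeholders standing in for KV blocks are all hypothetical. The one idea it tries to show is that each block's hash is chained with its parent's hash, so a hash hit implies the whole prefix up to that block matches.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Hash every full block, chaining in the previous block's hash."""
    hashes = []
    parent = b""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self, block_size=4):
        self.block_size = block_size  # tokens per block (illustrative value)
        self.blocks = {}  # chained hash -> placeholder for a cached KV block

    def insert(self, token_ids):
        """Register all full blocks of a processed sequence."""
        for h in block_hashes(token_ids, self.block_size):
            self.blocks.setdefault(h, f"kv-block-{len(self.blocks)}")

    def match(self, token_ids):
        """Return how many leading tokens can reuse cached KV blocks."""
        matched = 0
        for h in block_hashes(token_ids, self.block_size):
            if h not in self.blocks:
                break
            matched += self.block_size
        return matched


if __name__ == "__main__":
    cache = PrefixCache()
    cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9])
    # Shares the first two full blocks with the cached sequence.
    print(cache.match([1, 2, 3, 4, 5, 6, 7, 8, 100, 101, 102, 103]))
```

A radix tree-based design answers the same "longest cached prefix" question by walking a tree keyed on token sequences instead of looking up chained hashes.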


Deep-ML @real_deep_ml

a week ago

We just launched a new project that teaches you how to build Flash Attention with CUDA, step by step. By the end, you’ll have a working Flash Attention kernel built from the ground up.

The project covers:
- CUDA primitives warm-up
- Matrix operations
- Naive attention baseline
- Online softmax math
- Tiled attention building blocks
- Fused Flash Attention kernel
- Causal Flash Attention

It will be open to everyone for the first 2 weeks, then it will become part of our premium projects.

18 108 1K 47K 1K
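
The "online softmax math" item in the list above can be sketched in plain Python. This is an illustration of the general technique, not material from the Deep-ML project: the function names are hypothetical, and values are scalars rather than vectors to keep it short. It computes a softmax-weighted sum in a single pass by keeping a running maximum and rescaling the accumulated sums whenever that maximum grows.

```python
import math


def online_softmax_weighted_sum(scores, values):
    """Single pass: sum_i softmax(scores)_i * values_i with running rescaling."""
    m = float("-inf")  # running maximum of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescales old sums; 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / l


def two_pass_reference(scores, values):
    """Conventional softmax: find the max first, then normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


if __name__ == "__main__":
    scores = [1.0, 3.0, 2.0]
    values = [10.0, 20.0, 30.0]
    print(online_softmax_weighted_sum(scores, values))
    print(two_pass_reference(scores, values))
```

Because the running state is just `(m, l, acc)`, the scores can be consumed tile by tile without ever materializing the full row, which is the property a tiled attention kernel relies on.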


Seroze @stringray3

2 weeks ago

github.com/seroze/leetgpu… working on learning cuda

0 0 0 3 0


Jim Huang @jserv

4 weeks ago

I released MazuNIX on Mazu’s birthday. Unlike many educational operating systems that avoid SMP and real-time (RT) topics, Mazu delivers SMP, multicore RT scheduling, and practical POSIX Threads support. Full source code is available: github.com/MazuNIX/mazu

1 13 41 2K 6


Seroze @stringray3

a month ago

github.com/databases-sero… Btree sql engine implementation in python

0 0 0 22 0


Seroze @stringray3

a month ago

github.com/databases-sero…

0 0 0 15 0


andy @1a1n1d1y

2 months ago

x.com/i/article/2037…

49 69 827 108K 865


Gabriele Berton @gabriberton

2 months ago

I have joined @GoogleDeepMind!

I'll be training VLMs

And I'll still keep posting about latest developments on AI, Computer Vision and LLMs

So no more posts on PyTorch tricks. I might post about JAX. Stay tuned...

122 66 4K 146K 488


0xSero @0xSero

3 months ago

In 72 hours I got over 100k of value:

1. Lambda gave me 5000$ credits in compute
2. Nvidia offered me 8x H100s on the cloud (20$/h), idk for how long, but assuming 2 weeks that'd be 5000$~
3. TNG technology offered me 2 weeks of B200s, which is something like 12000$ in compute
4. A kind person offered me 100k in GCP credits (enough to train a 27B if you do it right)
5. Framework offered to mail me a desktop computer
6. We got 14,000$ in donations, which will go to buying 2x RTX Pro 6000s (bringing me up to 384GB VRAM)
7. I got over 6M impressions, which based on my RPM would be 1500$ over my 500$~ usual per pay period
8. I have gained 17,000~ followers, more than doubling my follower count
9. 17 subscribers on X + 700 on youtube.

The total value of all this approaches at minimum 50,000$~ and closer to 150,000$ if I leverage it all.

---------------------

What I'll be doing with all this:

Eric is an incredibly driven researcher I have been bouncing ideas off of over the last month. He and I have been tackling the idea of getting massive models to fit on relatively cheap memory.

The idea is taking advantage of different forms of memory, in combination with expert saliency scoring, to offload specific expert groupings to different memory tiers. For the MoEs I've tested over my entire AI session history, about 37.5% of the model is responsible for 95% of token routing. So we can offload 62.5% of an LLM onto SSD/NVMe/CPU/cheap VRAM. This should theoretically add minimal latency if we can select the right experts.

We can combine this with paged swapping to further accelerate prompt processing. If done right, we are looking at very, very decent performance for massive unquantised & unpruned LLMs. You can get DeepSeek-v3.2-speciale at full intelligence with decent tokens/s as long as you have enough VRAM to host the core 20-40% of the model and enough RAM or SSD to host the rest.

Add quantisation to the mix and you can basically have decent speeds and intelligence with just 5-10% of the model's size in VRAM (+ you need some for context).

The funds will be used to push this to its limits.

---------------------

There's also tons of research showing that you can quantise a model drastically, then distill from the original BF16 or make a LoRA to mostly align it back to the original. This will be added to the pipeline too.

---------------------

All this will be built out here: github.com/0xSero/moe-com…

You will be able to take any MoE, shove it in here, and compress it down with only 24GB and enough RAM/NVMe. It'll be slow as hell, but it will work with little tinkering.

---------------------

Lastly, I will be looking into either a full training run from scratch, or just post-training on an open AMERICAN base model:
- a research model
- an openclaw/nanoclaw/hermes model
- a browser-use model

To prove that this can be done.

---------------------

I will be bad at all of it, and doubt I will get beyond the best small models from 6 months ago, but I want to prove it's no boogeyman impossible task to everyone who says otherwise.

---------------------

By the end of the year:
1. I will have 1 model I trained in some capacity in the top 5 at either pinchbench, browseruse, or research.
2. My github will have a master repo which combines all my work into reusable generalised scripts to help you do the same.
3. The largest public comparative dataset for all MoE quantisations, prunes, benchmarks, costs, hardware requirements.

---------------------

A lot of this will be led by Eric, who I will tag in the next post.

I want to say thank you to everyone who has supported me. I have gotten a lot of comments stating:
1. I'm crazy, stupid, or both
2. I'm wasting my time, no one cares about this
3. This is not a real issue

I believe the amount of interest and support I've received says it all.

donate.sybilsolutions.ai
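
The "37.5% of the model is responsible for 95% of token routing" observation in the post above suggests a simple selection step, sketched here under loud assumptions. The post does not define its expert saliency score, so plain routing frequency is used as a stand-in; the function name `experts_for_coverage` and the routing counts are made up for illustration and are not measurements from the post or from the linked repo.

```python
def experts_for_coverage(routing_counts, coverage=0.95):
    """Smallest set of experts, taken by descending routing count,
    whose combined share of routed tokens reaches `coverage`."""
    total = sum(routing_counts.values())
    if total == 0:
        return []
    hot = []
    covered = 0
    ranked = sorted(routing_counts.items(), key=lambda kv: kv[1], reverse=True)
    for expert_id, count in ranked:
        if covered >= coverage * total:
            break
        hot.append(expert_id)
        covered += count
    return hot


if __name__ == "__main__":
    # Toy numbers (hypothetical): tokens routed to each of 8 experts.
    counts = {0: 500, 1: 300, 2: 160, 3: 20, 4: 10, 5: 5, 6: 3, 7: 2}
    hot = experts_for_coverage(counts)
    cold = [e for e in counts if e not in hot]
    print("keep in fast memory:", hot)
    print("offload to slower tiers:", cold)
    print("fraction kept:", len(hot) / len(counts))
```

With these toy counts, 3 of 8 experts cover at least 95% of routed tokens; the remaining experts are the candidates for the slower memory tiers the post describes.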