Alex Peng @alexpeng

posttraining @xai, formerly @cognition, @windsurf, @stanford
alexpeng.me
San Francisco, CA
Joined August 2009

Tweets: 92
Followers: 529
Following: 524
Likes: 414

Alex Peng @alexpeng

2 weeks ago

@ShashwatGoel7 @dwarkesh_sp Have you considered using FutureSim/evals like it where you predict unseen events as a training data pipeline? ie RLing on rejection sampled traces where you filter for rollouts with the most surprise/miscalibration, then reevaluate with fresh events

1 0 3 112 0

Alex Peng @alexpeng

3 weeks ago

@ShashwatGoel7 This is such a cool benchmark! Great work

1 0 3 174 0

Alex Peng @alexpeng

3 weeks ago

I worked on this! It was a true end to end effort experimenting with data mixes, training at scale, and identifying and climbing the right hills to create this model

Yiwen Yuan @yiwenyuan98

3 weeks ago

Proud to see Grok 4.3 doing well on @Designarena’s Agentic Slides Arena. Surprising to see the power of a small model! I co-lead our Office Agent effort for Grok4.3, and it’s been lots of fun building this with the team.

22 14 285 30K 22

4 3 119 7K 4

Alex Peng @alexpeng

4 weeks ago

@stephenx_ So sick

0 0 6 498 0

Alex Peng @alexpeng

4 weeks ago

@matthieuschulz @deviationcap @windsurf @cognition Congrats @matthieuschulz!

1 0 1 61 0

Alex Peng @alexpeng

a month ago

@chenningli1117 🐐

0 0 2 46 0

Alex Peng @alexpeng

a month ago

So grateful for the chance to contribute and learn from the team across posttraining, data, and evals for this model release. More to come!

Artificial Analysis @ArtificialAnlys

a month ago

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20

The release of Grok 4.3 places @xai just above Muse Spark and Claude Sonnet 4.6 on the

130 248 2K 2.1M 372

0 1 5 378 0

Alex Peng @alexpeng

a month ago

@si_pbc @sonyatweetybird @MikowaiA @YasminRazavi @karpathy @tszzl @_milankovac_ Congrats on the raise guys!

0 0 0 170 0

Elon Musk @elonmusk

3 months ago

@UziObi Coming soon

1K 478 20K 904K 292

OpenAI Developers @OpenAIDevs

3 months ago

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. openai.com/index/why-we-n…

95 131 1K 241K 395

david rein @idavidrein

3 months ago

Seems like a lot of people are taking this as gospel—when we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we're using here was just a tiny bit different, we could've measured a time horizon of 8 hours, or 20 hours.

METR @METR_Evals

3 months ago

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.

227 451 4K 3.5M 1K

33 54 634 72K 82

Alex Peng @alexpeng

4 months ago

Ad Astra 🚀

SpaceX @SpaceX

4 months ago

SpaceX has acquired xAI, forming one of the most ambitious, vertically integrated innovation engines on (and off) Earth → spacex.com/updates#xai-jo…

4K 8K 45K 19.3M 3K

0 0 1 319 0

Alex Peng @alexpeng

4 months ago

cool :)

swyx @swyx

4 months ago

so after 24 hours we tallied early returns (from people koding on Saturdays mind you): @xai Grok is currently #3 coding model in the world by early voters (after 1 day and thousands of full agent votes). its really interesting to see the order shaken up, and there’s a reason

23 7 129 26K 60

1 0 3 3K 3

Alex Peng @alexpeng

4 months ago

@TheGregYang Take care Greg - office won’t be the same without being the LED sign guy every time candidates come around ❤️

0 0 3 673 0

Alex Peng @alexpeng

5 months ago

Doesn’t seem beneficial for frontier labs to facilitate their own commoditization by improving along this dimension too (see: opus 4.5 in CC vibes vs in other harnesses)

1 0 3 328 1

Alex Peng @alexpeng

5 months ago

How are people thinking about harness/scaffold resilient coding models? Seems implausible to me that it will be important in the long term as labs develop their own harnesses and buyers consolidate around those products (codex, claude code etc).

1 0 8 1K 1

Alex Peng @alexpeng

5 months ago

Minimax casually dropping that they’re training world models that solve the halting problem to make better coding models

MiniMax (official) @MiniMax_AI

5 months ago

x.com/i/article/2007…

8 39 395 493K 235

0 0 0 347 0

Alex Peng @alexpeng

5 months ago

I led eval process + deployment for frontier coding agents at multiple F500 companies at @windsurf then @cognition

The only eval enterprises care about is if the deployment saves them money (time/headcount)

Upstream of that eval metrics are extremely rudimentary. mostly still just acceptance rate, PR cycle review time improvements, etc. Qualitative feedback from principal/staff engs dominates.

Working on code evals @xai now and the difference between what SWE agent/coding teams in frontier labs care about vs the median enterprise buyer is huge