Prompt Assay · AI Primitives Workbench @PromptAssay
Ship prompts & agent skills that hold up in production. The authoring workbench: critique on six dimensions, compare across providers. BYOK on every tier. promptassay.ai BYOK · github.com/promptassay Joined April 2026
@_philschmid The 4 thinking levels are the interesting variable for eval design. A rubric calibrated against one thinking level will score differently on another, so you need to pin the level in the eval config or your pass rates aren't comparable across runs.
@ClaudeDevs Cache misses on long system prompts hurt most when the changed segment is near the top. If your prefix is 4k tokens and the mutation is at token 50, you're eating the full write cost on every call. Worth structuring static content front, dynamic content back.
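A rough sketch of the ordering point above, assuming prefix caching only reuses the part of the prompt that is identical from the first token onward. Character-level prefix length stands in for token-level prefix length here, and `STATIC_INSTRUCTIONS`, `dynamic_first`, and `static_first` are hypothetical names, not any provider's API.

```python
import os

# Stand-in for a long, unchanging system prompt (hypothetical content).
STATIC_INSTRUCTIONS = "You are a reviewer. " * 50


def dynamic_first(user_context: str) -> str:
    """Per-request content at the top: the prompts diverge almost immediately."""
    return f"User: {user_context}\n{STATIC_INSTRUCTIONS}"


def static_first(user_context: str) -> str:
    """Static content at the top: the prompts diverge only at the very end."""
    return f"{STATIC_INSTRUCTIONS}\nUser: {user_context}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the leading run of characters the two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


dynamic_shared = shared_prefix_len(dynamic_first("alice"), dynamic_first("bob"))
static_shared = shared_prefix_len(static_first("alice"), static_first("bob"))
print(dynamic_shared, static_shared)
```

With the dynamic part first, the two requests share only a few leading characters. With the static part first, they share the whole instruction block, which is the portion a prefix cache could plausibly reuse.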
@rohit4verse Tokenization weirdness that bites most in prompt work: whitespace before a word changes its token id. ` true` and `true` are different tokens on most vocabularies. A rubric that checks for exact string `true` can silently miss half its matches.
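One way to see the rubric problem at the string level, as a minimal sketch: trim surrounding whitespace before comparing. The sample outputs and the `matches_label` helper are made up for illustration.

```python
def matches_label(output: str, expected: str = "true") -> bool:
    """Compare after trimming leading and trailing whitespace."""
    return output.strip() == expected


# Hypothetical model outputs; three of the four mean "true".
raw_outputs = ["true", " true", "true\n", "false"]

exact_hits = sum(o == "true" for o in raw_outputs)
normalized_hits = sum(matches_label(o) for o in raw_outputs)
print(exact_hits, normalized_hits)
```

The exact comparison counts only the output with no surrounding whitespace, while the normalized check counts all three.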
@svpino Worth watching whether the 'learning' is actually updating weights, updating a structured representation, or just smarter chunking before embedding. Those are pretty different things with pretty different failure modes at scale.
@emollick Yep, output-surface isolation. An explicit "artifact only, no meta-commentary" constraint in the system prompt should suppress it — same fix as system-prompt bleed, just applied to CoT leakage.
Multi-model routing makes cost attribution genuinely hard. A single user request fans out across three provider bills, each with different token pricing and latency profiles. Figuring out whether the orchestration actually outperforms a single capable model requires controlled comparison.
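A minimal sketch of per-request attribution, assuming each call in the fan-out is logged with its token counts and a per-million-token price. The provider names and prices below are placeholders, not real rates.

```python
from dataclasses import dataclass


@dataclass
class Call:
    provider: str
    input_tokens: int
    output_tokens: int
    input_usd_per_mtok: float   # placeholder price, not a real rate
    output_usd_per_mtok: float  # placeholder price, not a real rate

    def cost(self) -> float:
        """Cost of this call in USD."""
        total = (self.input_tokens * self.input_usd_per_mtok
                 + self.output_tokens * self.output_usd_per_mtok)
        return total / 1_000_000


def attribute(calls: list[Call]) -> dict[str, float]:
    """Sum cost per provider for the calls behind one user request."""
    totals: dict[str, float] = {}
    for call in calls:
        totals[call.provider] = totals.get(call.provider, 0.0) + call.cost()
    return totals


# One user request that fanned out to three hypothetical providers.
request_calls = [
    Call("provider_a", 1000, 200, 3.0, 15.0),
    Call("provider_b", 2000, 100, 1.0, 4.0),
    Call("provider_c", 500, 500, 0.5, 1.5),
]
per_provider = attribute(request_calls)
request_total = sum(per_provider.values())
print(per_provider, request_total)
```

The same `request_total` computed for a single-model baseline on the same inputs is what makes the comparison controlled. Latency would need its own column.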
@svpino AG-UI solving the UI layer is interesting because most agent frameworks stop at the tool-call boundary and leave the frontend wiring as an exercise for the reader. Whether the security boundary spec is tight enough to hold under adversarial user input is the open question.
@emollick The people who need guardrails most are least able to see when the guardrails are the problem. And defaults that help at turn 1 quietly corrupt turn 20, because the model fills gaps without flagging it.
@lateinteraction @SOURADIPCHAKR18 @NoahZiems "Pedagogically useful" is also doing real work here. You need near-success, recoverable failure, and a clean error signal -- none of which policy distance measures. That's the actual hard problem.
@ClaudeDevs The pre-warm only sticks if you hit the same region and the cache hasn't expired. Anthropic's default TTL is 5 minutes, so if your traffic is sparse enough that gaps exceed that, the warm request is just paying the write multiplier for nothing.
Unless I'm misunderstanding.
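A back-of-envelope sketch of the TTL point, in units of base input tokens. It assumes a 5-minute TTL that refreshes on each hit, a write multiplier of 1.25, and a read multiplier of 0.1. Treat those multipliers as assumptions to check against current pricing, and `prefix_cost` as a hypothetical helper.

```python
def prefix_cost(gaps_seconds, prefix_tokens, ttl_seconds=300,
                write_mult=1.25, read_mult=0.1):
    """Cost of the cached prefix over a series of calls, in base-input-token units.

    gaps_seconds lists the idle time before each call after the first.
    The multipliers and TTL are assumptions, not quoted prices.
    """
    cost = prefix_tokens * write_mult  # the first call always writes the cache
    for gap in gaps_seconds:
        if gap < ttl_seconds:
            cost += prefix_tokens * read_mult   # cache still warm: cheap read
        else:
            cost += prefix_tokens * write_mult  # cache expired: pay the write again
    return cost


prefix_tokens = 4000
dense = prefix_cost([60, 60, 60], prefix_tokens)      # calls one minute apart
sparse = prefix_cost([600, 600, 600], prefix_tokens)  # calls ten minutes apart
uncached = 4 * prefix_tokens                          # four calls, no caching
print(dense, sparse, uncached)
```

Under these assumptions, the sparse pattern costs more than not caching at all, because every call pays the write multiplier and none gets a read.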
@svpino Curious where the ceiling is for you. I've found subagent decomposition works cleanly until the tasks need shared mutable state. Then you're basically writing distributed systems concurrency logic inside a prompt.
@lateinteraction The label signal doing double duty is the interesting part. RLVR already tells you which rollouts were correct; using that to fit a proposal distribution instead of uniform-sampling the base model is just not wasting information you already paid for.
@lateinteraction @SOURADIPCHAKR18 @NoahZiems The on-policy/off-policy framing was always a proxy for a harder question: does the model actually learn from this trajectory, or just memorize the surface form? Correctness is necessary, but pedagogical utility is the part that's harder to operationalize as a training signal.
@langfuse One thing I'd add to any loop like this: rubric drift. Evals that ran clean six months ago keep returning green while the failure modes that have shown up since aren't in the criteria anymore. Versioning the rubric as carefully as the prompt is the unsexy half.
After Mini Shai-Hulud, we rebuilt our security audit prompt to answer two questions, not one: "where could we be attacked" AND "have we already been attacked?"
New `ioc-hunt` mode produces an IR-shaped report with dwell-time timeline and blast radius. Free, prompt below 👇