Jinesis Lab led by Prof @ZhijingJin at @UofTCompSci @VectorInst conducts frontier research on Responsible AI, LLMs, and Causality.
zhijing-jin.com
Joined December 2025
The AI Act, the EU's first AI law, has just been reinforced.
Two new bodies will help apply the rules across Europe:
✅ Scientific Panel
✅ Advisory Forum
Independent experts. 2-year terms.
One mission: making AI work for Europe.
🔗 link.europa.eu/8nvpvY
🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation.
Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem.
We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific model-benchmark pairing, which accounts for 74.9% of variance. Recognition rarely leads to behavioral change, and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones.
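A minimal sketch of how a "share of variance from the pairing" figure could be computed, assuming a models × benchmarks table of recognition rates and a standard two-way sums-of-squares decomposition without replication. The post does not say which decomposition produced the 74.9% figure, so the function name `variance_shares` and the toy matrices are illustrative assumptions, not the authors' method or data.

```python
import numpy as np

def variance_shares(rates):
    """Split variance of a (models x benchmarks) table into main effects and pairing.

    Hypothetical helper: rows are models, columns are benchmarks, and each
    cell is a recognition rate. Returns the fraction of total sum of squares
    attributable to the model, the benchmark, and the model-benchmark pairing.
    """
    rates = np.asarray(rates, dtype=float)
    n_models, n_benchmarks = rates.shape
    grand = rates.mean()
    model_eff = rates.mean(axis=1, keepdims=True) - grand  # per-row offset
    bench_eff = rates.mean(axis=0, keepdims=True) - grand  # per-column offset
    pairing = rates - grand - model_eff - bench_eff        # what main effects miss

    ss_total = ((rates - grand) ** 2).sum()
    if ss_total == 0:
        raise ValueError("all cells are equal; variance shares are undefined")
    return {
        "model": n_benchmarks * (model_eff ** 2).sum() / ss_total,
        "benchmark": n_models * (bench_eff ** 2).sum() / ss_total,
        "pairing": (pairing ** 2).sum() / ss_total,
    }

# Toy tables (synthetic, for illustration only).
additive = [[0.1, 0.2], [0.3, 0.4]]  # fully explained by model + benchmark
crossed = [[1.0, 0.0], [0.0, 1.0]]   # explained only by the pairing
print(variance_shares(additive))
print(variance_shares(crossed))
```

In this balanced setup the three shares sum to one, so a large "pairing" share means neither the model alone nor the benchmark alone predicts recognition well.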
A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence, while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone.
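The configuration count can be sanity-checked with a short sketch. It assumes each of the eight factors is a binary on/off toggle and that each of the 100 pairs contributes two tasks (one safety, one capability); the post does not spell out this arithmetic, so treat it as one reading that happens to match the stated total. The factor slots are left unnamed because the post names only some of them.

```python
from itertools import product

N_FACTORS = 8        # trigger factors, assumed binary (on/off)
N_PAIRS = 100        # paired safety-capability tasks
TASKS_PER_PAIR = 2   # assumption: one safety task and one capability task per pair

def enumerate_toggles(n_factors=N_FACTORS):
    """Return every on/off assignment over n_factors independent toggles."""
    return list(product((False, True), repeat=n_factors))

toggles = enumerate_toggles()
total_configs = len(toggles) * N_PAIRS * TASKS_PER_PAIR
print(len(toggles), total_configs)  # 256 51200
```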
Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
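A minimal sketch of the two proposed reporting numbers, assuming each sample carries a boolean awareness flag (for example from chain-of-thought monitoring) and a numeric score. The function name `awareness_report` and the sign convention (aware minus unaware) are assumptions made here for illustration; the post only says the tax measures the gap between the two groups.

```python
from statistics import mean

def awareness_report(samples):
    """Compute (evaluation-awareness rate, awareness tax) for one benchmark run.

    samples: iterable of (aware: bool, score: float) pairs (hypothetical format).
    The tax is mean(score | aware) - mean(score | unaware), or None when one
    of the two groups is empty and the gap cannot be measured.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    aware = [score for flag, score in samples if flag]
    unaware = [score for flag, score in samples if not flag]
    rate = len(aware) / len(samples)
    tax = mean(aware) - mean(unaware) if aware and unaware else None
    return rate, tax

# Synthetic example: two aware samples, two unaware samples.
rate, tax = awareness_report([(True, 1.0), (True, 0.0), (False, 1.0), (False, 1.0)])
print(rate, tax)  # 0.5 -0.5
```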
We're thrilled to share that our 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026 has received 534 submissions and they will be reviewed by an incredible pool of 230 reviewers!
What happens when you put #LLM agents in a room and ask them to cooperate?
They collapse. They free-ride. They form social networks.
We spent 2+ years building a full research series on Multi-Agent LLM Safety. Here's a 50-min talk covering all of it: 🔗 youtube.com/watch?v=1MxpYJ…
Sharing ACL 2024 Best Paper Winner, "Causal Estimation of Memorisation Profiles"!
LMs can reproduce training data verbatim, but measuring this "causally" (what would happen if the model never saw the data?) is hard. This paper fills the gap. link: aclanthology.org/2024.acl-long.…
1/n
10 days left to submit to the 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026! @icmlconf
We're giving out multiple awards and travel funds sponsored by @schmidtsciences and @coop_ai:
🏆 Best Paper Awards (including targeted prizes for cooperative AI theme)
🏆 Top Reviewer Awards
✈️ Travel Funds
Submit here → openreview.net/group?id=ICML.…
⏰ Deadline: May 3, 2026 (AoE)
📌 Notification: May 18, 2026
(We extended our deadline to accommodate more submissions!)
Join us in Seoul for discussions bridging AI safety, social good, and governance with keynote speakers @Yoshua_Bengio, @OanaIgnatRo, @jzl86, @maksym_andr, and more!
All Papers for our Multi-Agent LLMs Work
Topic 1: Emergent Behavior Analysis
🌍 (NeurIPS 2024) "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents". arxiv.org/abs/2404.16698
🎮 (Preprint 2026) "GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory". arxiv.org/abs/2602.12316
⚖️ (Preprint 2025) "When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas". arxiv.org/abs/2505.19212
Topic 2: Governance & Regulation
⚙️ (COLM 2025, Best Oral Paper @ REALM ACL 2025) "Corrupted by Reasoning: Reasoning LLMs Become Free-Riders in Public Goods Games". arxiv.org/abs/2506.23276
🤝 (Preprint 2026) "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas". tinyurl.com/coopeval-pdf
🗳️ (Preprint 2026) "Evaluating Cooperation in LLM Social Groups through Elected Self-Organizing Leadership". tinyurl.com/agent-elect-pdf
Topic 3: Dynamics in Agent-to-Agent Interactions
🧠 (EMNLP 2025) "Testing Interlocutor Awareness among LLMs: Agent-to-Agent Theory of Mind". arxiv.org/abs/2506.22957
📊 (EACL 2026) "CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures". arxiv.org/abs/2508.11915
Topic 4: Moral Evaluation of LLMs
🏆 (Best Paper @ NeurIPS 2024 WS; Spotlight @ ICLR 2024) "Language Model Alignment in Multilingual Trolley Problems". arxiv.org/abs/2407.02273
🧭 (EMNLP 2025) "Are Language Models Consequentialist or Deontological Moral Reasoners?". arxiv.org/abs/2505.21479
⚖️ (Preprint 2025) "When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas". arxiv.org/abs/2505.19212
🎙️ Happy to share @ZhijingJin's talk on "Emergent AI Safety Risks in Multi-Agent #LLMs" at the @SRI_UofT Seminar Series on:
Will multi-agent LLMs coordinate for social good, or exploit rivals in ways that put humans at serious risk? 🧵
📹 youtube.com/watch?v=1MxpYJ…
🔍 Key findings: reasoning agents with sophisticated thinking often fail to sustain cooperation in a multitude of settings, and surprisingly, stronger reasoning capabilities often make models more prone to selfish strategies like free riding.
But interventions such as mediation by a neutral agent and agent-to-agent commitment protocols show a promising path towards the Pareto frontier ✨
Thanks @SRI_UofT for the invitation and for hosting such a great seminar series!
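To make the free-riding incentive from the key findings concrete, here is a minimal sketch of a standard linear public goods game. The endowment, multiplier, and group size are illustrative assumptions, not the settings used in the papers or the talk.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Payoffs in a one-shot linear public goods game (textbook form).

    Each player keeps (endowment - contribution); the pooled contributions are
    scaled by `multiplier` and split equally among all players. All parameter
    values here are hypothetical.
    """
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must lie in [0, endowment]")
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

print(public_goods_payoffs([10, 10, 10, 10]))  # [20.0, 20.0, 20.0, 20.0]
print(public_goods_payoffs([0, 10, 10, 10]))   # [25.0, 15.0, 15.0, 15.0]
print(public_goods_payoffs([0, 0, 0, 0]))      # [10.0, 10.0, 10.0, 10.0]
```

With a multiplier below the group size, withholding always pays off individually (25 vs 20 here), yet universal withholding leaves everyone worse off than full cooperation (10 vs 20). That gap is the tension a payoff-maximizing agent can exploit.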
What is the roadmap for NLP to actually help the world? 🌍
Thrilled to share our NLP for Social Good survey across nine domains, from healthcare and education to poverty, peacebuilding, and environmental protection. We analyze ACL Anthology trends and find that poverty, peacebuilding, and environmental protection remain underexplored.
A call for cross-disciplinary partnerships and human-centered NLP, with 30+ authors!
📄 aclanthology.org/2026.eacl-long… #NLP4SG #EACL2026 #AI #ResponsibleAI
How robust are LLM routers, really? 🔀
We find that preference-based routers rely on category heuristics, not query complexity. They route ALL coding and math queries to the strongest LLM even when simpler models suffice, while sending jailbreaking attempts to weaker models, elevating safety risks! 🚨
We introduce the DSC benchmark: Diverse, Simple, and Categorized, evaluating routers across coding, math, translation, privacy, safety, and more.
📄 aclanthology.org/2026.eacl-long… #EACL2026 #AISafety #LLMs #NLP
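One way to quantify the category-heuristic behavior described above is to measure, per category, how often simple queries still go to the strongest model. This is a hedged sketch: the record format, the `strong_share_on_simple` name, and the toy log are assumptions for illustration and are not the DSC benchmark's actual interface or data.

```python
from collections import defaultdict

def strong_share_on_simple(records, strong="strong"):
    """Per-category fraction of simple queries routed to the strongest model.

    records: iterable of (category, is_simple, routed_to) tuples (hypothetical
    format). A value near 1.0 for a category suggests the router keys on the
    category rather than on how hard the query is.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [sent_to_strong, total_simple]
    for category, is_simple, routed_to in records:
        if is_simple:
            counts[category][1] += 1
            counts[category][0] += int(routed_to == strong)
    return {cat: sent / total for cat, (sent, total) in counts.items()}

# Synthetic routing log, for illustration only.
log = [
    ("math", True, "strong"),
    ("math", True, "strong"),
    ("translation", True, "weak"),
    ("translation", True, "strong"),
    ("math", False, "strong"),  # hard query: ignored by this metric
]
print(strong_share_on_simple(log))  # {'math': 1.0, 'translation': 0.5}
```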