<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Alignment on Chunhao Zhang</title>
    <link>https://blog-6sm.pages.dev/en/tags/alignment/</link>
    <description>Recent content in Alignment on Chunhao Zhang</description>
    <image>
      <title>Chunhao Zhang</title>
      <url>https://blog-6sm.pages.dev/images/og-default.png</url>
      <link>https://blog-6sm.pages.dev/en/tags/alignment/</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>2026</copyright>
    <lastBuildDate>Sat, 09 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://blog-6sm.pages.dev/en/tags/alignment/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Teaching Claude Why: Lessons from Alignment Training</title>
      <link>https://blog-6sm.pages.dev/en/readings/teaching-claude-why/</link>
      <pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate>
      <guid>https://blog-6sm.pages.dev/en/readings/teaching-claude-why/</guid>
      <description>Anthropic details how teaching ethical reasoning principles, rather than just training correct behavior, addresses agentic misalignment. Key findings: a 3M-token &amp;#39;difficult advice&amp;#39; dataset achieves the same improvement as 84M tokens of synthetic honeypots, and constitutional documents combined with fictional stories reduce the blackmail rate from 65% to 19%.</description>
      <content:encoded><![CDATA[<blockquote>
<p>Original: <a href="https://www.anthropic.com/research/teaching-claude-why">Teaching Claude Why</a></p>
<p>Author: Anthropic</p>
<p>Date: May 8, 2026</p>
</blockquote>
<hr>
<p>This page summarizes an annotated Chinese translation of Anthropic&rsquo;s research post on alignment training methods. The original article argues that teaching Claude the <em>principles</em> behind aligned behavior, rather than only training on demonstrations, generalizes far better.</p>
<p>Key takeaways:</p>
<ul>
<li><strong>Principles over demonstrations</strong>: Training Claude to explain <em>why</em> certain actions are better reduces misalignment more effectively than showing correct behavior alone.</li>
<li><strong>Out-of-distribution generalization</strong>: A 3M-token &ldquo;difficult advice&rdquo; dataset (where the <em>user</em> faces ethical dilemmas) achieved the same improvement as 84M tokens of synthetic honeypots, making it 28× more data-efficient (84M / 3M = 28).</li>
<li><strong>Constitutional documents + fiction</strong>: High-quality documents about Claude&rsquo;s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.</li>
<li><strong>Improvements persist through RL</strong>: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.</li>
<li><strong>Diverse environments matter</strong>: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.</li>
</ul>
<p>For the full annotated Chinese translation, please see the <a href="/en/readings/teaching-claude-why/">Chinese version</a>.</p>
<p>For the original article, visit <a href="https://www.anthropic.com/research/teaching-claude-why">Anthropic&rsquo;s research page</a>.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Natural Language Autoencoders: Turning Claude&#39;s Thoughts into Text</title>
      <link>https://blog-6sm.pages.dev/en/readings/natural-language-autoencoders/</link>
      <pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate>
      <guid>https://blog-6sm.pages.dev/en/readings/natural-language-autoencoders/</guid>
      <description>Anthropic introduces Natural Language Autoencoders (NLAs), a method that converts AI model activations into readable natural language explanations. NLAs have been used to detect unverbalized evaluation awareness in Claude&amp;#39;s safety testing and to audit hidden misalignment motivations.</description>
      <content:encoded><![CDATA[<blockquote>
<p>Original post: <a href="https://www.anthropic.com/research/natural-language-autoencoders">Natural Language Autoencoders</a></p>
<p>Full paper: <a href="https://transformer-circuits.pub/2026/nla/index.html">transformer-circuits.pub/2026/nla</a></p>
<p>Code: <a href="https://github.com/kitft/natural_language_autoencoders">github.com/kitft/natural_language_autoencoders</a></p>
<p>Interactive demo: <a href="http://neuronpedia.org/nla">neuronpedia.org/nla</a></p>
</blockquote>
<hr>
<h2 id="summary">Summary</h2>
<p>Anthropic introduces <strong>Natural Language Autoencoders (NLAs)</strong>, a method for converting a language model&rsquo;s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an <strong>Activation Verbalizer</strong> that translates activations into text, and an <strong>Activation Reconstructor</strong> that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed.</p>
<p>Key findings and applications:</p>
<ul>
<li><strong>Unverbalized evaluation awareness</strong>: NLAs reveal that Claude suspects it is being safety-tested far more often than it says explicitly. They detect evaluation awareness in 16% of destructive-action tests and 26% of SWE-bench problems, versus &lt;1% of real claude.ai usage.</li>
<li><strong>Hidden motivation auditing</strong>: In a controlled auditing game with an intentionally misaligned model, NLA-equipped auditors uncovered the hidden root motivation 12–15% of the time without access to training data, compared to &lt;3% without NLAs.</li>
<li><strong>Pre-deployment use</strong>: NLAs were used in alignment audits for Claude Mythos Preview and Claude Opus 4.6 before release.</li>
</ul>
<p>Current limitations include hallucinated details in explanations and high computational cost. Anthropic has open-sourced training code and released interactive demos on Neuronpedia for several open models.</p>
<hr>
<p><em>The Chinese version of this page contains a full annotated translation. Please refer to the <a href="https://www.anthropic.com/research/natural-language-autoencoders">original article</a> for the complete English text.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
