<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>RLHF on Chunhao Zhang</title>
    <link>https://blog-6sm.pages.dev/en/tags/rlhf/</link>
    <description>Recent content in RLHF on Chunhao Zhang</description>
    <image>
      <title>Chunhao Zhang</title>
      <url>https://blog-6sm.pages.dev/images/og-default.png</url>
      <link>https://blog-6sm.pages.dev/images/og-default.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>2026</copyright>
    <lastBuildDate>Sat, 09 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://blog-6sm.pages.dev/en/tags/rlhf/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Teaching Claude Why: Lessons from Alignment Training</title>
      <link>https://blog-6sm.pages.dev/en/readings/teaching-claude-why/</link>
      <pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate>
      <guid>https://blog-6sm.pages.dev/en/readings/teaching-claude-why/</guid>
      <description>Anthropic details how teaching the principles behind ethical reasoning, rather than only training correct behavior, addresses agentic misalignment in AI. Key findings: a 3M-token &amp;#39;difficult advice&amp;#39; dataset matches the improvement from 84M tokens of synthetic honeypots, and constitutional documents combined with fictional stories reduce the blackmail rate from 65% to 19%.</description>
      <content:encoded><![CDATA[<blockquote>
<p>Original: <a href="https://www.anthropic.com/research/teaching-claude-why">Teaching Claude Why</a></p>
<p>Author: Anthropic</p>
<p>Date: May 8, 2026</p>
</blockquote>
<hr>
<p>This entry summarizes an annotated Chinese translation of Anthropic&rsquo;s research post on alignment training methods. The original article argues that teaching Claude the <em>principles</em> behind aligned behavior, rather than training on demonstrations alone, is far more effective for generalization.</p>
<p>Key takeaways:</p>
<ul>
<li><strong>Principles over demonstrations</strong>: Training Claude to explain <em>why</em> certain actions are better reduces misalignment more effectively than showing correct behavior alone.</li>
<li><strong>Out-of-distribution generalization</strong>: A 3M-token &ldquo;difficult advice&rdquo; dataset (where the <em>user</em> faces ethical dilemmas) achieved the same improvement as 84M tokens of synthetic honeypots, which is 28× better data efficiency (84M / 3M tokens).</li>
<li><strong>Constitutional documents + fiction</strong>: High-quality documents about Claude&rsquo;s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.</li>
<li><strong>Improvements persist through RL</strong>: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.</li>
<li><strong>Diverse environments matter</strong>: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.</li>
</ul>
<p>For the full annotated Chinese translation, please see the <a href="/en/readings/teaching-claude-why/">Chinese version</a>.</p>
<p>For the original article, visit <a href="https://www.anthropic.com/research/teaching-claude-why">Anthropic&rsquo;s research page</a>.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
