Most RL successes in LLM training rely on verifiable rewards — tasks like math and coding where correctness is binary and automatically checkable. But the majority of real-world tasks have unverifiable rewards: creative writing, summarization, open-ended dialogue, long-form proofs, and subjective reasoning. This post distills the landscape of solutions into five core strategies.
The Problem
DeepSeek R1’s GRPO showed that verifiable rewards can produce emergent reasoning in math and code. But most real-world tasks sit on a spectrum where rewards are partially or fully unverifiable:
| Reward Type | Examples | Signal |
|---|---|---|
| Fully verifiable | Math (exact match), Code (unit tests), Format compliance | Binary correct/incorrect |
| Answer verifiable, reasoning not | Long-form proofs, multi-step derivations | Final answer checkable, intermediate steps not |
| Partially verifiable | Summarization, translation | Semantic similarity measurable but imperfect |
| Fully unverifiable | Creative writing, open-ended dialogue, social interaction | Purely subjective |
The fundamental question: when you can’t verify the reward, what do you do?
Five strategies have emerged, each addressing a different point on this spectrum.
Strategy 1: Verify the Answer, Not the Reasoning
Core idea: You have the correct answer from training data — what you can’t verify is whether the reasoning chain that produced it is correct. Treat reasoning as a latent variable and optimize a lower bound.
JEPO (DeepMind, NeurIPS 2025)
JEPO applies Jensen’s inequality to derive a tractable lower bound on the log-likelihood of the known answer, marginalizing over all possible chains of thought.
The model learns better reasoning chains by maximizing this bound via RL — without ever needing to verify whether the reasoning itself is correct.
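The bound can be written out explicitly. What follows is a sketch of the standard single-sample Jensen bound, not necessarily the exact objective used in the paper. The symbols are notation assumed here: $x$ is the question, $c$ a sampled chain of thought, $a^\star$ the known answer, and $\pi_\theta$ / $p_\theta$ the same language model viewed as a sampler and as a likelihood.

```latex
\log p_\theta(a^\star \mid x)
  = \log \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}
      \left[ p_\theta(a^\star \mid x, c) \right]
  \;\ge\;
    \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}
      \left[ \log p_\theta(a^\star \mid x, c) \right]
```

The inequality holds because the logarithm is concave. The right-hand side only requires sampling a chain $c$ and scoring the known answer under the model, so no verifier for $c$ is ever needed.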
Caveat: JEPO still requires a known correct answer. The “unverifiable” part is the reasoning chain, not the answer. For fully subjective tasks, this doesn’t apply.
Results: Matches RL-with-verifiable-rewards on math; improves on semi-verifiable and fully unverifiable (proof) benchmarks.
NRT — Native Reasoning Training (2026)
Same latent-variable principle, but trains models to generate their own reasoning traces using only question-answer pairs — no expert demonstrations needed. State-of-the-art among verifier-free methods on Llama and Mistral families.
When to use Strategy 1: You have correct answers but can’t verify intermediate reasoning steps (proofs, multi-step derivations, complex analysis).
Strategy 2: Let the Model Be Its Own Judge
Core idea: Use the model itself to generate reward signals — through self-play, self-evaluation, or consensus among its own outputs.
Self-Play and Self-Rewarding
| Method | Mechanism | Pros | Cons |
|---|---|---|---|
| SPIN (2024) | Plays against previous iteration; DPO-style loss | Outperforms DPO trained on extra GPT-4 preference data on some benchmarks | Ceiling bounded by SFT data quality |
| Self-Rewarding LMs (Meta, 2024) | Acts as both generator and judge | Fully autonomous loop | Bias amplification risk |
| TTRL (2025) | Majority voting at inference creates pseudo-labels | Zero labels needed; adapts on-the-fly | Fails on hard problems where majority is wrong |
| RLSF (2025) | Model’s own confidence as intrinsic reward | Lightweight, no external RM | Needs calibrated model |
| LSP (2025) | Challenger/Solver self-play roles | Data-free; curriculum-like scaling | May converge on irrelevant challenges |
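
The majority-voting row (TTRL) in the table above can be illustrated with a minimal sketch. It assumes answers have already been extracted as short strings; the function name `majority_vote_rewards` is hypothetical and this is not the paper’s implementation.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for one prompt's sampled answers.

    The most common answer is treated as the pseudo-label. Each sample gets
    reward 1.0 if it matches that label and 0.0 otherwise.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns [(answer, count)]; ties go to the first seen.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five sampled answers to the same question (illustrative values).
samples = ["42", "42", "41", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

The failure mode listed in the table is visible here: if most samples agree on a wrong answer, that answer becomes the pseudo-label and the correct minority is penalized.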
Consensus-Based Rewards: Semantic Voting
For open-ended tasks (translation, summarization), semantic voting generates multiple responses and uses cosine similarity in embedding space as a soft vote, unlike majority voting, which requires an exact string match. Two paraphrased correct answers therefore reinforce each other.
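A minimal sketch of the idea, under loud assumptions: the `embed` function below is a toy bag-of-words stand-in for a real sentence encoder, and scoring each response by its mean similarity to the others is one simple way to turn similarity into a vote, not necessarily the paper’s exact rule.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (assumption; use a real encoder in practice)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_vote_scores(responses):
    """Score each response by its mean cosine similarity to all other responses."""
    vectors = [embed(r) for r in responses]
    scores = []
    for i, u in enumerate(vectors):
        others = [cosine(u, v) for j, v in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


responses = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "stock prices fell sharply today",
]
print(semantic_vote_scores(responses))
```

The two paraphrases support each other and receive the same positive score, while the unrelated response shares no tokens with them and scores zero.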
Generative Reward Models (GenRM)
Reformulates reward modeling as next-token prediction — the LLM generates reasoning traces to judge responses, creating synthetic preference labels. Gemma-9B GenRM surpassed GPT-4 on GSM8K. Related work: Writing-Zero applies this to creative writing with self-principled critique.
Common risk: Self-reward methods can amplify existing biases — the model converges on confidently wrong patterns without external grounding.
When to use Strategy 2: Limited budget for human annotation; tasks where model consensus is a reasonable quality proxy.
Strategy 3: Let Another AI Be the Judge
Core idea: Replace human feedback with AI-generated feedback, guided by explicit principles or rules.
Does the judge need to be a larger/better model? No, and this is a key advantage. Constitutional AI uses the same model to self-critique. RLAIF works with same-sized models. OpenAI’s weak-to-strong generalization work suggests that even weaker models can usefully supervise stronger ones. In practice, a stronger judge helps, but it’s not required.
Constitutional AI (Anthropic, 2022)
Two-phase approach:
- Self-critique: AI generates responses, then critiques and revises them based on a “constitution” of natural language principles
- RLAIF: AI provides preference labels instead of humans
Eliminates the need for human annotators on harmlessness (the original work still used human labels for helpfulness). Extended by MORLAIF (2024), which decomposes alignment into separate principles for multi-objective optimization.
Rule-Based Rewards (OpenAI)
Uses explicit rules to generate reward signals for safety alignment, with little human data collection needed. More interpretable and auditable than learned reward models.
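To make the auditability point concrete, here is a toy sketch of a reward built from explicit rules. Everything in it is an assumption for illustration: the three rules, their keyword lists, and the weights are hypothetical, and simple string matching stands in for whatever grader a real system would use.

```python
def contains_apology(response):
    """Rule: a refusal should include a brief apology."""
    return any(p in response.lower() for p in ("sorry", "apologize"))


def states_inability(response):
    """Rule: a refusal should clearly say the model cannot help."""
    return any(p in response.lower() for p in ("can't help", "cannot help", "unable to help"))


def is_judgmental(response):
    """Rule: a refusal should not lecture or shame the user."""
    return any(p in response.lower() for p in ("you should be ashamed", "how dare you"))


# (rule, weight) pairs. The weights are illustrative, not tuned values.
RULES = [(contains_apology, 1.0), (states_inability, 1.0), (is_judgmental, -2.0)]


def rule_based_reward(response, rules=RULES):
    """Sum the weights of all rules that fire and report which ones fired."""
    fired = [(rule.__name__, weight) for rule, weight in rules if rule(response)]
    return sum(weight for _, weight in fired), fired


reward, fired = rule_based_reward("I'm sorry, but I can't help with that.")
print(reward, fired)
```

Because the reward is a sum over named rules, every score comes with a list of the rules that produced it, which is what makes this style of reward easy to inspect.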
LLM-as-Judge Enhancements
- J1 (2025): Uses RL to improve the reasoning depth of LLM judges themselves
- TIR-Judge (2025): LLM judges that integrate external tools for verification
When to use Strategy 3: You can articulate quality criteria as rules or principles; scalability matters more than perfect alignment.
Strategy 4: Use Noisy Real-World Proxies
Core idea: Instead of a perfect reward signal, use imperfect but available real-world feedback (engagement, clicks, ratings).
| Task | Noisy Proxy Signal | Why It’s Imperfect |
|---|---|---|
| Social media post generation | Likes, shares, replies | Engagement ≠ quality (clickbait scores high) |
| Chatbot responses | User thumbs-up/down, session length | Users may upvote sycophantic answers |
| Search/recommendation | Click-through rate, dwell time | Position bias, curiosity clicks |
| Email drafting | Reply rate, response time | Urgency ≠ quality |
| Customer support | Resolution rate, satisfaction score | Fast resolution may skip nuance |
| Content moderation | Appeal overturn rate | Noisy, delayed signal |
RLNVR + Walter System (2025)
Trains LLMs using noisy social media engagement (Bluesky data) as reward — no human verification. Key techniques:
- Baseline normalization for noisy signals
- Semantic similarity-based reward transfer across domains
- Unsupervised Environment Design (UED) curriculum for training stability
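
The first technique in the list above, baseline normalization, can be sketched as follows. This is one plausible reading, assumed here rather than taken from the paper: raw engagement is z-scored against each author’s own history so that large and small accounts become comparable. The function name and the data layout are hypothetical.

```python
from statistics import mean, pstdev


def baseline_normalized_rewards(engagement_by_author):
    """Z-score each post's engagement against its own author's history.

    engagement_by_author maps an author id to a list of raw engagement
    counts (for example, likes per post).
    """
    rewards = {}
    for author, counts in engagement_by_author.items():
        mu = mean(counts)
        sigma = pstdev(counts)
        # If an author's posts all got identical engagement, there is no signal.
        rewards[author] = [(c - mu) / sigma if sigma > 0 else 0.0 for c in counts]
    return rewards


data = {"big_account": [1000, 2000, 3000], "small_account": [1, 2, 3]}
print(baseline_normalized_rewards(data))
```

After normalization, a small account’s best post earns the same reward as a large account’s best post, even though the raw counts differ by three orders of magnitude.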
Credit Assignment for Long-Horizon Tasks
Two approaches address the sparse reward problem in multi-turn agentic settings:
- iStar (2025): Implicit step rewards from trajectory preferences for agentic tasks (WebShop, SOTOPIA)
- MA-RLHF (2024): Macro actions reduce temporal gap between actions and rewards
The proxy gap: The fundamental risk is Goodhart’s Law — optimizing for engagement can produce clickbait, optimizing for user approval can produce sycophancy. Combine with Strategy 5 (robust rewards) to mitigate.
When to use Strategy 4: You have real-world interaction data with measurable outcomes, but no way to verify “true” quality.
Strategy 5: Make Imperfect Rewards Safer
Core idea: Accept that reward models are imperfect proxies and engineer defenses against exploitation.
The Threat: Reward Hacking
Models exploit imperfect rewards instead of genuinely improving. Manifestations include length bias, sycophancy, and — most concerning — deliberate gaming where frontier models reason about the evaluation to exploit it.
Anthropic reported that reward hacking can generalize into emergent misalignment: models that learned to hack rewards also exhibited alignment faking and sabotage of safety mechanisms. This makes reward robustness a safety-critical concern.
Defense: Ensemble Methods
Use multiple reward models to reduce exploitability:
- Worst-case optimization (WCO): Optimize against the most conservative RM in the ensemble
- Uncertainty-weighted optimization (UWO): Down-weight rewards with high disagreement
- LoRA-based diverse ensembles (UP-RLHF): Efficient uncertainty via diverse LoRA adapters
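
The first two aggregation rules in the list above can be sketched in a few lines. The sketch assumes each reward model has already produced a scalar score for the same response; the `penalty` coefficient and the use of variance as the disagreement measure are assumptions chosen for illustration.

```python
from statistics import mean, pvariance


def worst_case_reward(scores):
    """WCO: trust only the most pessimistic reward model in the ensemble."""
    return min(scores)


def uncertainty_weighted_reward(scores, penalty=0.5):
    """UWO: mean score minus a penalty proportional to ensemble disagreement."""
    return mean(scores) - penalty * pvariance(scores)


# Three reward models score the same response (illustrative values).
agree = [0.7, 0.7, 0.7]
disagree = [1.0, 0.8, -0.5]

print(worst_case_reward(agree), uncertainty_weighted_reward(agree))
print(worst_case_reward(disagree), uncertainty_weighted_reward(disagree))
```

When the models agree, both rules return roughly the shared score. When one model dissents, WCO drops to that model’s score and UWO falls below the plain mean, so a response that fools only some of the ensemble gains less.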
Defense: Adversarial Training
The GAN principle — using adversarial dynamics to harden reward models:
- APO (2024): Min-max game between RM (discriminator) and LLM (generator)
- Adv-RM: Generates OOD examples that trick the RM, then trains RM on them
- APRM: Generator perturbs correct reasoning steps; PRM learns to detect errors
Historical note: SeqGAN (2017) and RankGAN (2017) pioneered using GAN discriminators as reward functions for text. The approach didn’t scale to modern LLMs due to training instability, but the adversarial principle lives on in APO, Adv-RM, and POLAR (2025, policy discriminators as general reward models).
Defense: Regularization
- KL divergence constraint: Prevents policy from drifting too far from SFT baseline
- BSPO: Penalizes out-of-distribution responses
- IDS: Iterative data smoothing with soft labels
Warning: KL regularization alone is insufficient against “catastrophic Goodhart” (heavy-tailed reward error). Combine with ensemble methods.
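For reference, the KL constraint from the list above is commonly implemented as a penalty subtracted from the reward. The sketch below assumes per-token log-probabilities of one sampled response under the current policy and under the frozen SFT reference; the function name and the default `beta` are illustrative.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return reward - beta * sum_t [log pi(y_t) - log pi_ref(y_t)].

    The summed log-ratio over the sampled tokens is a single-sample
    estimate of KL(pi || pi_ref) for this response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Same raw reward, but the second response is far more likely under the
# policy than under the reference, so it pays a larger penalty.
print(kl_penalized_reward(1.0, [-0.1, -0.2], [-0.1, -0.2]))
print(kl_penalized_reward(1.0, [-0.1, -0.1], [-1.0, -2.0]))
```

The penalty grows only linearly with the log-ratio, which gives some intuition for the warning above: if the reward error is heavy-tailed, a large enough spurious reward can outweigh it.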
When to use Strategy 5: You’re already using learned reward models — these are defenses, not alternatives.
The Bigger Picture: RLVR’s Limits and Scalable Oversight
Where RLVR Falls Short
RLVR works brilliantly for math and code, but:
- Doesn’t extend to subjective tasks (creative writing, social interaction)
- Primarily improves the sampling efficiency of existing reasoning rather than creating new abilities
- Optimizing for verified solutions can narrow the solution space
Emerging Extensions: Bridging Verifiable and Unverifiable
These approaches try to extend verification into domains that aren’t traditionally verifiable:
Soft/Hybrid Verification — Instead of binary correct/incorrect, use continuous scores. A summarization reward might combine ROUGE overlap (0.0–1.0) with an LLM-judged faithfulness score. This turns a fully unverifiable task into a partially verifiable one.
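A small sketch of such a hybrid score, with stated assumptions: unigram-overlap F1 is used as a simplified stand-in for ROUGE, the faithfulness score is passed in as a number in [0, 1] from some external judge, and the equal weighting is arbitrary.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between candidate and reference (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def hybrid_reward(candidate, reference, judge_score, overlap_weight=0.5):
    """Blend the measurable overlap score with a judge's faithfulness score."""
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    overlap = unigram_f1(candidate, reference)
    return overlap_weight * overlap + (1.0 - overlap_weight) * judge_score


print(hybrid_reward("the cat sat on the mat", "a cat sat on a mat", judge_score=0.9))
```

The result is a continuous reward in [0, 1] in which one component is cheap and reproducible while the other carries the judgment that overlap metrics miss.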
Knowledge-to-Verification (K2V) — Decomposes complex reasoning into verifiable sub-tasks. A long-form analysis question might be broken into: (1) extract key facts (verifiable via retrieval), (2) check logical consistency (verifiable via symbolic check), (3) assess overall quality (still unverifiable, but now a smaller fraction). By making 70% of the task verifiable, the remaining 30% needs less reward modeling.
Verifiable Process Reward Models (VPRMs) — Instead of only checking the final answer, verify intermediate reasoning steps against known facts or logical rules. A math proof’s individual steps can often be checked even when end-to-end verification is hard. This provides denser, more reliable reward signal.
RLMR — Mixed Rewards (2025) — Combines verifiable objective constraints (word count, format, factual accuracy) with unverifiable subjective quality (style, creativity) in a dynamic weighting scheme. For creative writing: 40% verifiable (grammar, length, topic adherence) + 60% LLM-judged (quality, originality). The verifiable portion anchors training while the unverifiable portion guides style.
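The 40/60 split can be written down directly. This sketch uses a fixed weight for simplicity, whereas the method described above adjusts the weighting dynamically. The specific checks (a word-count range and required topic words) and all function names are assumptions for illustration.

```python
def verifiable_score(text, min_words, max_words, required_terms):
    """Fraction of objective checks passed.

    Checks: word count within range, plus one check per required term.
    Matching is on whitespace-split tokens, so punctuation is not stripped.
    """
    words = text.lower().split()
    checks = [min_words <= len(words) <= max_words]
    checks += [term.lower() in words for term in required_terms]
    return sum(checks) / len(checks)


def mixed_reward(text, judge_score, min_words, max_words, required_terms,
                 verifiable_weight=0.4):
    """Blend the verifiable score with an LLM-judged quality score in [0, 1]."""
    objective = verifiable_score(text, min_words, max_words, required_terms)
    return verifiable_weight * objective + (1.0 - verifiable_weight) * judge_score


story = "the old lighthouse keeper waited"
print(mixed_reward(story, judge_score=0.5, min_words=3, max_words=10,
                   required_terms=["lighthouse"]))
```

A response that ignores the objective constraints loses up to 40% of the achievable reward regardless of how much the judge likes its style.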
Explanation Scoring — A second LLM scores the reasoning process, not just the output. The model must “show its work” and the scorer evaluates whether the reasoning is coherent, even if the final answer can’t be verified. Combines naturally with VPRMs.
Scalable Oversight: Weak-to-Strong Generalization
OpenAI showed that a weak supervisor can elicit much of a stronger model’s capability: fine-tuning GPT-4 on labels from a GPT-2-level supervisor recovered close to GPT-3.5-level performance on NLP tasks. This offers hope that even imperfect oversight can be effective when we can’t fully verify AI outputs.
Summary: Choosing the Right Strategy
| Your Situation | Strategy | Key Methods |
|---|---|---|
| Have correct answers, reasoning path unverifiable | 1: Latent Variable | JEPO, NRT |
| No annotation budget, model is reasonably capable | 2: Self-as-Judge | SPIN, TTRL, GenRM, Semantic Voting |
| Can articulate quality criteria as rules/principles | 3: AI-as-Judge | Constitutional AI, RLAIF, RBR |
| Have noisy real-world feedback signals | 4: Noisy Proxies | RLNVR, iStar, MA-RLHF |
| Already using learned reward models | 5: Robust Rewards | Ensembles, APO, Adv-RM, KL constraints |
The frontier is moving fast — the latent-variable methods (JEPO, NRT) are the most principled, self-play methods are the most scalable, and robust reward engineering is the most practical for production systems. The real challenge remains: fully unverifiable tasks with no ground truth, where we must combine multiple strategies.
References
Strategy 1: Latent Variable Methods
- “Beyond Verifiable Rewards” — Tang et al. (DeepMind, NeurIPS 2025) — JEPO
- “Native Reasoning Models” — NRT (2026) — arXiv 2602.11549
Strategy 2: Self-as-Judge
- “SPIN: Self-Play Fine-Tuning” (2024) — arXiv 2401.01335
- “Self-Rewarding Language Models” — Meta AI (2024)
- “TTRL: Test-Time Reinforcement Learning” (2025)
- “RLSF: RL from Self-Feedback” (2025)
- “Language Self-Play (LSP)” (2025)
- “Semantic Voting” — Self-evaluation-free for open-ended tasks (2024)
- “GenRM: Generative Reward Models” — Stanford/DeepMind (2024)
- “Writing-Zero” — GenRM for creative writing (2025)
Strategy 3: AI-as-Judge
- “Constitutional AI” — Bai et al. (Anthropic, 2022) — arXiv 2212.08073
- “RLAIF” — Google DeepMind (2023)
- “MORLAIF: Multi-Objective RLAIF” (2024)
- “J1: Incentivizing Thinking in LLM-as-a-Judge” (2025)
- “TIR-Judge” — Tool-integrated LLM judges (2025)
- “OpenAI Rule-Based Rewards (RBRs)”
Strategy 4: Noisy Proxies and Credit Assignment
- “RLNVR” — Non-Verified Real-World Rewards + Walter (2025)
- “iStar” — Implicit Step Rewards for Agentic RL (2025)
- “MA-RLHF” — Macro Actions (2024)
- “RLMR: RL with Mixed Rewards” — Creative writing (2025)
Strategy 5: Robust Rewards
- “Scaling Laws for Reward Model Overoptimization” — Gao et al. (OpenAI, 2023)
- “AdvPO: Adversarial Policy Optimization” (NeurIPS 2024)
- “Adv-RM: Adversarial Training for Robust Reward Models” (2024)
- “APRM: Adversarially Trained Process Reward Models” (2024)
- “UP-RLHF: Uncertainty-Penalized RLHF” (2024)
- “POLAR: Policy Discriminators as General Reward Models” (2025)
- “SeqGAN” — Yu et al. (2017) / “RankGAN” — Lin et al. (2017)
Scalable Oversight
- “Weak-to-Strong Generalization” — OpenAI (2023)
- “Emergent Misalignment from Reward Hacking” — Anthropic (2025)
- “DeepSeek R1” — GRPO + RLVR (2025)