0. One-Page Summary
Goal: empirically answer, “Does Razor reduce reasoning waste without harming quality?”
Design: A/B test on 2–5 reasoning-heavy workloads with identical model + infra + decoding.
Arms: A = baseline, B = baseline + Razor (prompt-only or controller), optional B2 = stronger controller.
Track: tokens/task, latency, peak KV-cache/memory, backtracking rate, tool calls, quality/accuracy.
Report: % deltas + confidence intervals + any quality regressions + “where it works best.”
1. Purpose of This Protocol
This page defines a neutral, empirical evaluation protocol for AI labs, platforms, and research groups who want to test Robbie’s Razor on their own systems. It is written for engineering and efficiency teams who need:
- a clear A/B test design,
- standard metrics and instrumentation,
- guidance on where Razor should be applied, and
- a repeatable way to compare baseline vs. Razor-guided reasoning.
This protocol is architecture-agnostic (transformers, MoE, multimodal, agents). It does not require retraining: you can test with prompts and controller logic, and optionally integrate Razor into preference/RL pipelines after initial results.
For the strategic rationale, see Why Robbie’s Razor Wins.
2. Scope & Recommended Use Cases
This protocol is designed for reasoning-heavy workloads where models tend to:
- produce long multi-step plans,
- use tools or APIs in loops,
- revisit the same evidence multiple times, or
- hallucinate and self-correct in extended dialogues.
Typical evaluation targets include:
- complex question answering and analysis,
- tool-using agents and planners,
- RAG with multi-hop retrieval,
- code generation with iterative refinement,
- policy and safety reasoning chains.
For short-form or single-shot classification tasks, the impact may be smaller. Highest leverage is long-context, multi-step, tool-using work.
3. Core Metrics to Track
These metrics let you quantify whether Razor reduces waste (tokens/compute/branching) while preserving or improving quality. Track all of them if possible.
3.1 Efficiency Metrics
- Tokens per request (and tokens per successful task if you have success criteria).
- FLOPs per request (or an internal FLOP proxy / GPU-time proxy).
- KV-cache peak / memory peak per request.
- Latency (P50 / P90 / P99) per workload.
3.2 Reasoning Structure Metrics
- Step count in plan/trace (where logged).
- Backtracking events (undo/reversal of conclusions).
- Redundancy rate (re-deriving/re-summarizing same content).
- Tool calls per request (for agents).
3.3 Quality & Robustness Metrics
- Accuracy / win-rate on your eval tasks.
- Hallucination rate (per your internal definition or labels).
- Self-contradiction rate within an answer.
- Preference (human or model judge) baseline vs Razor.
The target outcome is: lower tokens/compute + lower structural waste with stable or improved quality.
4. Experiment Design & A/B Setup
Recommended design: a controlled A/B experiment with stable infra and minimal confounders.
- Select representative workloads
Choose 2–5 workloads that reflect production usage: real prompts, internal eval sets, or benchmark suites. Include at least one long-context or multi-step task.
- Fix model + infrastructure
Same model version/weights, same infra, same decoding settings, same batching. No routing/quantization changes between arms.
- Define arms
- A (Baseline): current setup
- B (Razor): baseline + Razor guidance (prompt or controller)
- Optional B2: stronger controller configuration (more aggressive branch pruning)
- Control randomness
Use fixed seeds where possible. If stochastic, run enough samples to smooth variance (hundreds to low-thousands per workload).
- Log all metrics
Capture metrics (Section 3) using a consistent schema across arms.
Practical tip: If quality regresses in B, add B2 (controller) and tune: (a) recursion trigger strictness, (b) memory carry-forward size, (c) step budget.
5. Implementation Levels: How to Apply Razor
Test Razor at three integration depths. Start at Level 1 and escalate only if needed.
5.1 Level 1 — Prompt-Only Razor (Zero-Code)
Add a system instruction encouraging the model to follow compression → expression → memory → recursion:
You are required to organize your reasoning using Robbie’s Razor:
1. COMPRESSION
- Restate the problem in the smallest precise form.
- List only the essential facts.
2. EXPRESSION
- Propose a short plan of attack.
- Avoid exploring many speculative paths.
3. MEMORY
- Keep a small list of stable conclusions already reached.
- Reuse them instead of re-deriving or contradicting them.
4. RECURSION
- Only expand or branch when something is unclear or inconsistent.
- If a step does not change the answer or improve clarity, do not take it.
Prefer answers that:
- Use fewer steps with equal or better correctness,
- Reuse prior conclusions,
- Avoid backtracking and internal contradiction.
Best for smoke tests. It often reduces over-talking and redundancy immediately.
5.2 Level 2 — Controller-Level Razor (Step Filter)
Razor acts as a controller around the model:
- Maintain a
state object (question, context, small memory list).
- Ask the model for 1–3 next-step candidates and a short score for each (necessity, redundancy, impact).
- Select the step that maximizes compression + impact and minimizes redundancy.
- Execute only that step, then update
memory with stable conclusions.
- Repeat until stable.
This often yields fewer branches, lower KV-cache growth, and cleaner traces.
5.3 Level 3 — Training-Time Integration (Optional)
- Reward component: reward reuse, concision, non-contradiction, and phase completion.
- Preference labels: prefer solutions that solve with fewer redundant steps and higher coherence.
Not required for initial testing; consider after positive A/B results.
6. Analysis & Interpreting Results
Compute these deltas per workload and overall:
- % change in tokens per request (and per successful task where possible)
- % change in FLOPs / proxy
- % change in KV-cache peak
- Latency deltas (P50/P90/P99)
- Quality deltas (accuracy, win-rate, preference)
- Hallucination / backtracking changes
A “win” looks like:
- double-digit reductions in tokens/compute on reasoning-heavy workloads,
- equal or improved quality,
- lower backtracking and cleaner traces.
For environmental interpretation once you have tokens/compute deltas, see Environmental Impact & Computational Ecology.
Empirical Update — Depth-8 Refresh Cadence Evaluation (v1.4)
As part of ongoing protocol refinement, a controlled depth-8 recursive evaluation was conducted across multiple structured workloads to examine how refresh cadence influences stability under compression.
Three balanced refresh schedules were tested:
-
Sparse refresh: {1, 7}
-
Moderate refresh: {1, 4, 7}
-
Frequent refresh: {1, 3, 5, 7}
These were compared against compute-heavy (SOURCE available at every step) and memory-heavy (SOURCE available only at step 1) regimes.
Observed patterns:
-
Compute-heavy consistently achieved the highest fact retention across fixtures.
-
Under ID-collision workloads (near-identical identifiers and thresholds), increasing refresh frequency improved retention.
-
Under constraint-heavy workloads (hard limits, stop rules, guardrails), retention exhibited non-monotonic behavior. The intermediate cadence {1,4,7} performed best, while more frequent refresh degraded stability.
-
Guardrail-style constraints and stop conditions were the most perishable fact classes under capsule-only recursion.
These findings suggest that refresh cadence interacts with content structure rather than producing a universal optimal midpoint.
This evaluation is exploratory and fixture-dependent. It does not alter the canonical theory defined in MRD v1.8.
Full experimental details are available in the v1.4 empirical note within the benchmark repository.
Empirical Interpretation (Lab-Oriented Summary)
The v1.4 depth-8 cadence sweeps represent an early controlled probe of recursive stability under fixed resource constraints. The objective was not to validate a theorem, but to observe how refresh cadence interacts with content structure during bounded recursive compression.
Across tested fixtures, three consistent patterns emerged:
-
Compute-heavy regimes (SOURCE available at each step) achieved the highest retention.
This establishes an upper bound for stability under the current protocol.
-
Balanced refresh behavior was fixture-dependent.
-
Under collision-heavy content (near-identical identifiers and thresholds), increased refresh frequency improved retention monotonically.
-
Under constraint-heavy content (hard limits, stop rules, guardrails), retention exhibited non-monotonic behavior. An intermediate cadence performed best, while more frequent refresh degraded stability.
-
Guardrail-style constraints were the most perishable fact class under capsule-only recursion.
Hard limits, stop conditions, and meta-constraints degraded earlier than structural identifiers or numeric thresholds.
These observations suggest that refresh cadence interacts with content type rather than producing a universal optimal midpoint. Stability behavior appears sensitive to the structural composition of the task (collision-heavy vs. constraint-heavy) and the compression burden imposed by capsule size.
Importantly, these findings:
-
Do not modify R-level definitions.
-
Do not alter canonical theory in MRD v1.8.
-
Do not establish a universal stability minimum.
-
Are based on deterministic depth-8 trials and remain fixture-dependent.
They instead demonstrate that recursive compression dynamics are measurable, that cadence effects are observable, and that empirical probing can meaningfully refine deployment guidance for real workloads.
Further replication across additional fixtures, models, and capsule budgets is required before prescriptive operational guidance is issued.
Reproducible Evaluation Reference
A minimal, model-agnostic reference implementation is available for labs wishing to execute this protocol in practice:
- memory stabilization + selective replay patterns aligned with R3–R4 compliance,
- unit tests for confidence gating, eviction, replay bias, collision resilience,
- a clean execution surface without requiring retraining or architecture changes.
Reference implementation (GitHub):
https://github.com/RobbieRazor/robbies-razor-benchmarks
All contents are governed by the Authorship Conservation Rule (ACR) and remain attributed to Robbie George.
Razor Licensing & Evaluation Opens March 20, 2026
Formal evaluation and licensing pathways for Robbie’s Razor will open on March 20, 2026, following completion of the Grand Compression Foundation governance layer.
If your organization would like to be notified when evaluation slots become available, join the waitlist:
Join Waitlist
7. Reporting, Evaluation Tier & Next Steps
Labs can run this protocol privately. For formal evaluation terms, see the Free Evaluation Tier in the Licensing & Environmental Royalty Framework.
Typical next steps after positive results:
- Document findings internally
Summarize efficiency deltas, quality impact, and infra implications.
- Review licensing terms and caps
See the relevant sections on the Pricing & Caps page.
- Engage via the AI Labs portal
Use AI Labs & Licensing to discuss formal integration and reporting.
This protocol is intentionally neutral. It is designed to answer one question empirically: “Does Robbie’s Razor reduce our reasoning waste without harming quality?”
8. Quick Checklist for Engineering Teams
- ✅ Selected 2–5 representative reasoning-heavy workloads.
- ✅ Fixed model version, infra, and decoding settings for A/B.
- ✅ Implemented Level 1 (prompt-only) or Level 2 (controller) Razor integration.
- ✅ Logged tokens, FLOPs/proxy, KV-cache peak, latency, and quality metrics.
- ✅ Ran enough queries per arm to smooth randomness (hundreds to low-thousands).
- ✅ Compared baseline vs Razor across efficiency + quality + stability.
- ✅ Reviewed results with infra + product + research stakeholders.
For compliance scoring after evaluation, see Robbie’s Razor Compliance Framework.