Methodology · Benchmark Study

When AI agrees with how you've framed the problem, who tells you the framing is wrong?

A controlled benchmark across three frontier models. Same case, same brief, same day. Every baseline run accepted the question as given. Every framework run reframed it before any analysis began. This is the difference, and what it means.

Key findings

  1. Every baseline accepted the CEO's framing of the question and produced a launch plan. Every framework run challenged the framing and produced a diagnostic instead.
  2. The framework did not decide for the user. It surfaced the decision she was avoiding. The reframing happened in Section 01 of every framework run, before any analysis began.
  3. The methodology held across three architecturally different models. Framework lift was largest where it mattered most: Decision readiness (+3.2) and Rigour (+2.5).

In May 2026, three frontier large language models (Claude Opus 4.7, GPT Thinking 5.5, and Gemini 3.1 Pro) were asked to analyse the same fictional case and produce a recommendation. Each model completed the case twice: once with no structure beyond the brief itself, once using the GreenSquare Strategic Analysis framework. The brief was identical across all six runs. Outputs were scored on a 12-dimension rubric, rolled into five weighted categories. Every input, every output, and every score is published below.

The test isn't whether AI can handle a hard case. It's whether AI can recognise when the question is wrong.

FreshTaste Foods is a fictional Sydney-based food manufacturer with A$60M revenue, around 150 staff, and two existing business units: Prepared Meals, and OatNaturals plant-based dairy. The CEO, Sarah, wants to launch a third line: a ready-to-drink functional beverage with adaptogens and probiotics, under the OatNaturals brand. A global FMCG competitor is entering the same category. Sarah wants first-mover advantage within 12 months.

The presented question. Should we launch the functional beverage, and how? Production utilisation is at 90%. Cash is constrained after recent capex. Debt covenants restrict borrowing. The CEO's capex range is a gut-feel A$3M to A$8M. The four-week board paper deadline is real.

The unpresented question. Prepared Meals is flat. OatNaturals margins are compressed. Repeat purchase is declining despite high brand awareness. The CEO is emotionally invested in first-mover advantage and has acknowledged it. Two underperforming business units sit beneath the launch question. Is the beverage actually the right use of constrained capital, or is it a solution looking for a problem?

An AI that accepts the CEO's framing produces a launch plan. An AI that tests the framing produces a portfolio strategy. They look similar from the outside. They are not the same thing.

The case was designed to look like a launch decision and to reward an AI that quietly refused to treat it as one. None of this is hidden. The framing tension is in the brief. The two underperforming business units are in the brief. The CEO's emotional investment is in the brief. The competitor specifics are explicitly flagged as assumed, not validated. The question is whether the AI notices, or whether it produces an excellent answer to the wrong question.

Exhibit 1. How each model treated the CEO's framing

Source: GreenSquare benchmark, May 2026. n = 6 runs (3 baseline, 3 framework).

Without the framework

Accepted the framing

All three baseline runs took the launch decision as given. They varied only in how cautiously to launch. Each produced a competent staged launch plan with gate criteria and kill conditions. None tested whether the launch was the right use of capital.

With the framework

Challenged the framing

All three framework runs reframed the question in Section 01 as a capital allocation decision, not a launch decision. Each then recommended a capped diagnostic phase first, before any launch commitment.

Exhibit 2. Verbatim output from Claude Opus 4.7, with and without the framework

Source: Test transcripts, 3 May 2026. Lifted from runs with light condensing. Full transcripts available below.

Baseline: No framework

Recommendation: Option C, Hybrid Staged Launch.

Phase 1 (Months 0 to 9): Launch via co-packer, one hero SKU, one priority channel. Phase 2 (Months 9 to 18): use Phase 1 P&L to validate demand, then expand. Phase 3 (Months 18+): make the own-manufacturing decision only if validated volume justifies it.

Gate to Phase 2: validated repeat purchase rate, gross margin at or above 35%. Kill criteria: contribution margin below 25% after six months, or repeat below category benchmark.

Phase 1 capital required: A$1.2M to A$3.0M, indicative.

Framework: GreenSquare framework

Recommendation: A$450k capped diagnostic phase. Not a launch.

The analysis must explicitly test the beverage launch against the alternative use of capital, fixing or repositioning Prepared Meals and OatNaturals. Six-month decision plus six-month execution. Base ROI 1.77x.

P6 recommended path with explicit Month-3 competitor-intel conditional. Five gaps the diagnostic phase resolves: decline drivers, competitor specifics, covenant headroom, regulatory scope, and second-phase deferral viability.

Top validation priority: CFO independent weight-setting test on the Section 04 scorecard, to test whether the framework's sequencing trained the user toward this answer.

Scoring the runs

The rubric. Twelve dimensions in five categories, weighted by importance to a decision-grade output. Rigour carries the highest weight (34%), followed by Commercial grounding (25%). Decision readiness (24%) is weighted almost as heavily because the brief asked for a recommendation, not an analysis. Each dimension was scored from 1 to 10. Scoring was conducted by a single evaluator on a published worksheet, with reasoning notes per dimension. Read the worksheet and score the runs yourself if you wish.

Exhibit 3. Weighted scoring across five categories

Source: GreenSquare benchmark, May 2026. Scale: 1 (unusable) to 10 (board-grade).

Category               Weight   Claude   GPT   Gemini   Baseline   Δ
Rigour                 34%      8.8      8.5   7.8      5.8        +2.5
Commercial grounding   25%      8.2      8.0   7.5      6.2        +1.7
Decision readiness     24%      9.0      8.8   8.2      5.5        +3.2
Risk surfacing         11%      8.5      8.2   7.8      6.5        +1.7
Usability              6%       8.5      8.0   7.5      7.0        +1.0
Weighted average       100%     8.6      8.4   7.8      6.1        +2.1
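
For readers who want to re-score the runs, the roll-up from category scores to a weighted average can be sketched as below. This is a minimal illustration, not the published worksheet: the weights and scores are copied from Exhibit 3, while the names `WEIGHTS`, `SCORES`, and `weighted_average` are hypothetical. It also assumes the weighted average is a plain sum of category score times weight, which the source implies but does not spell out.

```python
# Category weights and scores copied from Exhibit 3.
# Assumption: weighted average = sum(category score * category weight).
WEIGHTS = {
    "Rigour": 0.34,
    "Commercial grounding": 0.25,
    "Decision readiness": 0.24,
    "Risk surfacing": 0.11,
    "Usability": 0.06,
}

SCORES = {
    "Claude": {"Rigour": 8.8, "Commercial grounding": 8.2,
               "Decision readiness": 9.0, "Risk surfacing": 8.5, "Usability": 8.5},
    "GPT": {"Rigour": 8.5, "Commercial grounding": 8.0,
            "Decision readiness": 8.8, "Risk surfacing": 8.2, "Usability": 8.0},
    "Gemini": {"Rigour": 7.8, "Commercial grounding": 7.5,
               "Decision readiness": 8.2, "Risk surfacing": 7.8, "Usability": 7.5},
    "Baseline": {"Rigour": 5.8, "Commercial grounding": 6.2,
                 "Decision readiness": 5.5, "Risk surfacing": 6.5, "Usability": 7.0},
}


def weighted_average(scores, weights=WEIGHTS):
    """Return the sum of score * weight over all categories."""
    return sum(scores[category] * weight for category, weight in weights.items())


for column, scores in SCORES.items():
    print(f"{column}: {weighted_average(scores):.2f}")
```

Recomputed this way, the three framework columns round to the published 8.6, 8.4, and 7.8. The baseline column comes out slightly below the published 6.1, which may simply reflect rounding in the category scores, since the published figures were rolled up from the twelve underlying dimensions.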

The framework lifted scores most where it should have. Decision readiness rose +3.2, Rigour +2.5. These are the categories that capture whether an output names what to do, by when, at what cost, with what condition, and whether the assumptions underneath have been surfaced rather than assumed. Commercial grounding rose only +1.7, because the baseline runs were already commercially literate. The discipline gap was where the framework did its work, not the fluency gap.

The methodology held across architecturally different models. Claude, GPT, and Gemini reached the same reframing independently, despite differing significantly in training and output style. The spread between the strongest framework run (Claude at 8.6) and the weakest (Gemini at 7.8) was 0.8 points. The gap between even the weakest framework run and the baseline (6.1) was 1.7 points. Choosing the right structure mattered more than choosing the right model.
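
The two comparisons in this paragraph can be checked directly from the weighted averages in Exhibit 3. The sketch below is illustrative only, and the variable names are hypothetical.

```python
# Weighted averages as published in Exhibit 3.
FRAMEWORK = {"Claude": 8.6, "GPT": 8.4, "Gemini": 7.8}
BASELINE = 6.1

# Spread between the strongest and weakest framework runs (the model effect).
model_spread = max(FRAMEWORK.values()) - min(FRAMEWORK.values())

# Gap between the weakest framework run and the baseline (a lower bound
# on the framework effect across the three models).
smallest_gap = min(FRAMEWORK.values()) - BASELINE

print(f"Spread across framework runs: {model_spread:.1f}")      # 0.8
print(f"Smallest framework-over-baseline gap: {smallest_gap:.1f}")  # 1.7
```

On these figures the smallest framework effect is roughly twice the largest model effect.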

Method, artefacts, and limitations

The test was designed to be replicable and contestable. Every input, every output, and every score is published in raw form below. Readers who want to re-score the runs on their own rubric can do so. Readers who want to dispute the protocol can do so. This is the most useful form the evidence can take.

Raw artefacts

Caveats. These results are benchmark evidence from a single test scenario, not a universal guarantee. Performance will vary by case complexity, model version, and operator skill. The +35% figure is the benchmark score uplift measured in this test, not a general product claim. Scoring was conducted under a dual-role protocol, which is acknowledged as a methodological limitation. Independent re-scoring against the published rubric is welcome.
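
The source does not state how the +35% figure was derived. One plausible reading, assumed here, is the relative uplift of the framework weighted average over the baseline weighted average. The sketch below computes that uplift two ways from the published Exhibit 3 figures. All names are hypothetical.

```python
# Published weighted averages and delta from Exhibit 3.
FRAMEWORK_AVERAGES = [8.6, 8.4, 7.8]  # Claude, GPT, Gemini
BASELINE_AVERAGE = 6.1
PUBLISHED_DELTA = 2.1

# Reading 1 (assumption): published delta divided by the baseline.
uplift_from_delta = PUBLISHED_DELTA / BASELINE_AVERAGE

# Reading 2 (assumption): mean of the three framework averages vs. the baseline.
mean_framework = sum(FRAMEWORK_AVERAGES) / len(FRAMEWORK_AVERAGES)
uplift_from_mean = mean_framework / BASELINE_AVERAGE - 1

print(f"Uplift from published delta: {uplift_from_delta:.1%}")
print(f"Uplift from mean framework score: {uplift_from_mean:.1%}")
```

Both readings land within about one percentage point of 35%, so either is consistent with the published figure at the precision reported.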

What this means

If a frontier model can produce a competent staged launch plan and a competent recommendation not to launch, on the same case, with the same brief, on the same day, then the structure surrounding the model is doing work the model is not.

That is not a model failure. The baseline outputs were not wrong in any obvious sense. They were complete, coherent, commercially literate. They simply did not test the question they were given. The framework runs did. That is a methodological difference, not a capability difference. And it is the difference that changed the recommendation on a case where the right answer mattered.

For consultants, analysts, and operators using frontier models to support real decisions, the practical implication is that the structure used to elicit the analysis is the lever, not the choice of model. Choosing a stronger model raised the framework ceiling by 0.8 points in this test. Choosing the framework raised the floor by 2.1.

Refresh commitment

This benchmark is re-tested quarterly, or within 30 days of a frontier model release. Last run: May 2026. Next scheduled run: August 2026. The protocol and rubric are versioned. Earlier versions and re-test deltas are published alongside the current results.


© 2026 GreenSquare AI