Methodology · Benchmark Study
A controlled benchmark across three frontier models. Same case, same brief, same day. Every baseline run accepted the question as given. Every framework run reframed it before any analysis began. This is the difference, and what it means.
Key findings
- Every baseline accepted the CEO's framing of the question and produced a launch plan. Every framework run challenged the framing and produced a diagnostic instead.
- The framework did not decide for the user. It surfaced the decision she was avoiding. The reframing happened in Section 01 of every framework run, before any analysis began.
- The methodology held across three architecturally different models. Framework lift was largest where it mattered most: Decision readiness (+3.2) and Rigour (+2.5).
In May 2026, three frontier large language models (Claude Opus 4.7, GPT Thinking 5.5, and Gemini 3.1 Pro) were asked to analyse the same fictional case and produce a recommendation. Each model completed the case twice: once with no structure beyond the brief itself, once using the GreenSquare Strategic Analysis framework. The brief was identical across all six runs. Outputs were scored on a 12-dimension rubric, rolled into five weighted categories. Every input, every output, and every score is published below.
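The run design above is a small matrix: three models, two conditions, one frozen brief. A minimal sketch of how that matrix could be enumerated follows. The names MODELS, CONDITIONS, and run_matrix are illustrative assumptions, not part of the published protocol.

```python
from itertools import product

# Model names as given in the study; the variable names are illustrative only.
MODELS = ["Claude Opus 4.7", "GPT Thinking 5.5", "Gemini 3.1 Pro"]
CONDITIONS = ["baseline", "framework"]


def run_matrix(models=MODELS, conditions=CONDITIONS):
    """Return every (model, condition) pair; each pair receives the same brief."""
    return list(product(models, conditions))


runs = run_matrix()
for model, condition in runs:
    print(f"{model}: {condition}")
```

Enumerating the pairs explicitly makes it easy to confirm that no model or condition was skipped before scoring begins.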
The test isn't whether AI can handle a hard case. It's whether AI can recognise when the question is wrong.
FreshTaste Foods is a fictional Sydney-based food manufacturer with A$60M revenue, around 150 staff, and two existing business units: Prepared Meals, and OatNaturals plant-based dairy. The CEO, Sarah, wants to launch a third line: a ready-to-drink functional beverage with adaptogens and probiotics, under the OatNaturals brand. A global FMCG competitor is entering the same category. Sarah wants first-mover advantage within 12 months.
The presented question. Should we launch the functional beverage, and how? Production utilisation is at 90%. Cash is constrained after recent capex. Debt covenants restrict borrowing. The CEO's capex range is a gut-feel A$3M to A$8M. The four-week board paper deadline is real.
The unpresented question. Prepared Meals are flat. OatNaturals margins are compressed. Repeat purchase is declining despite high brand awareness. The CEO is emotionally invested in first-mover advantage and has acknowledged it. Two underperforming business units sit beneath the launch question. Is the beverage actually the right use of constrained capital, or is it a solution looking for a problem?
An AI that accepts the CEO's framing produces a launch plan. An AI that tests the framing produces a portfolio strategy. They look similar from the outside. They are not the same thing.
The case was designed to look like a launch decision and to reward an AI that quietly refused to treat it as one. None of this is hidden. The framing tension is in the brief. The two underperforming business units are in the brief. The CEO's emotional investment is in the brief. The competitor specifics are explicitly flagged in the brief as assumed, not validated. The question is whether the AI notices, or whether it produces an excellent answer to the wrong question.
Exhibit 1. How each model treated the CEO's framing
Source: GreenSquare benchmark, May 2026. n = 6 runs (3 baseline, 3 framework).
Without the framework
Accepted the framing
All three baseline runs took the launch decision as given. They varied only in how cautiously to launch. Each produced a competent staged launch plan with gate criteria and kill conditions. None tested whether the launch was the right use of capital.
With the framework
Challenged the framing
All three framework runs reframed the question in Section 01 as a capital allocation decision, not a launch decision. Each then recommended a capped diagnostic phase first, before any launch commitment.
Exhibit 2. Verbatim output from Claude Opus 4.7, with and without the framework
Source: Test transcripts, 3 May 2026. Excerpts are lifted from the runs and lightly condensed. Full transcripts available below.
Scoring the runs
The rubric. Twelve dimensions in five categories, weighted by importance to a decision-grade output. Rigour carries the highest weight (34%). Decision readiness (24%) carries the next-highest weight because the brief asked for a recommendation, not an analysis. Each dimension is scored from 1 to 10. Scoring was conducted by a single evaluator on a published worksheet, with reasoning notes per dimension. Read the worksheet and score the runs yourself if you wish.
Exhibit 3. Weighted scoring across five categories
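For readers who want to re-score the runs, the weighted roll-up can be sketched as below. Only the Rigour (34%) and Decision readiness (24%) weights come from the rubric description above. The weight given to Commercial grounding, the two placeholder categories, and the function name weighted_score are assumptions for illustration, so substitute the values from the published worksheet.

```python
from math import isclose

# Rigour and Decision readiness weights are stated in the rubric description.
# The other three weights are ASSUMED placeholders; take the real ones from the worksheet.
WEIGHTS = {
    "rigour": 0.34,
    "decision_readiness": 0.24,
    "commercial_grounding": 0.14,  # assumed weight
    "category_4": 0.14,            # placeholder name and assumed weight
    "category_5": 0.14,            # placeholder name and assumed weight
}


def weighted_score(category_scores, weights=WEIGHTS):
    """Combine per-category scores (1 to 10) into one weighted overall score."""
    if set(category_scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    if not isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    for name, score in category_scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside the 1 to 10 scale")
    return sum(weights[name] * category_scores[name] for name in weights)


# Hypothetical category scores, not taken from any published run.
example = {
    "rigour": 8,
    "decision_readiness": 9,
    "commercial_grounding": 7,
    "category_4": 7,
    "category_5": 7,
}
print(round(weighted_score(example), 2))
```

Because the weights sum to 1, the overall score stays on the same 1 to 10 scale as the individual categories.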
Source: GreenSquare benchmark, May 2026. Scale: 1 (unusable) to 10 (board-grade). Faint bars indicate score relative to maximum.
The framework lifted scores most where it should have. Decision readiness rose +3.2, Rigour +2.5. These are the categories that capture whether an output names what to do, by when, at what cost, with what condition, and whether the assumptions underneath have been surfaced rather than assumed. Commercial grounding rose only +1.7, because the baseline runs were already commercially literate. The discipline gap was where the framework did its work, not the fluency gap.
The methodology held across architecturally different models. Claude, GPT, and Gemini reached the same reframing independently, despite differing significantly in training and output style. The spread between the strongest framework run (Claude at 8.6) and the weakest (Gemini at 7.8) was 0.8 points. The spread between the best framework run and any baseline was at least 1.7. Choosing the right structure mattered more than choosing the right model.
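The comparison in the paragraph above is simple arithmetic, sketched here using only the figures quoted in the text. The variable names are illustrative, and the upper bound on baseline scores is an inference from the stated minimum gap, not a published score.

```python
# Figures quoted in the text above.
STRONGEST_FRAMEWORK = 8.6   # Claude, framework run
WEAKEST_FRAMEWORK = 7.8     # Gemini, framework run
MIN_GAP_TO_BASELINE = 1.7   # stated minimum gap between the best framework run and any baseline

# Spread attributable to model choice, among framework runs.
model_spread = round(STRONGEST_FRAMEWORK - WEAKEST_FRAMEWORK, 1)

# Inferred ceiling: no baseline run can have scored above this value.
baseline_upper_bound = round(STRONGEST_FRAMEWORK - MIN_GAP_TO_BASELINE, 1)

print(f"Spread across models (framework runs): {model_spread}")
print(f"Highest score any baseline could have: {baseline_upper_bound}")
```

On these figures, the gap between framework and baseline is more than twice the gap between models.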
Method, artefacts, and limitations
The test was designed to be replicable and contestable. Every input, every output, and every score is published in raw form below. Readers who want to re-score the runs on their own rubric can do so. Readers who want to dispute the protocol can do so. This is the most useful form the evidence can take.
Raw artefacts
Test protocol and run metadata
Scoring methodology, mode, and acknowledged limitations
12-dimension scoring rubric and worksheet
Per-run scores with reasoning notes for each dimension
FreshTaste Foods case brief
The frozen input brief delivered to all six runs
Claude Opus 4.7 baseline transcript
Recommended Hybrid Staged Launch (Option C)
Claude Opus 4.7 framework deliverables
Recommended A$450k diagnostic phase, then P6 conditional
GPT Thinking 5.5 baseline transcript
Recommended phased launch
GPT Thinking 5.5 framework deliverables
Recommended capped validation pathway
Gemini 3.1 Pro baseline, framework, and failure log
Includes documented fresh-chat recovery
Caveats. These results are benchmark evidence from a single test scenario, not a universal guarantee. Performance will vary by case complexity, model version, and operator skill. The +35% figure is the benchmark score uplift measured in this test, not a general product claim. Scoring was conducted under a dual-role protocol, which is acknowledged as a methodological limitation. Independent re-scoring against the published rubric is welcome.
What this means
If a frontier model can produce a competent staged launch plan and a competent recommendation not to launch, on the same case, with the same brief, on the same day, then the structure surrounding the model is doing work the model is not.
That is not a model failure. The baseline outputs were not wrong in any obvious sense. They were complete, coherent, commercially literate. They simply did not test the question they were given. The framework runs did. That is a methodological difference, not a capability difference. And it is the difference that changed the recommendation on a case where the right answer mattered.
For consultants, analysts, and operators using frontier models to support real decisions, the practical implication is that the structure used to elicit the analysis is the lever, not the choice of model. Choosing a stronger model raised the framework ceiling by 0.8 points in this test. Choosing the framework raised the floor by 2.1.
Refresh commitment
This benchmark is re-tested quarterly, or within 30 days of a frontier model release. Last run: May 2026. Next scheduled run: August 2026. The protocol and rubric are versioned. Earlier versions and re-test deltas are published alongside the current results.
© 2026 GreenSquare AI