GEO · MCP · Evals

geocheck

An AI-search brand-visibility (GEO) checker that is both a CLI and an MCP server, shipped with its own reproducible eval harness — transparent unit-tested metrics and an extractor graded against human labels behind a CI gate. Runs offline with no API key.

GEO · MCP · Evals · Python


Honest outcomes: five visibility states

Extraction F1: 0.985 on a hand-labelled gold set

Cohen's κ: 0.976 judge-vs-human agreement

Tests / coverage: 75 / 98%

CI eval gate: F1 ≥ 0.90, κ ≥ 0.80

01 —

Why

When a buyer asks an AI assistant "what are the best tools for X?", a handful of brands get named and a handful of URLs get cited. If your brand is absent, you are invisible to that buyer — and classic SEO rank does not predict it. This is GEO: Generative Engine Optimization, measuring brand presence inside AI answers rather than blue-link rankings.

Most attempts to measure it hand-wave the hard part. They ask an LLM "is this brand visible?" and trust the answer. But extraction — deciding which brands an answer names, with what sentiment, and whether the brand domain is actually cited — is exactly the step that can be wrong. An unmeasured extractor turns a visibility report into a confident guess.

I wanted the opposite: a tool small enough to run in one command, transparent enough that every metric formula is unit-tested, and honest enough that the extractor is graded against human labels behind a CI gate. If the extractor's scores fall below the gate, the build fails.

The extractor decides which brands an answer names and whether a domain is cited — that step can be wrong, so it is graded against human labels, not assumed correct.

why the eval harness ships in the box

02 —

What

geocheck is an AI-search brand-visibility checker that is both a CLI and an MCP server, shipped with its own reproducible eval harness. It scores a brand across the major answer engines (ChatGPT, Perplexity, Google AI Overview, Gemini, Claude) using a fixed metric vocabulary and a five-state model for where a brand can land in any one answer.

For each answer a brand lands in exactly one of five visibility states. On top of that sit transparent, unit-tested metrics: visibility %, mention rate, citation rate, share of voice, sentiment, average citation position, position score, the mention–citation gap, a consistency signal for run-to-run non-determinism, and a blind-spot rate for when you are invisible while a competitor is present.
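To make a few of these metrics concrete, here is a minimal sketch of how they could be computed from extracted answers. The formulas are assumptions for illustration, not geocheck's actual definitions: share of voice is taken as the brand's mentions over all brand mentions, the mention–citation gap as mention rate minus citation rate, and a blind spot as an answer that names at least one competitor but neither names nor cites the brand. The `Answer` type and `brand_metrics` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    """One engine answer after extraction (hypothetical shape)."""
    mentioned: frozenset[str]              # brands named in the answer text
    cited: frozenset[str] = frozenset()    # brands whose domain is cited


def rate(hits: int, total: int) -> float:
    """Safe ratio: 0.0 when there is nothing to divide by."""
    return hits / total if total else 0.0


def brand_metrics(answers: list[Answer], brand: str) -> dict[str, float]:
    """Assumed formulas; each one is a plain, deterministic ratio."""
    n = len(answers)
    mentions = sum(brand in a.mentioned for a in answers)
    citations = sum(brand in a.cited for a in answers)
    all_mentions = sum(len(a.mentioned) for a in answers)
    # Blind spot: a competitor is named, the brand is neither named nor cited.
    blind = sum(
        bool(a.mentioned) and brand not in a.mentioned and brand not in a.cited
        for a in answers
    )
    return {
        "mention_rate": rate(mentions, n),
        "citation_rate": rate(citations, n),
        "share_of_voice": rate(mentions, all_mentions),
        "mention_citation_gap": rate(mentions, n) - rate(citations, n),
        "blind_spot_rate": rate(blind, n),
    }


# Synthetic example: four answers, two brands.
answers = [
    Answer(frozenset({"acme", "globex"}), frozenset({"acme"})),
    Answer(frozenset({"globex"})),
    Answer(frozenset({"acme"})),
    Answer(frozenset()),
]
m = brand_metrics(answers, "acme")
print(m["mention_rate"], m["citation_rate"], m["blind_spot_rate"])  # 0.5 0.25 0.25
```

Because every metric is a pure function of the extracted answers, each formula can be pinned down with a handful of unit tests on small synthetic inputs like the one above.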

It runs in one command with no API key — the default mock provider replays recorded fixtures, so the demo, the tests, and CI are fully reproducible offline. Point it at a real provider with your own key (and enough runs per prompt) when you want live numbers. The same logic is exposed over the Model Context Protocol so an agent can call it as a tool.
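The offline default can be pictured as a provider that replays recorded answers instead of calling an API. The sketch below is an illustration only: the `Provider` protocol, `MockProvider` class, `sample` helper, and the JSON fixture layout are all hypothetical, not geocheck's actual interface.

```python
import json
from typing import Protocol


class Provider(Protocol):
    """Anything that can answer a prompt for a given run index."""

    def ask(self, prompt: str, run: int) -> str: ...


class MockProvider:
    """Replays recorded answers: no network access, no API key."""

    def __init__(self, fixtures: dict[str, list[str]]) -> None:
        self._fixtures = fixtures

    @classmethod
    def from_json(cls, text: str) -> "MockProvider":
        # Assumed fixture layout: {"<prompt>": ["<answer 1>", "<answer 2>", ...]}
        return cls(json.loads(text))

    def ask(self, prompt: str, run: int) -> str:
        recorded = self._fixtures[prompt]  # KeyError if the prompt was never recorded
        # Cycle through the recordings so any run count is deterministic.
        return recorded[run % len(recorded)]


def sample(provider: Provider, prompt: str, runs: int) -> list[str]:
    """Collect `runs` answers for one prompt from any provider."""
    return [provider.ask(prompt, i) for i in range(runs)]


fixture_text = json.dumps(
    {"best tools for X": ["Acme and Globex are popular.", "Globex is the usual pick."]}
)
provider = MockProvider.from_json(fixture_text)
replayed = sample(provider, "best tools for X", 3)
print(replayed[2])  # Acme and Globex are popular.
```

A live provider would implement the same `ask` method, so the sampling and metric code would not need to know which one it is talking to.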

03 —

How

The extractor is the risky component, so it is measured, not assumed. Its output (named brands, sentiment, and whether the domain is cited) is graded against a hand-labelled gold set on every CI run. The gate is F1 ≥ 0.90 and Cohen’s κ ≥ 0.80; if either score falls below its threshold, the eval command exits non-zero and CI fails. At the time of writing it scores F1 0.985 and κ 0.976.
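The gate itself is simple arithmetic. The sketch below shows one way to compute F1 over extracted items and Cohen's κ over paired labels, then turn the two thresholds into an exit code. The thresholds come from the text; the function names, the choice of (answer, brand) pairs as the unit for F1, and the use of sentiment labels for κ are assumptions for illustration.

```python
from collections import Counter

F1_MIN = 0.90      # thresholds as stated for the CI gate
KAPPA_MIN = 0.80


def f1_score(predicted: set, gold: set) -> float:
    """F1 over extracted items, e.g. (answer_id, brand) pairs."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def gate(f1: float, kappa: float) -> int:
    """Exit code for CI: 0 passes, 1 fails the build."""
    return 0 if f1 >= F1_MIN and kappa >= KAPPA_MIN else 1


# Synthetic example: one wrong brand out of four, one sentiment disagreement.
gold = {("a1", "acme"), ("a1", "globex"), ("a2", "globex"), ("a3", "acme")}
predicted = {("a1", "acme"), ("a1", "globex"), ("a2", "globex"), ("a3", "initech")}
human = ["pos", "pos", "neg", "neg"]
judge = ["pos", "pos", "neg", "pos"]

f1 = f1_score(predicted, gold)       # 0.75
kappa = cohens_kappa(human, judge)   # 0.5
print(gate(f1, kappa))               # 1
```

A CI step could pass the result to `sys.exit`, so a regression in either score stops the build rather than only logging a warning.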

The metric formulas are deterministic and unit-tested rather than left to an LLM, so a share-of-voice or citation-position number is reproducible and auditable. The demo fixtures are synthetic — that is stated plainly — so anyone can verify the maths without spending a cent or leaking a key.

Because AI answers are non-deterministic, geocheck treats a single run as meaningless: it samples each prompt many times and reports the run count alongside a consistency signal, so the numbers carry their own uncertainty instead of pretending to be exact.
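One simple way to express such a consistency signal is the share of runs that agree with the most common outcome for a prompt. This is an assumed definition for illustration, not necessarily the one geocheck uses, and the state labels and function names below are hypothetical.

```python
from collections import Counter


def consistency(states: list[str]) -> float:
    """Fraction of runs that landed in the most common state."""
    if not states:
        raise ValueError("at least one run is required")
    _, count = Counter(states).most_common(1)[0]
    return count / len(states)


def summarize(states: list[str]) -> dict[str, object]:
    """Report the run count next to the signal so the reader sees the sample size."""
    modal_state, _ = Counter(states).most_common(1)[0]
    return {
        "runs": len(states),
        "modal_state": modal_state,
        "consistency": consistency(states),
    }


# Four runs of one prompt; the state names are placeholders.
runs = ["mentioned", "mentioned", "absent", "mentioned"]
report = summarize(runs)
print(report)  # {'runs': 4, 'modal_state': 'mentioned', 'consistency': 0.75}
```

A value of 1.0 means every run agreed; a low value over many runs suggests any single answer for that prompt should not be trusted on its own.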

04 —

Where it stands

geocheck is a real, public, MIT-licensed artifact you can clone and run today — 75 tests at 98% coverage, the eval gate passing, the wheel building with its data files, and the MCP server passing a smoke test before it shipped.

It is deliberately the differentiator in this lineup: it puts the three signals I lead with — AI-visibility / GEO, MCP, and evals — into one runnable thing rather than a slide. The numbers above are the tool’s own measured results against its gold set; the demo data is synthetic by design.

05 —

Stack

Python · MCP · FastMCP · uv · GitHub Actions
