Structured data for AI search

Search for "AEO", "GEO" or "LLM optimisation" and you will find a fast-growing industry selling structured-data tricks that promise to get your brand recommended by ChatGPT. Most of it is the optimisation that isn't: ritual markup with no mechanism behind it. The useful core is real, but it is smaller, duller, and mostly the same technical hygiene that has always made a site legible to machines.

This is the version I actually implemented on this site, told honestly. Structured data genuinely helps an AI answer engine understand who you are and what a page is about, which raises the odds it cites you accurately. What it cannot do is make a model prefer you. Separating those two claims is the whole game, because the gap between them is where the hacks live.

The worked example throughout is this site: a small static marketing site on Cloudflare Pages with a handful of Lab articles. The exact stack does not matter. What matters is which structured-data decisions earn their place and which are theatre, so that is what this is built around.

// TERMS, ONCE

AEO / GEO: "answer engine optimisation" and "generative engine optimisation", the marketing labels for getting cited inside AI answers. I use "AI search" for the surface and avoid the acronyms otherwise.

The surfaces: Google's AI Overviews and AI Mode, and the retrieval modes of ChatGPT, Claude, Perplexity and Gemini. They share one habit: they fetch live pages, ground an answer in them, and link the sources.

What AI search actually reads

Before optimising anything, it helps to be precise about how these systems use a page, because the popular advice usually assumes a mechanism that does not exist. An AI answer engine does two separate things, and only one of them touches your structured data.

It retrieves and reads. When a model answers a live query, it fetches a set of candidate pages and grounds its answer in their content. For Google's AI Overviews this is ordinary Googlebot crawling and ordinary ranking; the AI layer sits on top of the same index. For ChatGPT, Claude and Perplexity it is a retrieval bot fetching the page in real time. In both cases the thing being read is mostly your visible copy, with structured data and meta tags as machine-readable hints about entities and page type.

It decides whether to cite. Having read the candidates, the system picks which to quote and link. That decision is driven by relevance and clarity: can it find a clean, specific sentence that answers the question, attributed to a source it can identify and trust? Structured data feeds the "identify" part (who published this, who wrote it, what kind of page it is) but it does not override relevance. No amount of schema makes an off-topic page the answer.

Layer	What it carries	What it does for AI search
Visible copy	headings, paragraphs, lists, tables	the substance that gets quoted and cited
JSON-LD	entities, author, publisher, page type	disambiguation: who and what, not whether
Meta description	one-line summary of the page	a ready-made extract the grounder can lift
Open Graph	share title, image, type	human share previews; little direct AI signal
robots.txt	crawl permissions per bot	whether the retrieval bot is allowed in at all

Read top to bottom, the order is deliberate. The copy does the work, the structured data disambiguates it, and the crawl policy decides whether any of it is reachable. Most "GEO" advice inverts this, treating schema as the lever and the copy as an afterthought. It is the wrong way round.

The optimisation that isn't

It is worth stating the negative case plainly, because a lot of effort gets spent here for nothing. Three of the most-sold ideas do not do what they claim.

llms.txt does not help with Google. The idea of an llms.txt file, a curated map of your content for language models, is appealing and has a tidy spec. Google has said directly that its AI features need no special markup, no llms.txt, and no AI-specific signals; regular crawling is sufficient. I have not seen enough evidence that the major assistants reliably use the file as a visibility input either. I did not ship one, and nothing about the site's AI visibility depends on it. If that changes, it is a five-minute file to add; until then it is a maintenance liability that lies the moment the content drifts from it.

Schema does not buy a recommendation. Marking up a page as a Product with a five-star AggregateRating you invented, or stuffing a FAQPage with questions nobody asked, does not make a model endorse you. It makes your structured data dishonest, which is the one thing search systems are explicitly built to catch and demote. The schema describes the page; it cannot make a claim the page does not support.

There is no keyword density for LLMs. The old habit of repeating a target phrase transfers badly. Retrieval systems work on meaning, not term frequency, and a paragraph written to hit a keyword count reads as exactly that to both a human and a model. The thing that gets quoted is a clear, specific, self-contained sentence, which keyword-padding actively destroys.

Structured data helps a model understand who you are and what a page is. It does not make the model prefer you. Every technique that promises the second thing is selling the first with a markup.

What is left after you strike out the theatre is a short list of things that genuinely help, and they are the subject of the rest of this article: a coherent entity graph, honest page-type schema, copy a retrieval system can lift cleanly, and a crawl policy that lets the right bots in. None of it is exotic. All of it is the kind of work that also just makes the site correct.

The entity graph: one id per thing

This is the single highest-leverage structured-data decision, and it is the one most sites get wrong. The goal is for a machine to resolve every reference to you (the author, the company) to one stable identity, rather than guessing whether the "Chris Hay" on this article is the same "Chris Hay" named as the company founder on the homepage. You do that with the schema.org @id mechanism: define each real-world entity once, give it a stable @id URL, and reference that @id everywhere else instead of repeating the data.

The failure mode is concrete. A company that writes its name three slightly different ways across a marketing subdomain, a careers site, and a docs portal, each with its own unlinked author and publisher blocks, hands a search engine three weak half-entities to reconcile instead of one strong one. The signal that should have compounded onto a single identity gets split three ways, and none of the three clears the bar to be recognised. Single @id resolution is how you avoid fragmenting your own authority.

On this site there are two anchor entities. The author Person is defined once, on /about/, carrying the full detail: job title, the topics he knows about, awards, image, and the sameAs links to external profiles that let Google reconcile the entity against its Knowledge Graph. The Organization is defined once on the homepage, with its legal identifier, founder, and service catalogue. Every other page, including all twelve Lab articles, references those two by @id rather than restating them.

// json-ld · every article references the entities, never redefines them

{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "author":    { "@id": "https://performify.co.uk/about/#chris-hay" },
  "publisher": { "@id": "https://performify.co.uk/#organization" }
}

// ...resolving to one Person node, defined once on /about/
{
  "@context": "https://schema.org",
  "@type": "Person",
  "@id": "https://performify.co.uk/about/#chris-hay",
  "name": "Chris Hay",
  "jobTitle": "Founder",
  "knowsAbout": ["Attribution", "Paid media", "Marketing data engineering"],
  "sameAs": ["https://www.linkedin.com/in/haychris"]
}

The payoff is that author authority compounds onto one node. Every article that references the same Person @id adds to a single entity's track record on its declared topics, rather than scattering thin signal across a dozen lookalike author strings. The same hinge points back from the homepage: the Organization's founder field references the same Person @id, so the company and the author are explicitly the same story to a machine.

Define each real entity once, reference it by @id everywhere else. Author and publisher signal then compounds onto one node instead of scattering across lookalike copies.
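
The rule lends itself to a mechanical check. What follows is a minimal sketch in Python, assuming the JSON-LD from each page has already been parsed into dicts. It treats a node with an @id plus other properties as a definition and a node with an @id and nothing else as a reference, then reports any reference that nothing defines. The function names are illustrative rather than part of any library, and the sample data is the two nodes from the example above.

```python
from typing import Any, Iterator


def iter_nodes(value: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside a parsed JSON-LD value."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def is_reference(node: dict) -> bool:
    """A pure reference carries an @id and no other keys."""
    return set(node) == {"@id"}


def unresolved_references(documents: list[dict]) -> set[str]:
    """Return every referenced @id that none of the documents defines."""
    defined: set[str] = set()
    referenced: set[str] = set()
    for doc in documents:
        for node in iter_nodes(doc):
            if "@id" not in node:
                continue
            if is_reference(node):
                referenced.add(node["@id"])
            else:
                defined.add(node["@id"])
    return referenced - defined


# Sample data copied from the article's own example.
article = {
    "@type": "TechArticle",
    "author": {"@id": "https://performify.co.uk/about/#chris-hay"},
    "publisher": {"@id": "https://performify.co.uk/#organization"},
}
person = {
    "@type": "Person",
    "@id": "https://performify.co.uk/about/#chris-hay",
    "name": "Chris Hay",
    "jobTitle": "Founder",
    "knowsAbout": ["Attribution", "Paid media", "Marketing data engineering"],
    "sameAs": ["https://www.linkedin.com/in/haychris"],
}

# The Organization node lives on the homepage and is not loaded here,
# so the publisher reference is the one reported as unresolved.
missing = unresolved_references([article, person])
print(missing)
```

Run across every page of a site, an empty result means each author and publisher reference lands on a node that is actually defined somewhere. The sketch does not catch the opposite failure, where the same entity is defined in full on several pages; that would need a count of definitions per @id.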

Article schema that earns its place

Below the entity layer, each article carries a small, honest set of structured data describing the page itself. The job is disambiguation, not decoration: tell a machine what kind of page this is, who wrote it, when, and where it sits in the site. Three node types do all of it.

An Article (or TechArticle) node with a headline that matches the visible H1, a description that matches the meta description, the author and publisher @id references, and datePublished / dateModified. Use TechArticle over Article only where the content is genuinely technical.
A BreadcrumbList mirroring the visible breadcrumb, so the page's place in the hierarchy is explicit rather than inferred from the URL.
A WebSite reference via isPartOf, tying the page back to the one site entity.
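
As a rough illustration of how small that set is, here is a sketch in Python that assembles the three pieces into one JSON-LD graph. The author and publisher @id values come from the example earlier in the article. The #website, #article and #breadcrumb fragments, the /lab/example/ path, the description and the dates are placeholders assumed for illustration, not the site's real values.

```python
import json

SITE = "https://performify.co.uk"  # domain taken from the article's examples


def article_graph(path, headline, description, published, modified, crumbs):
    """Build an Article node and a BreadcrumbList node for one page.

    `crumbs` is a list of (name, path) pairs in display order.
    The WebSite is only referenced via isPartOf, never redefined here.
    """
    page_url = f"{SITE}{path}"
    return {
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "Article",
                "@id": f"{page_url}#article",  # assumed fragment
                "headline": headline,
                "description": description,
                "datePublished": published,
                "dateModified": modified,
                "author": {"@id": f"{SITE}/about/#chris-hay"},
                "publisher": {"@id": f"{SITE}/#organization"},
                "isPartOf": {"@id": f"{SITE}/#website"},  # assumed fragment
            },
            {
                "@type": "BreadcrumbList",
                "@id": f"{page_url}#breadcrumb",  # assumed fragment
                "itemListElement": [
                    {
                        "@type": "ListItem",
                        "position": position,
                        "name": name,
                        "item": f"{SITE}{crumb_path}",
                    }
                    for position, (name, crumb_path) in enumerate(crumbs, start=1)
                ],
            },
        ],
    }


# Placeholder values throughout; only the shape matters.
example = article_graph(
    path="/lab/example/",
    headline="Structured data for AI search",
    description="A placeholder description that would match the meta description.",
    published="2025-01-01",
    modified="2025-01-02",
    crumbs=[
        ("Home", "/"),
        ("Lab", "/lab/"),
        ("Structured data for AI search", "/lab/example/"),
    ],
)
print(json.dumps(example, indent=2))
```

Generating the block from the same values that render the page is one way to keep the two from drifting: the headline and breadcrumb leaf are passed in once and cannot disagree.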

The discipline that makes this work is consistency. The headline in the JSON-LD is the same string as the visible H1 and the breadcrumb leaf; the structured description is the same string as the meta description; the dates in the JSON-LD match the visible "published" and "last updated" stamps. When the machine-readable claims and the human-readable page agree exactly, a grounder has no reason to distrust either. When they drift, you have taught it that your structured data lies, which is worse than having none. A validator pass (Google's Rich Results Test, or the schema.org validator) before publish is the cheap insurance.
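
The string-equality part of that discipline is easy to automate. Below is a minimal sketch using only the Python standard library: it pulls the visible H1, the meta description and any JSON-LD blocks out of a page and reports where they disagree. It assumes each JSON-LD script holds a single top-level object rather than an @graph array, and it does not check dates or breadcrumbs. The class and function names, the sample HTML and its description text are all hypothetical.

```python
import json
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Collect the visible H1, the meta description and any JSON-LD blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.h1 = ""
        self.meta_description = ""
        self.json_ld: list = []
        self._capture = None  # "h1" or "json_ld" while inside that element
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._capture, self._buffer = "h1", []
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._capture, self._buffer = "json_ld", []
        elif tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content") or ""

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._capture == "h1":
            self.h1 = "".join(self._buffer).strip()
            self._capture = None
        elif tag == "script" and self._capture == "json_ld":
            self.json_ld.append(json.loads("".join(self._buffer)))
            self._capture = None


def find_drift(html: str) -> list[str]:
    """Return a message for each JSON-LD string that differs from the page."""
    facts = PageFacts()
    facts.feed(html)
    facts.close()
    problems = []
    for node in facts.json_ld:
        if "headline" in node and node["headline"] != facts.h1:
            problems.append(f"headline {node['headline']!r} != H1 {facts.h1!r}")
        if "description" in node and node["description"] != facts.meta_description:
            problems.append("JSON-LD description differs from meta description")
    return problems


# Hypothetical page: the description text is invented for the example.
PAGE = """<html><head>
<title>Structured data for AI search</title>
<meta name="description" content="Which structured-data decisions earn their place.">
<script type="application/ld+json">
{"@type": "Article",
 "headline": "Structured data for AI search",
 "description": "Which structured-data decisions earn their place."}
</script>
</head><body><h1>Structured data for AI search</h1></body></html>"""

# Simulate someone editing the visible H1 without touching the JSON-LD.
DRIFTED = PAGE.replace(
    "<h1>Structured data for AI search</h1>",
    "<h1>Structured data for AI search, revised</h1>",
)

print(find_drift(PAGE))
print(find_drift(DRIFTED))
```

A check like this complements a validator rather than replacing it: the validator confirms the markup parses and uses real types, while this confirms it still says the same thing as the page.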

This is the same baseline the Lighthouse SEO category rewards, for the same reason: a clean title, a valid canonical, structured data that parses. Get the publishing checklist right and the AI-legibility comes mostly for free, because both audiences want the same thing, which is an unambiguous page.

HowTo, and when not to use it

A HowTo node exposes each step of a procedure as an independently extractable unit, which is genuinely useful for a grounder answering a "how do I" question: it can lift one clean step rather than parse the whole article. On this site the procedural Methods carry it, and the steps in the JSON-LD map one-to-one to the H2 anchors in the body.
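
The one-to-one mapping between steps and anchors is simple to verify. A sketch in Python, assuming each HowToStep carries a url whose fragment points at the matching H2 id, and that the list of H2 ids has already been extracted from the page in document order. The example URLs, step names and ids are hypothetical.

```python
from urllib.parse import urlsplit


def step_anchor_mismatches(howto: dict, h2_ids: list[str]) -> list[str]:
    """Compare HowTo step URL fragments against the page's H2 ids, in order."""
    step_ids = [
        urlsplit(step.get("url", "")).fragment for step in howto.get("step", [])
    ]
    problems = []
    if len(step_ids) != len(h2_ids):
        problems.append(
            f"{len(step_ids)} steps in JSON-LD but {len(h2_ids)} H2 anchors in the body"
        )
    # zip stops at the shorter list; the length check above covers the rest.
    for position, (step_id, h2_id) in enumerate(zip(step_ids, h2_ids), start=1):
        if step_id != h2_id:
            problems.append(f"step {position} points at #{step_id}, body has #{h2_id}")
    return problems


# Hypothetical procedural article with three steps.
HOWTO = {
    "@type": "HowTo",
    "name": "Example method",
    "step": [
        {"@type": "HowToStep", "name": "Export", "url": "https://example.com/lab/method/#export"},
        {"@type": "HowToStep", "name": "Clean", "url": "https://example.com/lab/method/#clean"},
        {"@type": "HowToStep", "name": "Load", "url": "https://example.com/lab/method/#load"},
    ],
}

print(step_anchor_mismatches(HOWTO, ["export", "clean", "load"]))
print(step_anchor_mismatches(HOWTO, ["export", "load", "clean"]))
```

If a page cannot pass this check because it has no real steps to point at, that is usually the signal that HowTo is the wrong type for it.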

The important half of the rule is the restraint. HowTo belongs only on articles that are actually a procedure. Bolting it onto a conceptual or opinion piece to look more "extractable" is exactly the kind of AEO trick that backfires: the steps do not correspond to anything real, the markup contradicts the page, and you have spent effort making your structured data less trustworthy. I left it off the conceptual articles deliberately, and, to be consistent with my own argument, I left it off this one. This article is a position, not a recipe, so it carries Article, WebSite and BreadcrumbList and nothing more.

// DON'T

Don't add HowTo, FAQPage, or AggregateRating to a page just because they are "rich" types. Schema that does not describe the page is the one structured-data move that can actively hurt you, because matching content to markup is exactly what search systems verify. Match the type to the page or leave it off.

Copy a retrieval system can lift

This is where most of the real "AI optimisation" actually lives, and it is a writing job, not a markup one. A retrieval system cites the source it can quote most cleanly, so the practical target is to make sure every page has sentences that stand on their own: specific, self-contained, and answering a real question without needing the paragraph around them for context.

One of the highest-value summaries on the page for this purpose is the meta description, because it is a summary you wrote yourself, sitting in a slot the grounder already reads. The temptation is to spend it on brand throat-clearing. The better use is to lead with the concrete operator outcome the page delivers, so a system extracting a one-line answer gets something worth quoting. The same principle applies to the first sentence under each heading and to the card deks on the index: lead with the specific thing, name real numbers, tools and failure modes, and skip the wind-up.

Front-load the answer. Put the conclusion in the first sentence of a section, then support it. Grounders quote openings.
Be specific over fluent. "Cut cost per customer 4.4x in 30 days" is quotable; "drove significant efficiency improvements" is not.
Make sentences self-contained. A sentence that needs the previous three to make sense cannot be lifted as a citation.
Say the thing you are not. Negative-space framing ("not another static report") is a strong relevance signal precisely because it is hard to fake.

None of this is a trick for machines. It is just clear writing, which is why it ages well: the same sentence that an AI grounder finds easy to quote is the one a human skim-reader finds easy to understand.

Internal linking, by topic cluster

The entity graph links your identities; internal links link your ideas. Grouping related articles into topic clusters and linking within them does the on-page-content version of the same job: it tells a reader, and a retrieval system, that this is a coherent body of work on a subject rather than a scatter of unrelated posts. A page that sits inside a well-linked cluster is easier to reach, easier to place in context, and easier to read as one credible source among several on the same topic.

On this site the mechanics are deliberately mundane. Every article carries a few hand-picked related links chosen by thematic adjacency, and I keep a simple map of which cluster each piece belongs to, so a new article gets woven into the right neighbours rather than bolted on. Every new piece earns inbound links from its cluster, no article is left orphaned, and no single popular piece is allowed to absorb every link. It is the same discipline as the entity graph, applied to content instead of identity: connect the things that genuinely belong together, once, on purpose.

Group related articles into clusters and link within them. Bridge clusters only where ideas genuinely connect, and leave nothing orphaned. This is the content-layer counterpart of the entity graph above.
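
The "nothing orphaned" rule is the part that can be checked by machine. A minimal sketch in Python, assuming the related-links map is kept as a dict from each article slug to the slugs it links to. The slugs are invented for the example, and self-links and links to unknown pages are ignored.

```python
def inbound_counts(links: dict[str, list[str]]) -> dict[str, int]:
    """Count how many distinct other articles link to each article."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):  # a repeated link counts once
            if target != source and target in counts:
                counts[target] += 1
    return counts


def orphans(links: dict[str, list[str]]) -> list[str]:
    """Return the articles that no other article links to."""
    return sorted(page for page, n in inbound_counts(links).items() if n == 0)


# Hypothetical link map: every piece links out, but one has no inbound link.
LINKS = {
    "entity-graph": ["article-schema", "crawl-access"],
    "article-schema": ["entity-graph"],
    "crawl-access": ["entity-graph"],
    "share-cards": ["entity-graph"],
}

print(inbound_counts(LINKS))
print(orphans(LINKS))
```

The same counts also show when one popular piece is absorbing most of the links, which here would be the article with three inbound links against one or none for its neighbours.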

Be wary of the strong version of this claim. "Topical authority" gets sold as a dial you can turn by publishing a ring of thin pages that cross-link each other, which is just the keyword-padding trick from earlier wearing a different costume. The durable benefit is simpler and quieter: a real, well-linked cluster helps a human navigate and helps a grounder see depth where depth exists. It is worth doing on its own terms, not as a ranking cheat code.

Crawl access: let the bots in

All the structured data in the world is inert if the retrieval bot is blocked at the door, and this is a setting many sites get wrong by default. A lot of platforms, including Cloudflare at the zone level, now offer a one-click "block AI bots" control, and plenty of sites have it on without having made the decision deliberately. It is worth making on purpose, because there are two different kinds of AI bot and they do not deserve the same answer.

Retrieval bots (OAI-SearchBot, PerplexityBot and similar) fetch a page to ground a live answer and link back to it. For anyone whose credibility surface is being cited as the expert source, this is a high-quality awareness channel reaching exactly the right reader. Google is the exception worth stating precisely: its AI Overviews and AI Mode run on ordinary Googlebot crawling and normal Search visibility, not a separate AI bot. Google-Extended is a distinct control for Gemini training and grounding in some other Google products, not the switch for AI Overviews, so blocking it does not change whether you appear there.
Training bots (GPTBot, ClaudeBot, CCBot) ingest content into model corpora with no referral. The concern there is real, but it mostly applies to high-volume publishers whose content is the product, not to a consultancy whose articles are business-development artefacts already.

A blanket block does not distinguish the two; it opts you out of the valuable retrieval citations to avoid the lower-value training ingestion. On this site I turned the blanket block off and allowed both, under a plain wildcard rule in robots.txt that disallows only private paths. There is no llms.txt and no AI-specific markup, because, as above, the surfaces that matter do not read them. The decision is one toggle to reverse if the calculus ever changes, but for a practice that wants to be the cited source, being absent from those answers is the more expensive option.

// CHECK YOUR ROBOTS

Fetch your own robots.txt in production (the edge may inject rules your repo file does not show). If it carries Disallow: / blocks for GPTBot, PerplexityBot and friends, that is a decision someone should own, not a default to inherit.
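
One way to run that check is with Python's standard-library robots.txt parser. The sketch below parses a robots.txt body from a string so it runs offline; the rules in it are a hypothetical example of an injected block, not this site's file, and the /private/ and /lab/ paths are placeholders. Against a live site, RobotFileParser can fetch the file itself via set_url() and read().

```python
from urllib.robotparser import RobotFileParser


def bot_access(robots_txt: str, bots: list[str], url: str) -> dict[str, bool]:
    """Report whether each named bot may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Hypothetical production file: a wildcard rule that hides only private
# paths, plus a blanket block for one named bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

BOTS = ["OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot", "CCBot"]

access = bot_access(ROBOTS_TXT, BOTS, "https://example.com/lab/")
for bot, allowed in access.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Python's parser is one interpretation of the robots.txt rules and individual crawlers may resolve edge cases differently, so treat the result as a prompt to read the file, not as proof of what a given bot will do.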

Open Graph: the share layer

Open Graph and Twitter Card tags belong in this conversation mostly so they can be placed correctly: they are the human share-preview layer, not an AI-ranking lever. When someone drops your link into LinkedIn, Slack or a messaging app, OG is what renders the card. That matters for the click-through that puts a human on the page, which is a different and still valuable funnel from being quoted inside an answer. Treat it as the social presentation layer and do not expect schema-grade AI signal from it.

The discipline that keeps it from quietly breaking is consistency, the same theme as the schema layer. Three rules cover almost all of it: the OG title equals the Twitter title equals the page <title>; the OG description equals the Twitter description equals the meta description; and the brand name is presented identically everywhere. On this site the share cards themselves are generated from a single template so every article's card is visually consistent and none is hand-built, which removes a whole class of "the card looks wrong" bugs.

The one genuinely sharp edge is caching. Share-card images are served with long, immutable cache lifetimes on a stable filename, and social scrapers cache aggressively on top of that. Regenerating a card over the same filename does not propagate; the old image can persist on LinkedIn and WhatsApp for a long time. The fix is to bump a version query on the image URL when you change it and re-scrape through the platform's sharing debugger. It is a small thing that looks like a bug for weeks if you skip it.
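
The version bump itself is a one-liner in most build scripts. A sketch in Python, assuming the version lives in a query parameter named v; that parameter name and the example URL are assumptions for illustration, and any parameter works so long as the URL changes whenever the image does.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def bump_version(image_url: str, param: str = "v") -> str:
    """Return the URL with its numeric version parameter incremented.

    A missing or non-numeric version is reset to 1. Other query
    parameters are kept (repeated keys are collapsed to their last value).
    """
    parts = urlsplit(image_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    current = query.get(param, "0")
    query[param] = str(int(current) + 1) if current.isdigit() else "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


print(bump_version("https://example.com/og/card.png"))
print(bump_version("https://example.com/og/card.png?v=3"))
```

The file on disk keeps its stable name and its long cache lifetime; only the URL in the og:image tag changes, which is what makes scrapers treat it as a new resource. The re-scrape through each platform's debugger is still a manual step.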

What I deliberately didn't do

Knowing what to leave out is most of the honesty in this topic, because the pressure is all in the other direction. Here is what I skipped, and why none of it cost anything.

No llms.txt. Google does not read it, no assistant commits to it, and a content map that drifts from the content is a liability. If a major surface starts honouring it, it is a five-minute addition.
No invented FAQPage or AggregateRating. Rich types only where the page genuinely is that type. Fabricated ratings and questions are the fastest way to mark your own structured data as untrustworthy.
No HowTo on conceptual pages. Including this one. The markup has to describe a real procedure or it is noise that contradicts the page.
No AI-specific copy or cloaking. No separate "for the LLM" text, no hidden summaries. One page, the same for the human and the machine; the clarity that helps one helps the other.
No keyword-tuned padding. Sentences are written to be specific and quotable, not to hit a phrase count.

// THE WHOLE METHOD AS A CHECKLIST

Everything above, as a pass to run before publishing a page:

One stable Person @id and one stable Organization @id, referenced everywhere rather than restated.
Article / TechArticle only where it fits the page; HowTo only for a real procedure.
JSON-LD headline matches the visible H1; structured description matches the meta description.
Meta description leads with a concrete, quotable outcome, not brand throat-clearing.
The page sits in a topic cluster with real inbound and outbound links, not orphaned.
robots.txt checked in production; retrieval bots allowed in.
No invented FAQPage, AggregateRating, fake HowTo, or llms.txt.

The honest summary is that "structured data for AI search" is mostly a discipline problem wearing a novel name. Define your entities once and reference them by @id; describe each page with schema that matches what is actually on it; write copy a retrieval system can quote cleanly; and let the right bots crawl it. Do that and you are legible to AI answer engines for the same reason you are legible to a careful human reader. Everything past that line, the files Google ignores, the rich types you do not qualify for, the markup that promises a recommendation, is the optimisation that isn't. The most useful thing I did was decline to ship it.
