From chaos to clarity: why structured content is the key to responsible AI in scholarly publishing

The past year has seen an explosion of interest in generative AI across scholarly publishing. Editorial teams, product leads, and technology partners are racing to test AI-driven tools for peer review, metadata enrichment, classification, and chatbots. But with this innovation has come fragmentation: disconnected pilots, brittle scripts, and inconsistent results.

In his recent talk at the Society for Scholarly Publishing Annual Meeting in Baltimore, Colin O’Neil of SiteFusion ProConsult captured the current state aptly: “AI is everywhere—and nowhere.” The technology is powerful, but without strategy it risks becoming another layer of chaos. This aligns with our belief that the antidote to AI confusion is structure, and structure begins with content.

Why probabilistic AI isn’t enough

Most generative AI tools—including today’s large language models (LLMs)—are probabilistic. They generate output based on what’s statistically likely, not what’s verified or accurate. That might work for writing marketing copy, but it doesn’t cut it in scholarly publishing, where every assertion needs to be traceable and every decision defensible.

To move from probabilistic guesswork to fact-based generation, publishers need to rethink not just the AI tools they adopt, but the foundation those tools rest on. As Colin’s framework shows, that foundation includes:

  • Structured content (like JATS, DITA, or NISO STS XML)
  • Semantic metadata and taxonomies
  • Governed workflows using process orchestration (BPMN)
  • Knowledge graphs to inform AI retrieval (GraphRAG)
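To make the first ingredient concrete, here is a minimal JATS-style article fragment. The element names follow the JATS vocabulary; the DOI, names, and institutions are invented for illustration:

```xml
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1234/example.0001</article-id>
      <title-group>
        <article-title>Sample Title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
          <aff>Example University</aff>
        </contrib>
      </contrib-group>
      <funding-group>
        <award-group>
          <funding-source>Example Funder</funding-source>
        </award-group>
      </funding-group>
    </article-meta>
  </front>
</article>
```

Every fact an AI might later cite—the author, the affiliation, the funder—is explicitly tagged rather than buried in prose, which is exactly what the downstream steps depend on.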

What is GraphRAG?

GraphRAG—short for Graph-based Retrieval-Augmented Generation—is an emerging method that brings structure and control to generative AI. It builds on traditional RAG techniques, where an AI model retrieves relevant documents before generating an answer, but adds a crucial difference: the retrieval is guided by a knowledge graph rather than a flat vector database.

Here’s what makes GraphRAG so powerful in scholarly publishing:

Semantic Retrieval: Instead of fuzzy matching based on keyword similarity, GraphRAG navigates relationships between entities—such as Author → Institution → Funder or Section → Figure → Caption → Alt Text. These connections are based on structured taxonomies and ontologies.

Deterministic Scope: Retrieval is constrained to trusted, approved nodes. Only the content explicitly modeled in your XML and metadata is considered. That means no hallucinations, no improvisation.

Explainability: Each output can be traced back through a path in the graph, allowing editorial teams to understand and validate what the AI is basing its response on.

By linking structured XML content into a semantically rich graph, GraphRAG ensures that AI doesn’t just “sound smart”—it is smart, in a transparent, auditable way.
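The three properties above can be sketched in a few lines of code. This is a toy illustration, not a real GraphRAG implementation: the node labels, relation names, and the idea of an `approved` flag are assumptions standing in for a proper graph store built from XML and taxonomy data.

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: str
    label: str      # e.g. "Author", "Institution", "Funder"
    text: str
    approved: bool = True  # editorially approved content only

# Toy graph: nodes plus (source, relation, target) edges.
GRAPH = {
    "a1": Node("a1", "Author", "Jane Doe"),
    "i1": Node("i1", "Institution", "Example University"),
    "f1": Node("f1", "Funder", "Example Funder"),
    "d1": Node("d1", "Draft", "Unreviewed note", approved=False),
}
EDGES = [
    ("a1", "affiliated_with", "i1"),
    ("i1", "funded_by", "f1"),
    ("a1", "wrote", "d1"),
]

def retrieve(start: str, max_hops: int = 2):
    """Walk outward from a start node, keeping only approved nodes
    (deterministic scope) and recording the path (explainability)."""
    results, frontier, seen = [], [(start, [start])], {start}
    for _ in range(max_hops):
        next_frontier = []
        for node_id, path in frontier:
            for src, rel, dst in EDGES:
                if src == node_id and dst not in seen:
                    seen.add(dst)
                    new_path = path + [f"--{rel}-->", dst]
                    if GRAPH[dst].approved:  # skip unapproved content
                        results.append((GRAPH[dst], new_path))
                        next_frontier.append((dst, new_path))
        frontier = next_frontier
    return results

for node, path in retrieve("a1"):
    print(node.label, node.text, "via", " ".join(path))
```

Note that the unapproved draft never enters the results, and every retrieved node carries the exact path that justifies it—the two guarantees a flat vector search cannot give you.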

XML: the engine behind deterministic AI

You can’t talk about trustworthy, explainable AI in publishing without talking about XML.

Structured XML content provides the deterministic, semantically rich foundation AI needs. It contains well-defined metadata, semantic tagging, and document hierarchies—all of which are crucial when building a knowledge graph or defining retrieval boundaries for AI.

In short: XML is what turns “guessing AI” into “grounded AI.”

As we outlined in our earlier posts—“Feeding AI Models with Structured Content” and “Understanding Structured Content”—structured content transforms your documents from static files into intelligent data sources. It provides context to the content. AI can’t safely scale in publishing without this layer of clarity and control.
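As a sketch of what "intelligent data source" means in practice, the snippet below turns a JATS-like fragment into graph-ready triples using Python's standard library. The fragment, names, and relation labels are invented for illustration; real pipelines would handle full JATS with namespaces and validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like fragment; element names follow the JATS vocabulary.
JATS = """
<article>
  <front>
    <article-meta>
      <title-group><article-title>Sample Title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname></name>
          <aff>Example University</aff>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""

def extract_triples(xml_text: str):
    """Pull (subject, relation, object) triples out of tagged metadata."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title")
    triples = []
    for contrib in root.iterfind(".//contrib[@contrib-type='author']"):
        surname = contrib.findtext(".//surname")
        aff = contrib.findtext("aff")
        triples.append((surname, "authored", title))
        if aff:
            triples.append((surname, "affiliated_with", aff))
    return triples

print(extract_triples(JATS))
```

Each triple becomes an edge in the knowledge graph; none of this is possible if the same information lives only in an unstructured PDF.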

From fragmented pilots to governed systems

One of the standout insights from SiteFusion’s approach is the use of BPMN-based orchestration. Every AI task—whether summarizing peer review, answering author questions, or classifying content—is embedded into a structured, auditable workflow.

This means:

  • Inputs and outputs are traceable.
  • Human-in-the-loop approval steps are built in.
  • AI doesn’t operate in a black box—it’s a transparent participant in the publishing process.

This level of orchestration only works when the content being processed is structured. If your manuscript files, reviewer notes, or metadata aren’t consistently encoded (e.g., in XML), you’re left patching together APIs and hoping for the best.
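The orchestration idea can be reduced to a small sketch: one AI task wrapped so that its input, output, and a human approval decision are all logged before the result is used. In practice this would be a task inside a BPMN engine; the function and record names below are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # every AI step leaves a traceable record here

def run_governed_step(task_name, payload, ai_fn, approve_fn):
    """Run one AI task with traceable input/output and a mandatory
    human-in-the-loop approval gate before the result is accepted."""
    output = ai_fn(payload)
    record = {
        "task": task_name,
        "input": payload,
        "output": output,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "approved": approve_fn(output),  # human decision, not the model's
    }
    AUDIT_LOG.append(record)
    return output if record["approved"] else None

# Stand-ins: a trivial "summarizer" and an approver that accepts short text.
summary = run_governed_step(
    "summarize_review",
    "Reviewer 2 raised methodological concerns.",
    ai_fn=lambda text: text[:40],
    approve_fn=lambda out: len(out) < 100,
)
print(json.dumps(AUDIT_LOG[-1], indent=2))
```

The point is not the toy summarizer but the wrapper: nothing the model produces reaches the workflow without a logged, reviewable record around it.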

A smarter path forward for scholarly publishers

So, where should publishers start?

  1. Map your editorial and production workflows using process modeling tools.
  2. Structure your content in formats like JATS or NISO STS to expose metadata and semantics.
  3. Model relationships in a knowledge graph, using existing XML and taxonomy data.
  4. Orchestrate AI interactions through workflows that define scope, ownership, and fallback steps.

This doesn’t require rebuilding your entire tech stack. But it does require intent—and structure.

Clarity begins with content

AI isn’t a magic wand. It’s a tool that reflects the quality and structure of the data it’s fed. In scholarly publishing, that means XML.

Colin O’Neil’s message is clear: If you want AI that is scalable, explainable, and editorially trustworthy, you need more than a chatbot. You need a system. And that system begins with structured content.
