We built a document platform where an AI assistant designs marketing documents (flyers, brochures, one-pagers) inside brand-approved rails, and humans finish them by clicking into the rendered page and typing. Getting the agent to author templates, the structural layouts those documents are built from, turned out to hinge on a single unfashionable decision:
Our template format is HTML. And the agent's main editing tool is "rewrite the whole thing."
That inverts most of the current advice about building agents on structured data, so this post is about why we did it, what it bought us, and the scars we picked up along the way.
The Problem
A document template in our system is a tree: a document contains pages, pages contain blocks, blocks contain text atoms, image atoms, styled containers, and slots that reference reusable widgets. Every node carries attributes: CSS classes, length budgets for copy, type-scale choices, slot constraints. Themes paint the tree through CSS variables, so a template never hardcodes a color; it says bg-primary and the active brand theme decides what that means.
We wanted an assistant that could build and restructure these trees conversationally: "Design a full-width widget with a rounded content box on the left and a stat panel on the right." And we wanted its output to land in the same editor, with the same undo semantics and the same validation, as a human's edits.
The Obvious Design, and Why We Didn't Ship It
The textbook approach is to store the tree as JSON and give the model a granular tool API:
insertNode(parentId, type, index)
setAttribute(nodeId, key, value)
moveNode(nodeId, newParentId, index)
removeNode(nodeId)
We've built agents like this. They work, but they under-perform in three predictable ways:
- You're teaching a bespoke schema from scratch. Every node type, every attribute, every containment rule has to be spelled out in the prompt, and the model's only fluency is whatever your prompt bought. It has seen your JSON schema zero times in training.
- Granular tools invite granular failure. Building a twelve-node layout takes a dozen round trips. Each call can reference a stale id, a wrong parent, an index that shifted two calls ago. The tree passes through eleven intermediate states, each a chance to strand the agent somewhere invalid, and each a state your renderer might have to survive.
- The model can't "see" its work. With mutation-by-tool-call, the model's picture of the current tree is a mental reconstruction from its own call history. Drift is inevitable.
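The "index that shifted two calls ago" failure is easy to reproduce in miniature. The sketch below is illustrative only: `insertAt` is a hypothetical stand-in for `insertNode` reduced to a single parent's child list, and the child names are made up.

```typescript
// Hypothetical flat child list standing in for one parent's children.
type Child = string;

// Granular insert in the style of insertNode(parentId, type, index),
// reduced to a single parent for illustration. Returns a new list.
function insertAt(children: Child[], index: number, child: Child): Child[] {
  const next = children.slice();
  next.splice(index, 0, child);
  return next;
}

const original: Child[] = ["heading", "body", "image"];

// The agent plans both calls against the ORIGINAL list:
// "stat" goes right after "heading" (index 1), "footer" goes at the end (index 3).
const afterFirst = insertAt(original, 1, "stat");
const afterSecond = insertAt(afterFirst, 3, "footer");

// Index 3 was "the end" before the first insert; now it lands before "image".
// afterSecond is ["heading", "stat", "body", "footer", "image"].

// A whole-list rewrite states the intended result directly, with no indices to go stale.
const rewritten: Child[] = ["heading", "stat", "body", "image", "footer"];
```

Both calls succeed and neither reports an error, which is what makes this class of bug hard for an agent to notice from its own call history.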
The Inversion
Our templates serialize to plain HTML with a small attribute grammar:
<div data-type="block" data-name="feature-callout" id="wgt-root">
<div data-type="container-atom" data-name="content-box" id="wgt-content"
data-container-classes="rounded-2xl flex-[2] bg-primary p-6">
<div data-type="text-atom" data-name="heading" id="wgt-heading"
data-text-element-tag="h2" data-text-style="heading-2"
data-max-length="60"></div>
</div>
</div>
Every node is an element. data-type names the node kind, data-name is a stable human label, data-* carries attributes, and a parser converts this to and from the internal AST. This predates our agent work; it existed so templates could round-trip through a human-editable markup view.
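To make the mapping concrete, here is a minimal sketch of the AST-to-markup direction. The `TemplateNode` shape and the `toMarkup` function are assumptions for illustration; the real internal AST and builder are not shown in this post and are likely richer.

```typescript
// Hypothetical AST node shape, assumed for this sketch.
interface TemplateNode {
  type: string;                          // serialized as data-type
  name: string;                          // serialized as data-name
  id: string;                            // serialized as id
  attributes: { [key: string]: string }; // each key serialized as data-<key>
  children: TemplateNode[];
}

// Escape the characters that would break a double-quoted attribute value.
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// Serialize a node and its children as nested <div> elements.
function toMarkup(node: TemplateNode, depth: number = 0): string {
  const indent = new Array(depth + 1).join("  ");
  const attrs: string[] = [
    `data-type="${escapeAttr(node.type)}"`,
    `data-name="${escapeAttr(node.name)}"`,
    `id="${escapeAttr(node.id)}"`,
  ];
  for (const key of Object.keys(node.attributes)) {
    attrs.push(`data-${key}="${escapeAttr(node.attributes[key])}"`);
  }
  const open = `${indent}<div ${attrs.join(" ")}>`;
  if (node.children.length === 0) return `${open}</div>`;
  const inner = node.children.map((child) => toMarkup(child, depth + 1)).join("\n");
  return `${open}\n${inner}\n${indent}</div>`;
}

const heading: TemplateNode = {
  type: "text-atom",
  name: "heading",
  id: "wgt-heading",
  attributes: { "text-element-tag": "h2", "max-length": "60" },
  children: [],
};

const root: TemplateNode = {
  type: "block",
  name: "feature-callout",
  id: "wgt-root",
  attributes: {},
  children: [heading],
};

const markup: string = toMarkup(root);
```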
When we built the template-authoring agent, we made its primary tool embarrassingly blunt:
set_template_markup(markup: string, summary: string)
// "Replace the ENTIRE template with new markup."
That's it. The model reads the current markup (sent fresh with every request, so it always edits what the user actually sees), writes the complete new tree, and the client validates and applies it in one shot.
The result surprised us with how little prompting it needed. LLMs have deep, pre-trained fluency in HTML: nesting, attributes, class strings, ids. We didn't teach a format; we borrowed one the model already speaks natively. The prompt spends its budget on our semantics (what a widget-slot means, which theme tokens exist, how length budgets work) instead of on syntax. A grammar digest generated from the same config file the visual editor uses keeps the two from drifting.
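A grammar digest of this kind can be a few lines of string building over the shared config. The sketch below assumes a hypothetical config shape (`NodeKindConfig`) and hypothetical containment rules; only the node kinds and attribute names are taken from the markup example earlier in the post.

```typescript
// Hypothetical per-node-kind config, assumed to be shared with the visual editor.
interface NodeKindConfig {
  description: string;
  attributes: string[];      // data-* attributes this kind accepts
  allowedChildren: string[]; // node kinds that may nest inside it
}

// The containment rules below are illustrative guesses, not the real grammar.
const nodeConfig: { [kind: string]: NodeKindConfig } = {
  "block": {
    description: "root of a widget",
    attributes: [],
    allowedChildren: ["container-atom", "text-atom"],
  },
  "container-atom": {
    description: "styled container",
    attributes: ["data-container-classes"],
    allowedChildren: ["container-atom", "text-atom"],
  },
  "text-atom": {
    description: "editable text",
    attributes: ["data-text-element-tag", "data-text-style", "data-max-length"],
    allowedChildren: [],
  },
};

// Render one prompt line per node kind so the prompt cannot drift from the config.
function grammarDigest(config: { [kind: string]: NodeKindConfig }): string {
  const lines: string[] = [];
  for (const kind of Object.keys(config)) {
    const entry = config[kind];
    const attrs = entry.attributes.length > 0 ? entry.attributes.join(", ") : "(none)";
    const kids = entry.allowedChildren.length > 0 ? entry.allowedChildren.join(", ") : "(none)";
    lines.push(`- ${kind}: ${entry.description} | attributes: ${attrs} | children: ${kids}`);
  }
  return lines.join("\n");
}

const digest: string = grammarDigest(nodeConfig);
```

Because the digest is derived rather than hand-written, adding an attribute to the config updates the editor and the prompt in the same change.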
And because each edit is a whole tree, there are no intermediate states. The edit is coherent or it's rejected, one parse at the boundary:
const parsed = parseMarkupToAst(markup);
if (parsed?.type !== "block") reject("widgets need a single block root");
editor.markupCurrent = format(build(parsed)); // same funnel as human edits
Every agent edit flows through the exact pipeline every human edit uses. Same validation, same undo, same reactive preview. The agent isn't a privileged actor with its own write path; it's just another author.
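The "coherent or rejected" boundary can be sketched as a pure function over editor state. Everything here is an assumption for illustration: the `AstNode` and `EditorState` shapes, the function names, and the duplicate-id check, which is not described in this post. Only the block-root rule comes from the snippet above.

```typescript
// Hypothetical minimal AST and editor state, assumed for this sketch.
interface AstNode {
  type: string;
  id: string;
  children: AstNode[];
}

interface EditorState {
  tree: AstNode;
  lastErrors: string[];
}

// Collect every problem with the proposed tree; an empty list means it is valid.
function validateWholeTree(parsed: AstNode | null): string[] {
  if (parsed === null) return ["markup did not parse"];
  const errors: string[] = [];
  if (parsed.type !== "block") errors.push("widgets need a single block root");
  const seen: string[] = [];
  const stack: AstNode[] = [parsed];
  while (stack.length > 0) {
    const node = stack.pop() as AstNode;
    // Duplicate-id detection is an assumed extra rule, added here as an example.
    if (seen.indexOf(node.id) >= 0) errors.push(`duplicate id: ${node.id}`);
    seen.push(node.id);
    for (const child of node.children) stack.push(child);
  }
  return errors;
}

// All or nothing: a rejected edit leaves the current tree exactly as it was.
function applyAgentEdit(state: EditorState, parsed: AstNode | null): EditorState {
  const errors = validateWholeTree(parsed);
  if (errors.length > 0 || parsed === null) {
    return { tree: state.tree, lastErrors: errors };
  }
  return { tree: parsed, lastErrors: [] };
}
```

Returning the errors instead of throwing gives the agent something specific to read before it retries.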
The Escape Hatch
Full-tree rewrites are wasteful for one-attribute tweaks, and rewriting sixty nodes to change one class invites transcription drift in the other fifty-nine. So there's exactly one surgical tool:
set_node_attributes(nodeId: string, attributes: Record<string, value>)
Two tools total for structural editing. In practice the model picks correctly without guidance: big changes get a rewrite, tweaks get a patch. Compare that with the tool-catalog approach, where the model must choose among a dozen mutations and compose them correctly.
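On the client side, the surgical tool reduces to "find the node by id, merge the attributes, fail loudly if the id is unknown." A minimal sketch, assuming a hypothetical `AstNode` shape with string-valued attributes and an immutable update style; the real handler is not shown in this post.

```typescript
// Hypothetical AST node shape, assumed for this sketch.
interface AstNode {
  id: string;
  attributes: { [key: string]: string };
  children: AstNode[];
}

// Return a new tree in which the node with `nodeId` has `attributes` merged in.
// The input tree is left untouched so undo can keep the previous version.
function setNodeAttributes(
  root: AstNode,
  nodeId: string,
  attributes: { [key: string]: string }
): AstNode {
  let found = false;
  function visit(node: AstNode): AstNode {
    if (node.id === nodeId) {
      found = true;
      return { ...node, attributes: { ...node.attributes, ...attributes } };
    }
    return { ...node, children: node.children.map(visit) };
  }
  const next = visit(root);
  if (!found) throw new Error(`No node with id "${nodeId}"`);
  return next;
}

const tree: AstNode = {
  id: "wgt-root",
  attributes: {},
  children: [
    {
      id: "wgt-content",
      attributes: { "container-classes": "rounded-2xl flex-[2] bg-primary p-6" },
      children: [],
    },
  ],
};

// One-class tweak: no other node is re-transcribed.
const patched: AstNode = setNodeAttributes(tree, "wgt-content", {
  "container-classes": "rounded-xl flex-[2] bg-primary p-6",
});
```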
The same grammar then paid a second dividend: when we later let the document assistant mint brand-new widgets mid-conversation, it authored them in the same markup, validated by the same parser, with no new format to teach.
The Scar Tissue
This is the part most posts skip. Three real bugs, all instructive:
1. Ids are sacred, and our formatter wasn't treating them that way. Content in our system is matched to template nodes by id, and the agent references nodes by the ids it authored ("wgt-wrapper"). Our HTML formatter had an old rule: regenerate any id shorter than 21 characters (the length of our generated nanoids). So the agent's readable ids were silently rewritten to random strings on the first format pass, and its follow-up set_node_attributes("wgt-wrapper", …) failed with "No node with id." A user hit this in the first real session. The fix was one line (only generate an id when one is missing), but the lesson generalizes: if agents reference identifiers, every pass in your pipeline must preserve them byte-for-byte. Normalization steps written for humans will betray you.
2. HTML attributes are stringly typed; your AST probably isn't. Our parser coerces "true" → true and "1.5" → 1.5 on read. Useful for humans, hazardous in general: a version string like "1.5" becomes the number 1.5, and a text attribute that happens to look numeric gets type-bent on every round trip. We maintain an opt-out list of string-only attributes, which means every new attribute is a latent bug until someone remembers the list. If we started over, coercion would be per-attribute and declared in the node config, not inferred.
3. Markup is the interchange format, not the storage format. Our blobs actually store the parsed AST as JSON. Markup exists at exactly two boundaries: the human markup editor and the agent's tool I/O. This matters more than it sounds: our servers have no DOMParser, so all markup→AST conversion happens client-side at those boundaries, and everything downstream (slot resolution, rendering, content matching) works on typed JSON. "AST-as-HTML" really means HTML as the authoring dialect of the AST, with one guarded door between them. Blur that line and you'll end up parsing HTML in places that can't.
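The per-attribute coercion suggested in item 2 might look like the sketch below. The `attributeKinds` table and both function names are hypothetical, as are the `version` and `locked` attributes; `max-length` and `text-style` come from the markup example earlier. The inferred variant is included only for contrast.

```typescript
type AttrKind = "string" | "number" | "boolean";
type AttrValue = string | number | boolean;

// Hypothetical declarations living next to each attribute in the node config.
const attributeKinds: { [name: string]: AttrKind } = {
  "max-length": "number",
  "text-style": "string",
  "version": "string",  // hypothetical attribute: looks numeric, must stay text
  "locked": "boolean",  // hypothetical attribute
};

// Inferred coercion: guesses the type from the shape of the string.
function coerceInferred(raw: string): AttrValue {
  if (raw === "true") return true;
  if (raw === "false") return false;
  if (raw.trim() !== "" && !isNaN(Number(raw))) return Number(raw);
  return raw;
}

// Declared coercion: the config decides; undeclared attributes stay strings.
function coerceDeclared(name: string, raw: string): AttrValue {
  const declared = Object.prototype.hasOwnProperty.call(attributeKinds, name);
  const kind: AttrKind = declared ? attributeKinds[name] : "string";
  if (kind === "number") {
    const parsed = Number(raw);
    if (raw.trim() === "" || isNaN(parsed)) {
      throw new Error(`attribute ${name} expects a number, got "${raw}"`);
    }
    return parsed;
  }
  if (kind === "boolean") {
    if (raw === "true") return true;
    if (raw === "false") return false;
    throw new Error(`attribute ${name} expects true or false, got "${raw}"`);
  }
  return raw;
}

const bent: AttrValue = coerceInferred("1.5");             // the number 1.5
const kept: AttrValue = coerceDeclared("version", "1.5");  // the string "1.5"
const budget: AttrValue = coerceDeclared("max-length", "60"); // the number 60
```

The safe default matters as much as the table: an attribute nobody has declared yet stays a string, so a new attribute starts out unconverted instead of starting out as a latent bug.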
One more discipline that earns its keep: round-trip property tests. parse(build(tree)) must preserve ids, names, and attributes exactly. The formatter bug above would have been caught by a five-line test we only wrote afterward.
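A sketch of what such a test can check. Since `parse` and `build` are not shown here, this version applies the same invariant to any tree-to-tree pass and exercises it with two hypothetical formatter passes: one with the old "regenerate ids shorter than 21 characters" rule and one that only fills in missing ids. The node shape, the function names, and the counter-based id generator are all assumptions.

```typescript
// Hypothetical AST node shape; id is optional so a pass can fill in missing ones.
interface AstNode {
  id?: string;
  name: string;
  attributes: { [key: string]: string };
  children: AstNode[];
}

// Flatten a tree into one "id|name|attrs" string per node, in document order.
function fingerprint(node: AstNode): string[] {
  const attrs = Object.keys(node.attributes)
    .sort()
    .map((key) => `${key}=${node.attributes[key]}`)
    .join(";");
  let out: string[] = [`${node.id}|${node.name}|${attrs}`];
  for (const child of node.children) out = out.concat(fingerprint(child));
  return out;
}

// The property: a pipeline pass must leave ids, names, and attributes unchanged.
function preserves(pass: (tree: AstNode) => AstNode, tree: AstNode): boolean {
  return fingerprint(pass(tree)).join("\n") === fingerprint(tree).join("\n");
}

// Deterministic stand-in for a random id generator.
let counter = 0;
function generateId(): string {
  counter += 1;
  return "generated-id-" + counter;
}

// Old rule: regenerate any id shorter than 21 characters.
function formatBuggy(node: AstNode): AstNode {
  const id = node.id !== undefined && node.id.length >= 21 ? node.id : generateId();
  return { ...node, id, children: node.children.map(formatBuggy) };
}

// Fixed rule: generate an id only when one is missing.
function formatFixed(node: AstNode): AstNode {
  const id = node.id !== undefined && node.id !== "" ? node.id : generateId();
  return { ...node, id, children: node.children.map(formatFixed) };
}

const sample: AstNode = {
  id: "wgt-wrapper",
  name: "wrapper",
  attributes: { "container-classes": "p-6" },
  children: [{ id: "wgt-heading", name: "heading", attributes: { "max-length": "60" }, children: [] }],
};

const buggyPreserves: boolean = preserves(formatBuggy, sample); // false: ids were rewritten
const fixedPreserves: boolean = preserves(formatFixed, sample); // true
```

A fuller version would generate many random trees instead of one fixed sample, but even a single agent-style id like "wgt-wrapper" is enough to trip the old rule.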
When You Shouldn't Do This
- Deeply typed or numeric-heavy trees. If your nodes are mostly floats, enums, and cross-references, HTML's stringly attributes fight you. This works because document layout is already HTML-shaped.
- Huge documents. Full-tree rewrites scale with tree size. Ours are bounded (a template is a few dozen nodes); a 10,000-node scene graph would need the granular API after all, or chunked rewrites.
- Trees the model shouldn't fully see. Whole-tree I/O assumes the whole tree fits in context and is safe to show.
- Multi-writer concurrency. "Replace everything" is last-write-wins by construction. Fine for one user plus one assistant in a session; wrong for real-time collaboration.
Takeaways
- Choose formats the model already speaks. Fluency you don't have to prompt for is the cheapest capability you will ever ship.
- Prefer one coherent edit over many granular ones. Whole-artifact rewrites eliminate intermediate invalid states and the stale-reference bugs that plague mutation APIs. Add one surgical tool, not twelve.
- Validate at the boundary, then reuse the human pipeline. The agent should be an ordinary author flowing through the same funnel as everyone else, not a privileged actor with a private write path.
- Treat identifiers as a contract. Anything the agent can reference later must survive every formatter, normalizer, and round trip unchanged.
- Keep the friendly format at the edges. Store typed data; expose the ergonomic dialect only where humans and models actually author.
The unfashionable summary: we got a better agent by giving it less API and more HTML.