Turning Chatbot Replies into Interactive Artifacts

AI agents are getting ridiculously powerful. But most of us still experience them as... chatbots. Text in, text out. Maybe an image if we’re lucky. This is fine for summaries, but falls apart when we need actual analysis, interactive visuals, or domain-specific tools. To solve this, we started building a collection of interactive elements that LLMs can speak into existence, and a client can auto-detect and render. Instead of a wall of text, the same answer becomes a chart, a knowledge graph, a KPI, or anything else we can imagine.


Why chat alone isn’t enough

Use an LLM-heavy product for more than a few minutes and you’ll notice a pattern:

  1. The model gives a great answer.
  2. You immediately copy‑paste it somewhere else.

Sheets for analysis, graph tools for relationships, slides for presentations. The real work happens outside the conversation. The chat window is just a text generator. We wanted the chat itself to become the workspace. That meant the model had to return more than paragraphs. It needed to return artifacts: things the UI could render, query, and reuse. Our approach consists of two pieces:

  1. A vocabulary of artifacts. Types like bar_chart, knowledge_graph, kpi, sankey_diagram, etc. Each one has:

    • a representative code fence
    • a renderer component (React, a web component, etc.)
  2. A way for the model to emit them. We use semantic code fences inside normal chat messages, e.g.:

{ ...JSON... }
{ ...CSV... }

The model doesn’t “talk to the UI” directly. It just returns text that contains little islands tagged with artifact types that the frontend understands. The frontend scans for these fences, validates the content against the schema for that type, and hands it off to the right renderer. To the user, it looks like the chatbot reply contains a live chart or graph they can play with.
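
To make the first piece concrete, here is a rough TypeScript sketch of what a client-side registry could look like. All names are illustrative, the renderers are stand-ins for real React or web components, and only two types are shown:

```typescript
// Illustrative sketch of a client-side artifact registry.
// Each semantic fence tag maps to a validator and a renderer.
type Renderer = (payload: unknown) => string; // stand-in for a React/web component

interface ArtifactDefinition {
  // Returns the parsed payload, or null to reject and fall back to raw text.
  validate: (raw: string) => unknown | null;
  render: Renderer;
}

const artifactRegistry: Record<string, ArtifactDefinition> = {
  chart: {
    validate: (raw) => {
      try {
        const spec = JSON.parse(raw);
        return typeof spec.type === "string" && spec.data ? spec : null;
      } catch {
        return null;
      }
    },
    render: (spec) => `<ChartWidget spec='${JSON.stringify(spec)}' />`,
  },
  dicom: {
    // Not JSON at all: just one image filename per line.
    validate: (raw) => raw.trim().split("\n").map((line) => line.trim()),
    render: (files) => `<DicomViewer files='${JSON.stringify(files)}' />`,
  },
};
```

Adding a new artifact type is then just adding a new entry: a tag, a validator, and a renderer.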


Semantic code fences in practice

You’ve seen normal code fences:

```python
print("hello")
```

We keep the same idea, but the “language” is semantic instead of a programming language:

   ```chart
   data + chart spec
   ```

   ```knowledge_graph
   entity relationships data
   ```

   ```kpi
   data for kpi, including styles
   ```

Inside each fence is structured content, often JSON, but not always. For example, medical imaging often uses DICOM (Digital Imaging and Communications in Medicine). A code fence for this might just be a list of image filenames, like so:

```dicom
JohnDoe_mri.jpg
JohnDoe_xray.jpg
```
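
Pulling these islands out of a chat message is ordinary string work. Here is a minimal sketch of the scanning step, assuming well-formed fences whose tag is a single word:

```typescript
// Minimal sketch: pull semantic code fences out of a chat message.
// Assumes well-formed fences; a real parser would also handle streaming
// partial messages and indented or nested fences.
interface Fence {
  type: string; // e.g. "chart", "knowledge_graph", "kpi", "dicom"
  body: string; // raw content between the fence markers
}

function extractFences(message: string): Fence[] {
  const fencePattern = /`{3}(\w+)\n([\s\S]*?)`{3}/g;
  const fences: Fence[] = [];
  for (const match of message.matchAll(fencePattern)) {
    fences.push({ type: match[1], body: match[2].trim() });
  }
  return fences;
}

// Example with the DICOM fence from above:
const tick = "`".repeat(3);
const reply = `Here are the relevant images:\n${tick}dicom\nJohnDoe_mri.jpg\nJohnDoe_xray.jpg\n${tick}`;
console.log(extractFences(reply));
// -> [ { type: "dicom", body: "JohnDoe_mri.jpg\nJohnDoe_xray.jpg" } ]
```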

The basic loop

Here’s the end‑to‑end flow:

  1. User asks for something beyond prose
    “What is a typical medical journey for patients who receive an insertable cardiac monitor?”
    “Show me the prescriptions for Drug A over the last 3 years.”

  2. The agent picks an artifact type
    Instead of describing a chart, it emits:

      ```chart
      {
        "type": "bar",
        "data": {
          "labels": ["2022", "2023", "2024"],
          "datasets": [
            { "label": "Totals", "data": [245000, 198000, 312000] }
          ]
        }
      }
      ```
    

    (Rendered for the user as an interactive bar chart.)

  3. The client scans for fences. On the front end, we parse the message, find the fences, and look up each type in the artifact registry.

  4. We validate and render. If the content passes validation, we hand it to the renderer; if it fails, we fall back to an error plus the raw response (sketched just below).

  5. The user sees a widget, not a blob. To the end user, it’s just a chatbot reply with a live chart or graph embedded.

The chat protocol stays simple (still plain text), but the UI stays in control of what gets rendered.
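
Putting steps 3 and 4 together, the render decision is a small function. This sketch reuses the shapes from the registry example above; anything with an unknown tag or a failed validation falls back to an error plus the raw fence body:

```typescript
// Sketch of the render decision from steps 3-4 above.
type RenderResult =
  | { kind: "widget"; html: string }
  | { kind: "fallback"; error: string; raw: string };

function renderFence(
  fence: { type: string; body: string },
  registry: Record<string, { validate: (raw: string) => unknown | null; render: (p: unknown) => string }>
): RenderResult {
  const definition = registry[fence.type];
  if (!definition) {
    // Unknown tag: treat the fence as ordinary preformatted text.
    return { kind: "fallback", error: `Unknown artifact type "${fence.type}"`, raw: fence.body };
  }
  const payload = definition.validate(fence.body);
  if (payload === null) {
    // Schema check failed: show an error plus the raw response.
    return { kind: "fallback", error: `Invalid ${fence.type} payload`, raw: fence.body };
  }
  return { kind: "widget", html: definition.render(payload) };
}
```

The diagram below traces the same flow end to end.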

```mermaid
graph TD
    %% Styling
    classDef user fill:#f9f,stroke:#333,stroke-width:2px;
    classDef system fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
    classDef logic fill:#fff9c4,stroke:#fbc02d,stroke-width:2px;
    classDef success fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef fail fill:#ffebee,stroke:#c62828,stroke-width:2px;

    %% Nodes
    User((User)):::user
    Agent[Agent / LLM]:::system

    subgraph "The Agent Output"
        RawMsg[Message Content<br/>Prose + JSON in Fences]:::system
    end

    subgraph "Client-Side Parsing"
        Parser[Scanner parses message<br/>& identifies Fences]:::logic
        Registry{Lookup Type<br/>in Registry?}:::logic
        Validator{Validate<br/>Schema?}:::logic
    end

    subgraph "Render Logic"
        Renderer[Pass to<br/>Specific Renderer]:::success
        Fallback[Fallback to<br/>Error + Raw Response]:::fail
    end

    FinalUI[Final Interface Display]:::user

    %% Connections
    User -->|Asks for Chart/Data| Agent
    Agent -->|Generates Artifact| RawMsg

    RawMsg -->|Emits JSON| Parser
    Parser --> Registry

    Registry -- Type Found --> Validator
    Registry -- Not Found --> Fallback

    Validator -- Valid --> Renderer
    Validator -- Invalid --> Fallback

    Renderer -->|Widget Rendered| FinalUI
    Fallback -->|Blob/Text Rendered| FinalUI

    %% Note for context
    note1[Example: 'Show me Drug A prescriptions']
    note1 -.-> User
```

Backed by MCP

One piece that makes this scale: each element isn’t just a frontend component; it also has its own MCP (Model Context Protocol) tool on the backend. This tool defines the input schema, explains when the agent should call it, and returns a semantic code fence that the agent can inject into its output. So the agent doesn’t actually write the fences itself. It decides when a user’s query warrants an element-generating tool call, and gets back a ready‑to‑render fence.
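
For illustration, a tool along these lines might look roughly like the following, using the MCP TypeScript SDK’s tool registration. The tool name, description, and input schema are invented for this sketch; the key point is simply that the return value is already a complete chart fence:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "artifact-tools", version: "0.1.0" });

// Hypothetical bar-chart tool: the agent calls it when a query warrants a
// chart, and gets back a ready-to-render chart fence to place in its reply.
const tick = "`".repeat(3);

server.tool(
  "bar_chart",
  "Use when the user asks to compare quantities across categories or over time.",
  {
    labels: z.array(z.string()),
    datasets: z.array(z.object({ label: z.string(), data: z.array(z.number()) })),
  },
  async ({ labels, datasets }) => ({
    content: [
      {
        type: "text" as const,
        text: `${tick}chart\n${JSON.stringify({ type: "bar", data: { labels, datasets } }, null, 2)}\n${tick}`,
      },
    ],
  })
);
// (Connecting the server to a transport is omitted here.)
```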

You might be wondering why this couldn’t just be a giant system prompt. In theory, we could cram everything into one huge system prompt describing how to build and format the fences for charts, graphs, and every other artifact. That doesn’t scale well: the more artifacts you introduce, the longer your system prompt gets, which has three main drawbacks:

  • LLM recall: the LLM (especially a smaller, less powerful one) forgets how to render certain artifacts, or struggles with consistency
  • Cost: a longer system prompt means more tokens, which means higher API costs
  • Latency: a longer system prompt means you wait longer for a response

In addition to solving these problems, the MCP + artifacts approach also buys us:

  • No infinitely growing prompt. Each element carries its own “how and when to use me” inside the tool definition. Add a new element → add a new tool. The core prompt stays small.

  • Clear boundaries. A chart tool focuses on chart JSON. A knowledge graph tool focuses on entity-relation-entity triplets. The agent just picks which tool fits.

  • Shared behavior across agents. Any agent that knows about the artifacts gets the same capabilities. No copy‑pasting formatting rules across prompts.

  • Easy iteration. Want to change the chart schema? Update the tool and the renderer. Prompts don’t have to be rewritten.

In practice, this means the chat window stops being a dead-end text box and becomes a live surface for work:

  • A data analyst can ask, “Compare these companies' sales over time,” and immediately get an explorable chart they can sort, refine, and download—not just a paragraph about trends.
  • A clinician can say, “Show me how drug A affects patients with comorbidity B,” and see an interactive knowledge graph instead of skimming through bullet lists.
  • A product manager can track KPIs as live widgets right inside the conversation, instead of copy‑pasting numbers into a deck.

Because each artifact is backed by an MCP tool, the artifacts stay consistent, reusable, and easy to extend. You don’t rebuild capabilities per‑agent or per‑prompt; you add a new artifact type once, wire it in, and every compatible agent can start turning chat into a real workspace rather than just a nicer search box.

We are thinking of calling this approach "SAIL: Structured Artifact Interface Library". What do you think?