Building a Tax Copilot: Trustworthy AI for Tax Questions

RAG HYDE Tool Augmentation LLM Architecture Observability

Earnr

UK tax and finance app for the self-employed and their accountants

RAG + HYDE

Retrieval grounded in hypothetical answers, not raw ambiguous questions

Tool augmentation

Calculators as tools: structured outputs replace freehand model arithmetic

Modern language models can converse fluently, but trustworthy answers to tax questions require more than eloquence. This case study outlines a pragmatic, production-oriented approach to building a tax copilot that answers questions and performs calculations reliably. It is based on the approach we implemented for Earnr, a finance and tax app for the self-employed and their accountants in the UK. It focuses on the technology and the philosophy behind the system, not specific implementation details.

Why a Tax Copilot?

Tax rules are nuanced. People ask natural, messy questions. They combine multiple incomes, special allowances, edge cases, and what-ifs. A tax copilot should:

Provide clear, accurate explanations with citations.
Perform calculations reliably for the correct tax year and explain the reasoning.
Ask clarifying questions when information is missing or ambiguous.
Minimise hallucinations by grounding answers in trusted sources.

Architecture at a Glance

At a high level, the copilot either executes a tool (calculator) when it detects a computational task, or retrieves relevant context and explains the answer, optionally using a browsing LLM for live web-grounded responses.

Figure 1 — Overall architecture: the copilot routes between tool execution and retrieval-augmented generation depending on intent.

Retrieval-Augmented Generation (RAG)

RAG reduces hallucinations by giving the model a curated context. Documents (help articles, tax guides, FAQs) are embedded into vectors and stored in a vector database. At question time, the most similar snippets are retrieved to build a context prompt, and the model is asked to answer using that context and cite references where possible.
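The retrieve-then-prompt step described above can be sketched as follows. This is a minimal illustration, not the Earnr implementation: the `Chunk` type, the hand-written toy vectors, and the prompt wording are all assumptions, and a real system would obtain vectors from an embedding model and store them in a vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str          # metadata used for citations
    vector: list[float]  # toy embedding; normally produced by an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: list[Chunk]) -> str:
    """Assemble a context prompt that asks the model to cite numbered sources."""
    context = "\n".join(f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(hits, 1))
    return (
        "Answer using only the context below and cite sources as [n].\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical help-centre chunks with made-up three-dimensional vectors.
CHUNKS = [
    Chunk("Article about allowable expenses.", "help/expenses", [0.9, 0.1, 0.0]),
    Chunk("Article about filing deadlines.", "help/deadlines", [0.0, 1.0, 0.0]),
    Chunk("FAQ about expenses and deadlines.", "faq/general", [0.7, 0.7, 0.0]),
]

hits = retrieve([1.0, 0.0, 0.0], CHUNKS, k=2)
print(build_prompt("Which expenses can I claim?", hits))
```

Numbering the snippets in the prompt gives the model something concrete to cite, which makes it easier to check afterwards that each claim maps to a retrieved source.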

Practical implementation choices:

Chunking: Prefer chunks that preserve semantic coherence. Entire short articles are often better than many tiny slices. Attach metadata to each chunk (source, URL, tags, timestamps) so that results can be filtered, ranked, and cited.
Hybrid retrieval: Pair vector search with full-text search (FTS). FTS catches exact matches (rates, codes, acronyms) that vectors may miss.
Ensemble ranking: Combine signals (vector scores, FTS ranks, recency, source trust) into a final ranked list.

Hypothetical Document Embeddings (HYDE)

HYDE improves retrieval by embedding not just the question but also a hypothetical answer for vector search. The question is rephrased to be standalone, a short hypothetical answer is generated, and documents similar to that hypothetical answer are retrieved.

Matching in "answer space" tends to work better because a hypothetical answer resembles the stored documents more closely than the short, ambiguous questions a typical customer is likely to ask. HYDE is a technique from research, implemented here using standard LLMs for the rephrase and answer steps.
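The three HYDE steps (rephrase, draft a hypothetical answer, retrieve) can be sketched as a small function that takes its collaborators as parameters. All four callables are hypothetical placeholders for an LLM, an embedding model, and a vector store. Embedding the standalone question together with the hypothetical answer is one reading of "not just the question"; embedding the hypothetical answer alone is equally plausible.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    question: str,
    history: Sequence[str],
    rephrase: Callable[[str, Sequence[str]], str],  # LLM call: make the question standalone
    draft_answer: Callable[[str], str],             # LLM call: short hypothetical answer
    embed: Callable[[str], Vector],                 # embedding model
    search: Callable[[Vector, int], list],          # vector store lookup
    k: int = 5,
) -> list:
    """Retrieve documents that resemble a hypothetical answer to the question."""
    standalone = rephrase(question, history)
    hypothetical = draft_answer(standalone)
    # Assumption: question and hypothetical answer are embedded together.
    query_text = f"{standalone}\n{hypothetical}"
    return search(embed(query_text), k)


# Stub collaborators so the sketch runs without any external service.
def stub_rephrase(question: str, history: Sequence[str]) -> str:
    return "Can I claim mileage for my van?"


def stub_draft(question: str) -> str:
    return "A short hypothetical answer about claiming van mileage."


def stub_embed(text: str) -> Vector:
    return [float(len(text))]


def stub_search(vector: Vector, k: int) -> list:
    return ["doc-1", "doc-2", "doc-3"][:k]


print(hyde_retrieve("What about my van?", ["Can I claim mileage?"],
                    stub_rephrase, stub_draft, stub_embed, stub_search, k=2))
```

The hypothetical answer is used only as a retrieval query. It is never shown to the user, so any errors in it affect recall rather than the final answer.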

Vector Database and Full-Text Search

Vector DB: Stores dense embeddings for semantic similarity. Great for matching concepts even when wording differs.
Full-text search: Fast keyword search with ranking. Excellent for exact terms: thresholds, bands, statutory phrases.
Use both: Build a hybrid retrieval layer and ensemble-rank the results. Combining the two improves relevance and, with it, trust in the answers.

Optional Web-Grounded Answers

Sometimes the best answer is on the open web, for example brand-new guidance relevant for the next tax year. Here we either call a browsing LLM that fetches sources and synthesises an answer, or run a lightweight search, fetch, and extract pipeline over a curated set of trusted domains.

This path is used selectively:

Favour internal knowledge for stability and consistency.
Use web answers when the internal knowledge is insufficient for a particular context.
Clearly cite sources for trust and explainability.
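For the search, fetch, and extract variant, restricting fetches to a curated set of trusted domains can be as simple as an allow-list check on the URL host. The sketch below assumes HTTPS-only sources, and the contents of `TRUSTED_DOMAINS` are purely illustrative.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; the real curated set is not given in this article.
TRUSTED_DOMAINS = frozenset({"gov.uk", "example.org"})


def is_trusted(url: str, trusted: frozenset[str] = TRUSTED_DOMAINS) -> bool:
    """True if the URL is HTTPS and its host is a trusted domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    return any(host == domain or host.endswith("." + domain) for domain in trusted)


def filter_results(urls: list[str]) -> list[str]:
    """Keep only search results that may be fetched and cited."""
    return [url for url in urls if is_trusted(url)]


print(filter_results([
    "https://www.gov.uk/some-guidance",
    "https://gov.uk.evil.example/phish",
    "http://www.gov.uk/not-https",
]))
```

Matching on `"." + domain` rather than a bare suffix prevents look-alike hosts such as `gov.uk.evil.example` from passing the check.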

Tool-Augmented Generation (Calculators)

For tax, doing the math reliably is crucial, and relying on the model to calculate is risky. Instead, calculators are defined as tools (functions) with strict JSON schemas covering inputs, constraints, and descriptions. The model chooses and calls a tool when the user asks for a computation. The tool returns structured outputs (band splits, tax subtotals, totals), which are then post-processed into human-readable explanations.

Example tools (illustrative):

Capital gains tax
Take-home pay, income tax, National Insurance
Tax-free allowance
Mortgage amortisation

The model is instructed to explain results from tool outputs only, without resorting to freehand arithmetic. Inputs are validated against schemas, errors are handled gracefully, and clarifying questions are asked when required inputs are missing or ambiguous.
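A calculator tool of this kind might look like the sketch below. Everything in it is illustrative: the tool name `band_tax`, the schema fields, and especially `PLACEHOLDER_BANDS`, whose thresholds and rates are invented and are not real tax rates. The schema check covers only required fields, as a stand-in for full JSON Schema validation.

```python
from decimal import Decimal

# Tool definition in the JSON-schema style used by function-calling APIs (hypothetical).
BAND_TAX_TOOL = {
    "name": "band_tax",
    "description": "Split a taxable amount across progressive bands and total the tax.",
    "parameters": {
        "type": "object",
        "properties": {
            "taxable_amount": {"type": "number", "minimum": 0},
            "tax_year": {"type": "string"},
        },
        "required": ["taxable_amount", "tax_year"],
    },
}

# Invented bands, NOT real rates: (upper limit or None for "no limit", rate).
PLACEHOLDER_BANDS = {
    "YEAR-A": [(Decimal("10000"), Decimal("0.10")), (None, Decimal("0.20"))],
}


def band_tax(taxable_amount, tax_year: str) -> dict:
    """Deterministic band split; Decimal avoids binary floating-point rounding."""
    amount = Decimal(str(taxable_amount))
    if amount < 0:
        raise ValueError("taxable_amount must not be negative")
    if tax_year not in PLACEHOLDER_BANDS:
        raise ValueError(f"no bands configured for tax year {tax_year!r}")
    lower, splits = Decimal("0"), []
    for upper, rate in PLACEHOLDER_BANDS[tax_year]:
        top = amount if upper is None else min(amount, upper)
        portion = max(top - lower, Decimal("0"))
        splits.append({"rate": str(rate), "taxable": str(portion), "tax": str(portion * rate)})
        if upper is None or amount <= upper:
            break
        lower = upper
    total = sum((Decimal(s["tax"]) for s in splits), Decimal("0"))
    return {"tax_year": tax_year, "bands": splits, "total_tax": str(total)}


def call_tool(args: dict) -> dict:
    """Validate model-supplied arguments, then run the calculator."""
    missing = [name for name in BAND_TAX_TOOL["parameters"]["required"] if name not in args]
    if missing:
        # The orchestrator can turn this into a clarifying question.
        return {"status": "needs_clarification", "missing": missing}
    try:
        return {"status": "ok", "result": band_tax(args["taxable_amount"], args["tax_year"])}
    except (ValueError, ArithmeticError) as exc:
        return {"status": "error", "message": str(exc)}


print(call_tool({"taxable_amount": 15000, "tax_year": "YEAR-A"}))
print(call_tool({"taxable_amount": 15000}))
```

Because the tool returns the band splits as data, the model only has to narrate numbers it was given, and a missing tax year surfaces as a clarifying question instead of a guess.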

Orchestration Loop

Rephrase the user's question as a standalone query, resolving references to earlier chat turns.
If the question implies a calculation, select an appropriate tool and execute it.
If the question seeks explanation, retrieve context using HYDE and hybrid search.
If web freshness is necessary, consider a web-grounded path.
Generate and stream the answer, including citations when available.
If anything is missing or ambiguous, ask for clarification.
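The loop above can be sketched as a single routing function. The `Pipeline` fields are hypothetical injection points for the components described in earlier sections, the `min_hits` threshold is an assumed heuristic for "internal knowledge is insufficient", and streaming is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    rephrase: Callable[[str, list[str]], str]  # make the question standalone
    classify: Callable[[str], str]             # returns "calculate" or "explain"
    run_tool: Callable[[str], dict]            # select and execute a calculator
    retrieve: Callable[[str], list[str]]       # HYDE + hybrid search over internal docs
    web_answer: Callable[[str], list[str]]     # optional web-grounded snippets
    generate: Callable[[str, list[str]], str]  # final LLM answer from context


def answer(question: str, history: list[str], p: Pipeline, min_hits: int = 1) -> str:
    standalone = p.rephrase(question, history)

    if p.classify(standalone) == "calculate":
        outcome = p.run_tool(standalone)
        if outcome.get("status") == "needs_clarification":
            return "To work this out I need: " + ", ".join(outcome["missing"]) + "."
        # The model explains the structured tool output; it does no arithmetic itself.
        return p.generate(standalone, [str(outcome)])

    context = p.retrieve(standalone)
    if len(context) < min_hits:
        # Internal knowledge is insufficient, so fall back to the web-grounded path.
        context = p.web_answer(standalone)
    if not context:
        return "I could not find a reliable source for that. Could you give more detail?"
    return p.generate(standalone, context)


# Stub components so the sketch runs on its own.
demo = Pipeline(
    rephrase=lambda q, h: q,
    classify=lambda q: "calculate" if "how much" in q.lower() else "explain",
    run_tool=lambda q: {"status": "needs_clarification", "missing": ["tax_year"]},
    retrieve=lambda q: [],
    web_answer=lambda q: ["web snippet"],
    generate=lambda q, ctx: f"{q} | {ctx[0]}",
)
print(answer("How much tax will I pay?", [], demo))
print(answer("What is an allowance?", [], demo))
```

Keeping each component behind a plain callable makes the routing logic testable with stubs before any model or database is wired in.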

Observability, Guardrails, and Costs

Observability: Track token usage, response times, and failure modes. Aggregate per feature and per model to tune prompts and routing.
Guardrails: Use system prompts that forbid unsupported assumptions, validate inputs, and return clear error messages.
Latency: Stream tokens for perceived responsiveness. Limit context length, cache frequently used embeddings, and pre-rank likely documents.
Model routing: Define a primary and an alternate model to trade cost against quality. Consider smaller models for classification and selection, and larger ones for final answers.
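Per-feature, per-model aggregation of tokens, latency, and failures can be sketched with a small context manager. The `UsageTracker` class and its field names are hypothetical, and a production system would more likely export these numbers to a metrics backend than hold them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class UsageTracker:
    """Aggregates calls, failures, tokens, and wall-clock time per (feature, model)."""

    def __init__(self) -> None:
        self.stats = defaultdict(
            lambda: {"calls": 0, "failures": 0, "tokens": 0, "seconds": 0.0}
        )

    @contextmanager
    def track(self, feature: str, model: str):
        entry = self.stats[(feature, model)]
        call = {"tokens": 0}  # the caller fills this in from the API response
        start = time.perf_counter()
        try:
            yield call
        except Exception:
            entry["failures"] += 1
            raise
        finally:
            entry["calls"] += 1
            entry["tokens"] += call["tokens"]
            entry["seconds"] += time.perf_counter() - start


tracker = UsageTracker()
with tracker.track("rag_answer", "large-model") as call:
    call["tokens"] = 120  # e.g. prompt + completion tokens reported by the provider
print(dict(tracker.stats))
```

Keying the statistics by both feature and model makes it possible to compare, say, a smaller classification model against a larger answer model when tuning routing.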

Privacy and Security

Store only what you need. Minimise PII and mask or drop sensitive fields.
Authenticate access to retrieval and tools. Use project-scoped API keys and per-environment secrets.
Log safely: redact inputs and headers. Align with data retention policies.
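Redaction before logging can be sketched with a few regular expressions. The patterns below are deliberately simplified assumptions (a loose email pattern and a National Insurance number shape of two letters, six digits, and a final letter A to D), and the header list is illustrative. Real redaction needs broader coverage and testing against actual traffic.

```python
import re

# Simplified, illustrative patterns; they will not catch every variant.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NINO = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b", re.IGNORECASE)

# Hypothetical set of headers that should never reach the logs.
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}


def redact_text(text: str) -> str:
    """Mask email addresses and NI-number-shaped strings in free text."""
    text = EMAIL.sub("[EMAIL]", text)
    return NINO.sub("[NINO]", text)


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Replace the values of sensitive headers, leaving the rest untouched."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


print(redact_text("Email jo@example.com, NI number QQ 12 34 56 C"))
print(redact_headers({"Authorization": "Bearer abc", "Accept": "application/json"}))
```

Applying redaction at the logging boundary, rather than in each caller, reduces the chance that a new code path logs raw user input by accident.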

End-to-End Flow

Figure 2 — End-to-end sequence: the orchestrator routes between calculator tools and retrieval-augmented generation, streaming the final answer to the user.

Closing Thoughts

A credible tax copilot blends generative language with grounded knowledge and reliable tools. RAG reduces hallucination, HYDE improves retrieval, hybrid search increases recall, and calculators turn natural language into precise math. Add careful orchestration, observability, and principled guardrails, and you have an assistant that is helpful, trustworthy, and fast enough to feel responsive.

HYDE refers to Hypothetical Document Embeddings, a technique from research in which any modern LLM generates the hypothetical answer used for retrieval. A browsing LLM denotes an API that integrates search and page fetching during generation.