
TinyTeapot 🫖

Website | Try out our Demo | Join the discussion on Discord

TinyTeapot is a lightweight (~77M parameter) grounded language model optimized for low-latency, hallucination-resistant question answering and RAG workflows. Building on our prior work, TeapotLLM, TinyTeapot delivers strong context-faithful performance while being ~10× faster on CPU, making it ideal for real-time, on-device, and cost-efficient deployments across CPUs, mobile devices, and other resource-constrained environments.

TinyTeapot is distilled from TeapotLLM and trained on grounded datasets: SynthQA, a context-focused QnA and extraction benchmark, and TeapotChat, which covers instruction following and grounded dialogue. This distillation transfers TeapotLLM’s refusal behavior, structured extraction patterns, and context-only answering into a significantly smaller, edge-efficient model.

Hallucination Resistance

Through distillation from TeapotLLM and training on SynthQA, TinyTeapot learns to refuse questions when the answer is not present in the context, improving reliability compared to similarly sized models in RAG and document QA pipelines.
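In a RAG pipeline, a refusal is most useful when downstream code can detect it and react, for example by retrieving more documents. The following is a minimal sketch, assuming refusals contain a phrase like the one in this card's system prompt ("don't have any information"). The marker list and the helper names `is_refusal` and `answer_or_fallback` are illustrative assumptions, not part of TinyTeapot or the teapotai library.

```python
# Hypothetical helpers: the marker phrases are assumptions based on the
# refusal wording shown in this card, not an official list.
REFUSAL_MARKERS = (
    "don't have any information",
    "do not have any information",
)


def normalize(text: str) -> str:
    """Lowercase and unify curly apostrophes so matching is more robust."""
    return text.lower().replace("\u2019", "'").strip()


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains one of the assumed refusal phrases."""
    cleaned = normalize(answer)
    return any(marker in cleaned for marker in REFUSAL_MARKERS)


def answer_or_fallback(answer: str, fallback: str = "NO_ANSWER") -> str:
    """Pass real answers through; map refusals to a sentinel value."""
    return fallback if is_refusal(answer) else answer
```

Substring matching is deliberately loose because the model may vary its wording; stricter pipelines may prefer an exact-match check against the system prompt's refusal sentence.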

Training & Evaluation

🚀 ~10x Faster CPU Inference than TeapotLLM with Strong Grounded Performance

TinyTeapot is evaluated primarily against its teacher model (TeapotLLM) on SynthQA-style grounded tasks, with additional comparisons to larger instruction-tuned models such as LLaMA and Qwen to contextualize efficiency vs quality tradeoffs.

Task Performance vs TeapotLLM

The task-level comparison chart shows similarity across boolean reasoning, QA, extraction, summarization, and unanswerable queries. Although much smaller (~77M vs ~800M+ parameters), TinyTeapot retains strong grounded behavior due to distillation from TeapotLLM and training on SynthQA and TeapotChat.

TeapotLLM remains the strongest overall teacher model
TinyTeapot maintains high QA and summarization similarity for its size
Strong refusal performance on unanswerable questions due to grounded training
Significant efficiency gains with minimal degradation on core grounded tasks

Latency vs Answer Similarity (CPU & Accelerator)

TinyTeapot is designed for real-time and edge deployments where latency is critical. Benchmarks on Google Colab (100 runs) show that TinyTeapot delivers competitive grounded similarity while being dramatically faster than larger models.

Key observations:

Average CPU latency of 3.3 s for TinyTeapot vs 38 s for TeapotLLM (~10x faster)
Competitive similarity despite a ~10x smaller parameter count
Significantly faster and more accurate than other SOTA ~1B-parameter models in CPU-constrained settings
Sub-second accelerator latency while maintaining grounded response quality

These results position TinyTeapot as a high-efficiency distilled model that preserves TeapotLLM’s grounded reasoning while enabling low-cost, low-latency deployment.
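The card reports averages over 100 runs but does not describe the timing harness. As a rough sketch of how such a measurement could be reproduced, the helper below averages wall-clock time over repeated calls and computes the speedup ratio from the two reported CPU figures. The function names `mean_latency` and `speedup` are hypothetical, and the callable passed in stands for whatever inference call is being measured.

```python
import statistics
import time
from typing import Callable


def mean_latency(fn: Callable[[], object], runs: int = 100) -> float:
    """Average wall-clock seconds per call of fn over `runs` calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# CPU figures reported in this card: 38 s (TeapotLLM) vs 3.3 s (TinyTeapot).
print(f"{speedup(38.0, 3.3):.1f}x")  # prints "11.5x"
```

In practice `fn` would wrap a single generation call, for example `lambda: ask("How tall is the Eiffel Tower")`, and a few warm-up calls before timing help avoid counting one-time model loading.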

Getting Started

TinyTeapot follows a text-to-text format and can be used directly with Hugging Face Transformers or through our Python library, teapotai. It performs best when given explicit grounding context and the pre-trained system prompt.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "teapotai/tinyteapot"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Context
context = (
    "The Eiffel Tower is a wrought iron lattice tower in Paris, France. "
    "It was designed by Gustave Eiffel and completed in 1889. "
    "It stands at a height of 330 meters and is one of the most recognizable "
    "structures in the world."
)

# System prompt
system_prompt = (
    "You are Teapot, an open-source AI assistant optimized for low-end devices, "
    "providing short, accurate responses without hallucinating while excelling at "
    "information extraction and text summarization. "
    "If the context does not answer the question, reply exactly: "
    "'I am sorry but I don't have any information on that'."
)

def ask(question: str):
    prompt = f"{context}\n{system_prompt}\n{question}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=False
    )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"{question}")
    print(f"{answer}\n")


# Test 1: Grounded question (should use context)
ask("How tall is the Eiffel Tower") # => 330 meters

# Test 2: Out-of-context question (tests hallucination resistance)
ask("How tall is the Death Star") # => Sorry, I don't have any information on the Death Star.

Recommended Use Cases

Retrieval-Augmented Generation (RAG)
Document question answering
Information extraction
On-device assistants
Mobile and edge inference
Low-latency production pipelines

TinyTeapot performs best when paired with retrieval or structured context inputs rather than open-ended chat.
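To illustrate pairing the model with retrieval, here is a minimal sketch that picks the most relevant passage by word overlap and assembles a prompt in the same context, system prompt, question order used in Getting Started. The word-overlap scorer is a deliberately crude stand-in for a real retriever (embeddings or BM25), and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names. The short system prompt is a placeholder; use the full pre-trained prompt in practice.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real retriever's features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Rank passages by word overlap with the question (illustrative only)."""
    q_words = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & tokenize(p)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(context: str, system_prompt: str, question: str) -> str:
    """Assemble context, system prompt, and question in that order."""
    return f"{context}\n{system_prompt}\n{question}\n"


passages = [
    "The Eiffel Tower stands at a height of 330 meters.",
    "The Louvre is an art museum in Paris, France.",
]
question = "How tall is the Eiffel Tower"
context = "\n".join(retrieve(question, passages))
# Placeholder system prompt for brevity (assumption, not the trained prompt).
prompt = build_prompt(context, "You are Teapot.", question)
```

The resulting `prompt` string can be passed to the tokenizer and `model.generate` exactly as in the Getting Started example.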

Limitations and Risks

TinyTeapot is optimized for grounded QnA, RAG, and extraction. It is not intended for open-ended chat, creative writing, or deep multi-step reasoning. Due to its small size (~77M parameters), its performance depends heavily on the quality and relevance of the provided context, and it may be more prone to hallucination than larger models.

Questions, Feature Requests?

We hope you find TinyTeapot useful. We are continuously improving the TeapotAI ecosystem, so please reach out on our Discord for technical help, feedback, or feature requests. We look forward to seeing what the community builds!

License

MIT License


Model Details

Model size: 77M params
Tensor type: F32
Base model: google/flan-t5-small
