Last updated: 14 March 2026 · Methodology v2.0.0
TokenTrace estimates the inference carbon footprint of AI queries using a combination of published research, publicly available data, and reasonable assumptions. This page explains exactly how we do it, where the numbers come from, and where we think the limitations are.
For text-based AI queries, the carbon footprint is calculated as:

carbon (gCO₂e) = (input_tokens + output_tokens) / 1,000 × carbon_per_1k_ceaf_adj

Where:

- `input_tokens` and `output_tokens` are the estimated token counts for your prompt and the model's response.
- `carbon_per_1k_ceaf_adj` is the model's CEAF-adjusted carbon intensity per 1,000 tokens (see the per-model table below).
- The region multiplier, which scales the carbon estimate based on grid location, is no longer applied to text models. In v2.0, provider-specific grid intensity (CIF) and clean energy adjustments (CEAF) are baked directly into the per-model values from the TokenTrace reference papers, so the region selector is removed for text models. Multimodal models (image, audio, video) continue to use the region multiplier.
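As a worked illustration (a minimal sketch, not the production implementation), here is the same formula in Python, using GPT-4o's CEAF-adjusted value of 0.21 gCO₂e per 1K tokens from the model table below; the function name and token counts are chosen for the example.

```python
# Minimal sketch of the per-query text formula. 0.21 is GPT-4o's CEAF-adjusted
# gCO2e per 1,000 tokens from the model table below; the function name and
# structure are illustrative, not TokenTrace's actual code.
def text_query_carbon_g(input_tokens: int, output_tokens: int,
                        carbon_per_1k_ceaf_adj: float) -> float:
    return (input_tokens + output_tokens) / 1000 * carbon_per_1k_ceaf_adj

# A 150-token prompt with a medium (~400-token) GPT-4o response:
# (150 + 400) / 1000 * 0.21 ≈ 0.12 gCO2e
print(text_query_carbon_g(150, 400, 0.21))
```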
The idea is straightforward: more tokens means more computation, which means more energy, which means more carbon. The amount of carbon per unit of computation varies by model (larger models use more energy) and by provider (clean energy procurement and data centre location affect emissions).
AI models process text as tokens — roughly, sub-word units that average about 0.75 words each (or about 4 characters). The more tokens in your prompt and the model's response, the more computation is required.
We estimate input tokens from your prompt text. Where possible, we use the model's actual tokenizer for accuracy; otherwise, we fall back to character-based heuristics.
| Method | Used for | Accuracy |
|---|---|---|
| tiktoken (BPE) | GPT-4o, GPT-4.1, GPT-5, GPT o3 | ±2% |
| 3.8 chars/token | Claude models | ±10% |
| 4.0 chars/token | Gemini models | ±12% |
| 4.0 chars/token | All other models | ±15% |
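For illustration, a sketch of how these fallbacks might be wired together. The helper function and the model-to-heuristic mapping below are our own assumptions for the example, not TokenTrace's actual code; only the tiktoken call and the chars-per-token ratios come from the table above.

```python
import tiktoken

# Character-per-token heuristics from the table above; mapping keys and the
# helper function are illustrative assumptions.
CHARS_PER_TOKEN = {"claude": 3.8, "gemini": 4.0}
DEFAULT_CHARS_PER_TOKEN = 4.0

def estimate_input_tokens(prompt: str, model: str) -> int:
    try:
        # Exact BPE count for models tiktoken recognises (±2%)
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(prompt))
    except KeyError:
        # Character-based fallback for other providers (±10–15%)
        ratio = next((v for k, v in CHARS_PER_TOKEN.items() if k in model.lower()),
                     DEFAULT_CHARS_PER_TOKEN)
        return max(1, round(len(prompt) / ratio))
```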
Since we can't know in advance how long a model's response will be, we use preset estimates based on the response length you select:
| Response length | Approximate words | Estimated tokens |
|---|---|---|
| Short | ~100 | 133 |
| Medium | ~300 | 400 |
| Long | ~1,000 | 1,333 |
| Very long | ~5,000 | 6,667 |
When you use the "Try It Live" feature, we replace these estimates with the actual token counts from the real API response.
Each text AI model has two key carbon values: a gross (location-based) estimate and a CEAF-adjusted (market-based) estimate that accounts for the provider's clean energy procurement. The CEAF-adjusted value is the primary metric shown throughout TokenTrace.
Our v2.0 values are derived from the TokenTrace reference papers (March 2026), which estimate per-token carbon intensity for 25 text models across 6 providers using a combination of independent measurements, scaling from measured models (architecture, pricing, and parameter count), and public information, as reflected in the confidence tiers below. All carbon values in the table are gCO₂e per 1,000 tokens.
| Model | Provider | Gross | CEAF-adj | Range | CEAF % | Confidence |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | Google | 0.25 | 0.09 | 0.06–0.12 | 66% | Medium |
| Gemini 3.1 Flash-Lite | Google | 0.25 | 0.09 | 0.06–0.12 | 66% | Medium |
| Mistral Small 3 | Mistral | 0.18 | 0.09 | 0.05–0.15 | 50% | Low |
| Gemini 3 Flash | Google | 0.30 | 0.10 | 0.07–0.14 | 66% | Medium |
| GPT-4o Mini | OpenAI | 0.30 | 0.15 | 0.13–0.28 | 50% | Medium |
| Claude Haiku 4.5 | Anthropic | 0.31 | 0.16 | 0.10–0.25 | 50% | Low |
| GPT-5 Mini | OpenAI | 0.37 | 0.19 | 0.13–0.25 | 50% | Medium-low |
| Gemini 3 Pro | Google | 0.60 | 0.20 | 0.14–0.28 | 66% | Low |
| Gemini 3.1 Pro | Google | 0.60 | 0.20 | 0.14–0.28 | 66% | Low |
| GPT-4o | OpenAI | 0.42 | 0.21 | 0.13–0.32 | 50% | Medium |
| GPT-4.1 | OpenAI | 0.45 | 0.23 | 0.15–0.35 | 50% | Medium-low |
| Claude Sonnet 4.6 | Anthropic | 0.51 | 0.26 | 0.18–0.38 | 50% | Low |
| GPT-5 (standard) | OpenAI | 0.55 | 0.28 | 0.20–0.43 | 50% | Low |
| Mistral Medium 3 | Mistral | 0.55 | 0.28 | 0.18–0.40 | 50% | Low |
| DeepSeek V3 (Azure) | DeepSeek | 0.60 | 0.30 | 0.20–0.45 | 50% | Medium-low |
| Mistral Large 3 | Mistral | 0.75 | 0.38 | 0.25–0.55 | 50% | Medium-low |
| Claude Opus 4.6 | Anthropic | 0.78 | 0.39 | 0.28–0.55 | 50% | Low |
| Gemini 3 Flash (Thinking) | Google | 1.80 | 0.61 | 0.15–1.12 | 66% | Low |
| Claude Sonnet 4.6 (Thinking) | Anthropic | 1.50 | 0.75 | 0.38–2.25 | 50% | Very low |
| DeepSeek R1 (Azure) | DeepSeek | 1.80 | 0.90 | 0.50–1.75 | 50% | Low |
| Claude Opus 4.6 (Thinking) | Anthropic | 2.20 | 1.10 | 0.50–3.50 | 50% | Very low |
| GPT-5 (thinking) | OpenAI | 2.80 | 1.40 | 0.43–4.30 | 50% | Low |
| GPT-5.4 | OpenAI | 3.50 | 1.75 | 1.20–10.00 | 50% | Very low |
| DeepSeek V3 (App) | DeepSeek | 2.30 | 2.30 | 1.50–3.50 | 0% | Low |
| GPT o3 | OpenAI | 6.00 | 3.00 | 0.90–20.00 | 50% | Low |
| DeepSeek R1 (App) | DeepSeek | 7.00 | 7.00 | 4.00–12.00 | 0% | Very low |
Not all carbon estimates are created equal. The confidence of our per-model values depends on how much data the provider discloses. We classify each model into one of four tiers:
| Tier | Description | Providers |
|---|---|---|
| Provider-disclosed | Provider publishes per-query or per-token energy data | None currently |
| Independently measured | Third-party researchers have directly measured energy consumption | Google (some models), OpenAI (GPT-4o) |
| Scaled estimate | Estimated from measured models using architecture, pricing, and parameter scaling | Most OpenAI, Google, Mistral models |
| No provider data | No direct measurements; estimated from public information only | Anthropic, DeepSeek |
The clean energy adjustment factor (CEAF) adjusts gross (location-based) emissions to account for a provider's verified clean energy procurement. A provider that purchases renewable energy certificates (RECs) or holds long-term power purchase agreements (PPAs) will have lower market-based emissions.
| Provider | Grid CIF tier | CEAF % | Basis |
|---|---|---|---|
| Google | 1.0 (lowest) | 66% | 24/7 CFE matching, published sustainability reports |
| OpenAI (Azure) | 1.5 | 50% | Microsoft renewable energy investments |
| Anthropic (multi-cloud) | 2.0 | 50% | AWS + GCP mix, estimated from cloud provider data |
| Mistral (Azure/AWS) | 1.5 | 50% | Hosted on major cloud providers with REC programmes |
| DeepSeek (China) | 3.0 | 0% | Chinese grid, no verified clean energy procurement |
| DeepSeek (Azure) | 1.5 | 50% | Hosted on Microsoft Azure |
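As a consistency check (an observation drawn from the two tables, not an official formula), the CEAF-adjusted values in the model table follow from multiplying the gross value by (1 − CEAF):

```python
# Market-based (CEAF-adjusted) value from the gross, location-based value.
# This reproduces the model table, e.g. GPT-4o: 0.42 * (1 - 0.50) = 0.21,
# Gemini 3 Flash: 0.30 * (1 - 0.66) ≈ 0.10, DeepSeek R1 (App): 7.00 * (1 - 0) = 7.00.
def ceaf_adjusted(gross_gco2e_per_1k: float, ceaf_fraction: float) -> float:
    return gross_gco2e_per_1k * (1 - ceaf_fraction)
```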
The same computation produces very different carbon emissions depending on where the data centre is located. A data centre powered by hydroelectric energy in Sweden produces a fraction of the carbon of one running on a coal-heavy grid.
We use grid carbon intensity data from the IEA (International Energy Agency, 2023) to assign a multiplier to each major cloud region. The baseline is US West (Oregon) at 1.0×.
| Region | Multiplier | gCO₂e / kWh | Source |
|---|---|---|---|
| EU North (Sweden/Norway) | 0.2× | 29 | IEA 2023 |
| EU West (France) | 0.48× | 56 | IEA 2023 |
| US West (Oregon) | 1.0× | 210 | IEA 2023 |
| United Kingdom | 1.2× | 207 | IEA 2023 |
| EU West (Ireland) | 1.4× | 296 | IEA 2023 |
| US East (Virginia) | 1.6× | 310 | IEA 2023 |
| US Midwest (Iowa) | 1.68× | 430 | IEA 2023 |
| EU Central (Frankfurt) | 1.8× | 350 | IEA 2023 |
| Asia Pacific (Tokyo) | 2.0× | 460 | IEA 2023 |
| Asia Pacific (Mumbai) | 2.08× | 630 | IEA 2023 |
| China East (Shanghai) | 2.12× | 550 | IEA 2023 |
| Asia Pacific (Singapore) | 2.2× | 490 | IEA 2023 |
In practice, most users don't choose which data centre processes their query — the provider routes it automatically. We auto-detect a likely region from your timezone, but you can change it manually. Most major AI providers (OpenAI, Google, Anthropic) route traffic primarily through US data centres for users outside Asia.
AI isn't just text. Image generation, audio processing, and video creation have very different energy profiles. We estimate these separately using per-unit factors.
The baseline is a 1024×1024 image at 25 diffusion steps, derived from Luccioni (2024)'s direct energy measurements of Stable Diffusion: approximately 2,282 joules (about 0.63 Wh) per image. Resolution scaling is non-linear: doubling the pixel count doesn't double the energy because of how diffusion models process images.
| Model | gCO₂e / image | Confidence |
|---|---|---|
| Stable Diffusion 3 | 1.8 | Medium — direct measurement |
| Gemini Image | 2.2 | Low |
| GPT-4o Image | 2.5 | Low |
| DALL-E 3 | 2.9 | Low |
| Midjourney v6 | 3.5 | Low |
| DALL-E 3 HD | 4.5 | Low |
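To combine these per-image values with the regional grid data above, the adjustment is a straight multiplication by the region multiplier. The sketch below assumes the per-image values in the table are referenced to the 1.0× US West baseline; that referencing is our assumption, not something the table states explicitly.

```python
# Illustrative sketch: applying the region multiplier to a multimodal estimate.
# Assumes the per-image values are referenced to the 1.0x US West baseline,
# which is our assumption rather than a documented fact.
def image_carbon_g(base_gco2e_per_image: float, region_multiplier: float) -> float:
    return base_gco2e_per_image * region_multiplier

# DALL-E 3 (2.9 gCO2e/image) generated in EU North (0.2x) vs Singapore (2.2x):
print(image_carbon_g(2.9, 0.2))  # ~0.58 gCO2e
print(image_carbon_g(2.9, 2.2))  # ~6.38 gCO2e
```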
For speech-to-text (ASR) models like Whisper, we derive per-minute energy from published benchmarks: Whisper Large v3 processes 22 hours of audio using approximately 0.35 kWh (roughly 0.27 Wh per minute of audio), giving a baseline of about 0.014 gCO₂e per minute. Text-to-speech (TTS) is estimated at roughly 2–3× the energy of transcription.
Video is by far the most carbon-intensive AI modality. A single second of AI-generated video at high quality can produce 55–58 gCO₂e. Our estimates are derived from direct measurements of CogVideoX (2025), with other models estimated relative to this.
AI inference runs on GPUs (or TPUs) that draw significant power. We estimate power draw based on model size tier, informed by Patel et al. (2024) and Samsi et al. (2023):
| Model tier | Estimated power draw | Examples |
|---|---|---|
| Small (<20B params) | 150–250 W | Claude Haiku, GPT-4o Mini, Gemini Flash |
| Medium (20–100B) | 300–500 W | GPT-4o, Claude Sonnet, Gemini Pro |
| Large (>100B) | 500–800 W | Claude Opus, GPT o3, DeepSeek R1 |
Data centres use additional energy for cooling, networking, and other overhead. We assume a PUE of 1.1, typical of hyperscale operators like Google, Microsoft, and AWS. This means for every 1 kWh of compute, the data centre uses 1.1 kWh in total. Older or smaller data centres can have PUEs of 1.4–1.6, which would increase the footprint proportionally.
For scenarios where token counting is unreliable (such as code generation inside artifacts, or tool-use responses), the browser extension can fall back to a duration-based estimate, sketched below.
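We don't reproduce the exact fallback formula here; the following is a minimal sketch assuming the tiered power draws and the 1.1 PUE described above. The function, the tier midpoints, and the example numbers are illustrative.

```python
# Duration-based fallback: energy from power draw and elapsed time, scaled by
# data-centre overhead (PUE) and grid carbon intensity. Illustrative only.
TIER_POWER_WATTS = {"small": 200, "medium": 400, "large": 650}  # midpoints of the tier ranges

def duration_based_carbon_g(duration_s: float, tier: str,
                            grid_gco2e_per_kwh: float, pue: float = 1.1) -> float:
    energy_kwh = TIER_POWER_WATTS[tier] * duration_s / 3_600_000  # W·s -> kWh
    return energy_kwh * pue * grid_gco2e_per_kwh

# A 20-second response from a medium-tier model on the US West grid (210 gCO2e/kWh):
print(duration_based_carbon_g(20, "medium", 210))  # ≈ 0.5 gCO2e
```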
We try to be upfront about what we don't know. Every estimate comes with an uncertainty range, calculated from five independent factors combined using root-sum-square (RSS):
| Factor | Uncertainty range | Notes |
|---|---|---|
| Model energy data | ±10–25% | Depends on whether peer-reviewed measurements exist |
| Token counting | ±2–15% | Best with tiktoken, worst with character heuristics |
| Grid carbon intensity | ±5–12% | Annual averages; real-time intensity varies by hour |
| Data centre PUE | ±8% | We assume 1.1; actual value varies by facility |
| Methodology | ±10% | Inherent limitations: batch sizes, KV cache, routing |
The composite uncertainty is then capped at a per-modality maximum.
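For illustration, the RSS combination looks like the following sketch. The factor values are the upper ends of the ranges in the table; the cap argument is a placeholder, since the per-modality caps aren't listed on this page.

```python
import math

# Root-sum-square combination of the independent uncertainty factors above.
def composite_uncertainty(factors: list[float], cap: float | None = None) -> float:
    rss = math.sqrt(sum(f ** 2 for f in factors))
    return min(rss, cap) if cap is not None else rss

# e.g. ±25% model energy, ±15% tokens, ±12% grid, ±8% PUE, ±10% methodology:
print(composite_uncertainty([0.25, 0.15, 0.12, 0.08, 0.10]))  # ≈ 0.34 (±34%)
```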
Alongside carbon, we calculate the financial cost of each query using the provider's published per-token API pricing for input and output.
We also calculate a carbon offset cost — the theoretical cost of offsetting the emissions through voluntary carbon credits, at approximately $12 per tonne of CO₂e (Gold Standard market average, 2024). For most individual queries, this is a tiny fraction of a cent, which helps put the scale of AI carbon emissions in context.
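A sketch of both calculations follows. The per-million-token rates in the example call are placeholders, not actual provider pricing; the $12-per-tonne offset figure is the one quoted above.

```python
# Query cost from per-token API pricing, plus a theoretical offset cost at
# ~$12 per tonne of CO2e. Example prices are placeholders, not real pricing.
OFFSET_USD_PER_TONNE = 12.0

def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_usd_per_m: float, output_usd_per_m: float) -> float:
    return (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000

def offset_cost_usd(carbon_g: float) -> float:
    return carbon_g / 1_000_000 * OFFSET_USD_PER_TONNE

print(query_cost_usd(150, 400, 2.50, 10.00))  # placeholder rates: ≈ $0.0044
print(offset_cost_usd(0.21))                  # ≈ $0.0000025 (a tiny fraction of a cent)
```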
TokenTrace doesn't just estimate your carbon footprint — it tells you how much carbon you saved (or added) compared to a reference baseline. This section explains how we define that baseline, what counts as a "saving", and why we think the approach is defensible.
Every carbon comparison in TokenTrace is measured against a single reference scenario: the same prompt sent to GPT-4o, generating a medium-length (300-word) response, in the same geographic region. In formula terms:

baseline_carbon (gCO₂e) = (input_tokens + 400) / 1,000 × 0.21

Where `input_tokens` is your actual prompt re-tokenised at GPT-4o's rate (~4 characters per token), 400 is the estimated output tokens for a 300-word response, and 0.21 is GPT-4o's CEAF-adjusted carbon per 1K tokens.
A positive saving means you chose a lower-carbon option. A negative saving means your choice used more carbon than the baseline.
We chose GPT-4o as the baseline model for three reasons:
We use a medium-length (300-word, ~400-token) response as the baseline output length. This is supported by published research on real-world AI conversations:
Our 300-word (400-token) baseline falls midway between these two reference points. We believe this is the fairest choice: it avoids inflating savings (which a shorter baseline would do) and avoids understating savings (which a longer baseline would do).
Carbon savings come from four user decisions, each of which changes the actual carbon footprint relative to the baseline:
| Decision | How it creates a saving | Example |
|---|---|---|
| Model selection | Choosing a model with a lower `carbon_per_1k_ceaf_adj` value | Gemini 2.5 Flash-Lite (0.09) vs GPT-4o (0.21) = 57% saving |
| Response length | Requesting a shorter response reduces output tokens | Short (100 words) vs baseline (300 words) = fewer tokens processed |
| Prompt optimisation | Writing concise prompts reduces input tokens | A 20-word prompt vs a 200-word prompt saves input-side computation |
| Region awareness | Not a direct saving in comparison (same region used for both baseline and actual), but understanding regional impact informs provider choices | Sweden (0.2×) vs Singapore (2.2×) = 11× difference |
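As a worked example of the model-selection row above, here is a sketch of the comparison in code. The token counts are chosen for illustration; the 0.21 and 0.09 values are the CEAF-adjusted figures from the model table.

```python
# Saving relative to the GPT-4o baseline for the same prompt and a medium response.
def carbon_g(input_tokens: int, output_tokens: int, per_1k: float) -> float:
    return (input_tokens + output_tokens) / 1000 * per_1k

baseline = carbon_g(150, 400, 0.21)   # GPT-4o, medium-length response
actual = carbon_g(150, 400, 0.09)     # Gemini 2.5 Flash-Lite, same scenario
saving = baseline - actual            # positive = lower-carbon choice
print(f"{saving / baseline:.0%} saving")  # 57%
```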
We want to be transparent about the boundaries of this approach:
On your dashboard, we sum individual query savings over time to show your total carbon impact. This uses the same per-query formula applied to each tracked query (from the browser extension, calculator, or Try It Live feature). Cumulative savings are most meaningful when viewed as a trend rather than an absolute number.
The v2.0 methodology is based on the following TokenTrace reference papers (March 2026), which provide provider-specific carbon intensity estimates:
Our methodology draws on the following published research:
If you spot an error in our methodology, have access to better data, or want to suggest improvements, we'd genuinely love to hear from you. Getting this right matters, and we know we don't have all the answers.