TokenTrace.ai

How We Calculate Inference Carbon

Last updated: 14 March 2026 · Methodology v2.0.0

TokenTrace estimates the inference carbon footprint of AI queries using a combination of published research, publicly available data, and reasonable assumptions. This page explains exactly how we do it, where the numbers come from, and where we think the limitations are.

Honest caveat: No AI provider currently publishes per-query energy data. Our v2.0 estimates incorporate provider-specific grid intensity (CIF) and clean energy procurement (CEAF) data from the TokenTrace reference papers (March 2026), but they remain estimates — not precise measurements. We think approximate awareness is far better than no awareness at all.

Contents

  1. The Core Formula (v2.0)
  2. Token Counting
  3. Model Carbon Intensity
  4. Provider Disclosure Tiers
  5. Clean Energy Adjustment Factor (CEAF)
  6. Regional Grid Intensity
  7. Multimodal Estimation
  8. Energy & Infrastructure
  9. Uncertainty & Confidence
  10. Financial Cost
  11. Carbon Savings Methodology
  12. Reference Papers
  13. Sources & References

1. The Core Formula (v2.0)

For text-based AI queries, the carbon footprint is calculated as:

Carbon estimate (CEAF-adjusted):

carbon (gCO₂e) = (total_tokens / 1,000) × carbon_per_1k_ceaf_adj

Where:

  - total_tokens — input (prompt) tokens plus estimated output (response) tokens
  - carbon_per_1k_ceaf_adj — the model's CEAF-adjusted carbon intensity in gCO₂e per 1,000 tokens (see Section 3)

v2.0 change: In v1, a user-selected region_multiplier scaled the carbon estimate based on grid location. In v2.0, provider-specific grid intensity (CIF) and clean energy adjustments (CEAF) are baked directly into per-model values from the TokenTrace reference papers. The region selector is removed for text models. Multimodal models (image, audio, video) continue to use the region multiplier.

The idea is straightforward: more tokens means more computation, which means more energy, which means more carbon. The amount of carbon per unit of computation varies by model (larger models use more energy) and by provider (clean energy procurement and data centre location affect emissions).
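The core formula is simple enough to sketch in a few lines. This is a minimal illustration, assuming a lookup table of the CEAF-adjusted values from Section 3; the dictionary and function names are ours, not TokenTrace's actual implementation.

```python
# Minimal sketch of the v2.0 core formula for text models.
# Values below are CEAF-adjusted gCO2e per 1,000 tokens from the
# Section 3 table (a small subset, for illustration).

CARBON_PER_1K_CEAF_ADJ = {
    "gemini-2.5-flash-lite": 0.09,
    "gpt-4o": 0.21,
    "deepseek-r1-app": 7.00,
}

def estimate_carbon(total_tokens, model):
    """CEAF-adjusted carbon estimate in gCO2e for a text query."""
    return (total_tokens / 1_000) * CARBON_PER_1K_CEAF_ADJ[model]

# A short query on GPT-4o: 133 input + 400 output tokens.
print(round(estimate_carbon(533, "gpt-4o"), 3))  # 0.112
```

The same 533-token query on Gemini 2.5 Flash-Lite would come out at roughly 0.048 gCO₂e, which is the model-choice effect Section 11 quantifies as a "saving".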

2. Token Counting

AI models process text as tokens — roughly, sub-word units that average about 0.75 words each (or about 4 characters). The more tokens in your prompt and the model's response, the more computation is required.

Input tokens

We estimate input tokens from your prompt text. Where possible, we use the model's actual tokenizer for accuracy; otherwise, we fall back to character-based heuristics.

| Method | Used for | Accuracy |
| --- | --- | --- |
| tiktoken (BPE) | GPT-4o, GPT-4.1, GPT-5, GPT o3 | ±2% |
| 3.8 chars/token | Claude models | ±10% |
| 4.0 chars/token | Gemini models | ±12% |
| 4.0 chars/token | All other models | ±15% |
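The tokenizer-first, heuristic-fallback approach can be sketched as follows. This is illustrative: the function name is ours, and the model-family detection is a simplified stand-in for whatever routing TokenTrace actually uses.

```python
# Input-token estimation: use the model's real tokenizer when available
# (tiktoken for OpenAI models), otherwise fall back to the chars-per-token
# heuristics from the table above.

def estimate_input_tokens(text, model):
    if model.startswith("gpt"):
        try:
            import tiktoken  # optional dependency; BPE, ~±2% accuracy
            return len(tiktoken.encoding_for_model(model).encode(text))
        except Exception:
            pass  # tiktoken missing or model unknown: use the heuristic
    if model.startswith("claude"):
        ratio = 3.8   # ±10%
    elif model.startswith("gemini"):
        ratio = 4.0   # ±12%
    else:
        ratio = 4.0   # all other models, ±15%
    return max(1, round(len(text) / ratio))

print(estimate_input_tokens("How do transformers work?", "claude-sonnet-4.6"))  # 7
```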

Output tokens

Since we can't know in advance how long a model's response will be, we use preset estimates based on the response length you select:

| Response length | Approximate words | Estimated tokens |
| --- | --- | --- |
| Short | ~100 | 133 |
| Medium | ~300 | 400 |
| Long | ~1,000 | 1,333 |
| Very long | ~5,000 | 6,667 |

When you use the "Try It Live" feature, we replace these estimates with the actual token counts from the real API response.
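The preset-with-live-override behaviour amounts to a small lookup. A sketch, with names of our own choosing:

```python
# Output-token presets from the table above. When a real API response is
# available (Try It Live), its actual usage count replaces the preset.

OUTPUT_TOKEN_PRESETS = {
    "short": 133,
    "medium": 400,
    "long": 1_333,
    "very_long": 6_667,
}

def output_tokens(length, actual=None):
    """Preset estimate, unless a real usage count is supplied."""
    return actual if actual is not None else OUTPUT_TOKEN_PRESETS[length]

print(output_tokens("medium"))       # 400  (estimate)
print(output_tokens("medium", 287))  # 287  (actual API count wins)
```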

3. Model Carbon Intensity

Each text AI model has two key carbon values: a gross (location-based) estimate and a CEAF-adjusted (market-based) estimate that accounts for the provider's clean energy procurement. The CEAF-adjusted value is the primary metric shown throughout TokenTrace.

Our v2.0 values are derived from the TokenTrace reference papers (March 2026), which estimate per-token carbon intensity for the text models across the providers listed below, using the disclosure-tier methods described in Section 4.

Text models (sorted by CEAF-adjusted carbon, greenest first)

| Model | Provider | Gross (gCO₂e/1K tok) | CEAF-adj (gCO₂e/1K tok) | Range | CEAF % | Confidence |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | Google | 0.25 | 0.09 | 0.06–0.12 | 66% | Medium |
| Gemini 3.1 Flash-Lite | Google | 0.25 | 0.09 | 0.06–0.12 | 66% | Medium |
| Mistral Small 3 | Mistral | 0.18 | 0.09 | 0.05–0.15 | 50% | Low |
| Gemini 3 Flash | Google | 0.30 | 0.10 | 0.07–0.14 | 66% | Medium |
| GPT-4o Mini | OpenAI | 0.30 | 0.15 | 0.13–0.28 | 50% | Medium |
| Claude Haiku 4.5 | Anthropic | 0.31 | 0.16 | 0.10–0.25 | 50% | Low |
| GPT-5 Mini | OpenAI | 0.37 | 0.19 | 0.13–0.25 | 50% | Medium-low |
| Gemini 3 Pro | Google | 0.60 | 0.20 | 0.14–0.28 | 66% | Low |
| Gemini 3.1 Pro | Google | 0.60 | 0.20 | 0.14–0.28 | 66% | Low |
| GPT-4o | OpenAI | 0.42 | 0.21 | 0.13–0.32 | 50% | Medium |
| GPT-4.1 | OpenAI | 0.45 | 0.23 | 0.15–0.35 | 50% | Medium-low |
| Claude Sonnet 4.6 | Anthropic | 0.51 | 0.26 | 0.18–0.38 | 50% | Low |
| GPT-5 (standard) | OpenAI | 0.55 | 0.28 | 0.20–0.43 | 50% | Low |
| Mistral Medium 3 | Mistral | 0.55 | 0.28 | 0.18–0.40 | 50% | Low |
| DeepSeek V3 (Azure) | DeepSeek | 0.60 | 0.30 | 0.20–0.45 | 50% | Medium-low |
| Mistral Large 3 | Mistral | 0.75 | 0.38 | 0.25–0.55 | 50% | Medium-low |
| Claude Opus 4.6 | Anthropic | 0.78 | 0.39 | 0.28–0.55 | 50% | Low |
| Gemini 3 Flash (Thinking) | Google | 1.80 | 0.61 | 0.15–1.12 | 66% | Low |
| Claude Sonnet 4.6 (Thinking) | Anthropic | 1.50 | 0.75 | 0.38–2.25 | 50% | Very low |
| DeepSeek R1 (Azure) | DeepSeek | 1.80 | 0.90 | 0.50–1.75 | 50% | Low |
| Claude Opus 4.6 (Thinking) | Anthropic | 2.20 | 1.10 | 0.50–3.50 | 50% | Very low |
| GPT-5 (thinking) | OpenAI | 2.80 | 1.40 | 0.43–4.30 | 50% | Low |
| GPT-5.4 | OpenAI | 3.50 | 1.75 | 1.20–10.00 | 50% | Very low |
| DeepSeek V3 (App) | DeepSeek | 2.30 | 2.30 | 1.50–3.50 | 0% | Low |
| GPT o3 | OpenAI | 6.00 | 3.00 | 0.90–20.00 | 50% | Low |
| DeepSeek R1 (App) | DeepSeek | 7.00 | 7.00 | 4.00–12.00 | 0% | Very low |

Why the big range? The most efficient model (Gemini 2.5 Flash-Lite at 0.09 CEAF-adj) produces roughly 78× less carbon per token than the most intensive reasoning model (DeepSeek R1 App at 7.00). This stems from model size, architecture, reasoning overhead, and crucially, the provider's clean energy procurement. Choosing a smaller model for simple tasks remains the most effective way to reduce your AI carbon footprint.

4. Provider Disclosure Tiers

Not all carbon estimates are created equal. The confidence of our per-model values depends on how much data the provider discloses. We classify each model into one of four tiers:

| Tier | Description | Providers |
| --- | --- | --- |
| Provider-disclosed | Provider publishes per-query or per-token energy data | None currently |
| Independently measured | Third-party researchers have directly measured energy consumption | Google (some models), OpenAI (GPT-4o) |
| Scaled estimate | Estimated from measured models using architecture, pricing, and parameter scaling | Most OpenAI, Google, Mistral models |
| No provider data | No direct measurements; estimated from public information only | Anthropic, DeepSeek |

5. Clean Energy Adjustment Factor (CEAF)

The CEAF adjusts gross (location-based) emissions to account for a provider's verified clean energy procurement. A provider that purchases renewable energy certificates (RECs) or has long-term power purchase agreements (PPAs) will have lower market-based emissions.

CEAF adjustment:

carbon_ceaf_adj = carbon_gross × (1 − CEAF)
| Provider | Grid CIF tier | CEAF % | Basis |
| --- | --- | --- | --- |
| Google | 1.0 (lowest) | 66% | 24/7 CFE matching, published sustainability reports |
| OpenAI (Azure) | 1.5 | 50% | Microsoft renewable energy investments |
| Anthropic (multi-cloud) | 2.0 | 50% | AWS + GCP mix, estimated from cloud provider data |
| Mistral (Azure/AWS) | 1.5 | 50% | Hosted on major cloud providers with REC programmes |
| DeepSeek (China) | 3.0 | 0% | Chinese grid, no verified clean energy procurement |
| DeepSeek (Azure) | 1.5 | 50% | Hosted on Microsoft Azure |

CEAF limitations: The CEAF is based on annual averages and corporate-level claims. Real-time clean energy matching varies hourly. Google's 24/7 CFE programme is the most granular; other providers may over-claim on an hourly basis. We apply conservative CEAF values and plan to update them as more granular data becomes available.
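The CEAF adjustment is a one-line calculation, which we can check against the published tables. A sketch (function name is ours):

```python
# Market-based adjustment: carbon_ceaf_adj = carbon_gross * (1 - CEAF),
# with CEAF expressed as a fraction (0.50 for 50%).

def apply_ceaf(carbon_gross, ceaf):
    return carbon_gross * (1 - ceaf)

# GPT-4o: gross 0.42 with OpenAI's 50% CEAF gives the 0.21 shown in Section 3.
print(round(apply_ceaf(0.42, 0.50), 2))  # 0.21
# DeepSeek (China): 0% CEAF, so gross and market-based values are equal.
print(apply_ceaf(7.00, 0.0))  # 7.0
```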

6. Regional Grid Intensity

v2.0 note: For text models, regional grid intensity is now embedded in the per-model CIF/CEAF values (see Section 5). The region selector and multiplier below apply only to multimodal models (image, audio, video).

The same computation produces very different carbon emissions depending on where the data centre is located. A data centre powered by hydroelectric energy in Sweden produces a fraction of the carbon of one running on a coal-heavy grid.

We use grid carbon intensity data from the IEA (International Energy Agency, 2023) to assign a multiplier to each major cloud region. The baseline is US West (Oregon) at 1.0×.

| Region | Multiplier | gCO₂e/kWh | Source |
| --- | --- | --- | --- |
| EU North (Sweden/Norway) | 0.2× | 29 | IEA 2023 |
| EU West (France) | 0.48× | 56 | IEA 2023 |
| US West (Oregon) | 1.0× | 210 | IEA 2023 |
| United Kingdom | 1.2× | 207 | IEA 2023 |
| EU West (Ireland) | 1.4× | 296 | IEA 2023 |
| US East (Virginia) | 1.6× | 310 | IEA 2023 |
| EU Central (Frankfurt) | 1.8× | 350 | IEA 2023 |
| US Midwest (Iowa) | 1.68× | 430 | IEA 2023 |
| Asia Pacific (Tokyo) | 2.0× | 460 | IEA 2023 |
| Asia Pacific (Mumbai) | 2.08× | 630 | IEA 2023 |
| China East (Shanghai) | 2.12× | 550 | IEA 2023 |
| Asia Pacific (Singapore) | 2.2× | 490 | IEA 2023 |

In practice, most users don't choose which data centre processes their query — the provider routes it automatically. We auto-detect a likely region from your timezone, but you can change it manually. Most major AI providers (OpenAI, Google, Anthropic) route traffic primarily through US data centres for users outside Asia.

7. Multimodal Estimation

AI isn't just text. Image generation, audio processing, and video creation have very different energy profiles. We estimate these separately using per-unit factors.

7.1 Image generation

Image carbon:

carbon = carbon_per_image × resolution_multiplier × steps_multiplier × region_multiplier

The baseline is a 1024×1024 image at 25 diffusion steps, derived from direct energy measurements by Luccioni (2024) of Stable Diffusion: approximately 2,282 joules per image. Resolution scaling is non-linear — doubling the pixel count doesn't double the energy because of how diffusion models process images.

| Model | gCO₂e / image | Confidence |
| --- | --- | --- |
| Stable Diffusion 3 | 1.8 | Medium — direct measurement |
| Gemini Image | 2.2 | Low |
| GPT-4o Image | 2.5 | Low |
| DALL-E 3 | 2.9 | Low |
| Midjourney v6 | 3.5 | Low |
| DALL-E 3 HD | 4.5 | Low |
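As a worked example of the image formula, here is a sketch using the per-image baseline from the table above. The multiplier defaults of 1.0 represent the 1024×1024, 25-step baseline; real resolution and step curves are non-linear, so these are placeholders.

```python
# Image-generation carbon per Section 7.1. Per-image baselines come from
# the table above; multiplier values other than 1.0 are illustrative.

def image_carbon(carbon_per_image, resolution_multiplier=1.0,
                 steps_multiplier=1.0, region_multiplier=1.0):
    return (carbon_per_image * resolution_multiplier
            * steps_multiplier * region_multiplier)

# DALL-E 3 (2.9 gCO2e baseline) served from a Swedish-grid region (0.2x):
print(round(image_carbon(2.9, region_multiplier=0.2), 2))  # 0.58
```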

7.2 Audio processing

Audio carbon:

carbon = carbon_per_minute × duration_minutes × region_multiplier

For speech-to-text (ASR) models like Whisper, we derive per-minute energy from published benchmarks: Whisper Large v3 processes 22 hours of audio using approximately 0.35 kWh, giving a baseline of about 0.014 gCO₂e per minute. Text-to-speech (TTS) is estimated at roughly 2–3× the energy of transcription.

7.3 Video generation

Video carbon:

carbon = carbon_per_second × duration_seconds × quality_multiplier × region_multiplier

Video is by far the most carbon-intensive AI modality. A single second of AI-generated video at high quality can produce 55–58 gCO₂e. Our estimates are derived from direct measurements of CogVideoX (2025), with other models estimated relative to this.

Video uncertainty is high (±50–60%). No standardised benchmarks exist for AI video generation energy. These figures should be treated as rough order-of-magnitude estimates.

8. Energy & Infrastructure

8.1 GPU power consumption

AI inference runs on GPUs (or TPUs) that draw significant power. We estimate power draw based on model size tier, informed by Patel et al. (2024) and Samsi et al. (2023):

| Model tier | Estimated power draw | Examples |
| --- | --- | --- |
| Small (<20B params) | 150–250 W | Claude Haiku, GPT-4o Mini, Gemini Flash |
| Medium (20–100B) | 300–500 W | GPT-4o, Claude Sonnet, Gemini Pro |
| Large (>100B) | 500–800 W | Claude Opus, GPT o3, DeepSeek R1 |

8.2 Power Usage Effectiveness (PUE)

Data centres use additional energy for cooling, networking, and other overhead. We assume a PUE of 1.1, typical of hyperscale operators like Google, Microsoft, and AWS. This means for every 1 kWh of compute, the data centre uses 1.1 kWh in total. Older or smaller data centres can have PUEs of 1.4–1.6, which would increase the footprint proportionally.

8.3 Duration-based fallback

For scenarios where token counting is unreliable (such as code generation inside artifacts, or tool-use responses), the browser extension can fall back to a duration-based estimate:

Duration-based estimate:

energy (Wh) = generation_seconds × watts / 3,600
carbon (gCO₂e) = (energy_Wh / 1,000) × grid_intensity × PUE

where grid_intensity is in gCO₂e/kWh, so the energy is converted from Wh to kWh before multiplying.
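The fallback path, with units made explicit, looks like this. The 400 W figure is a mid-tier value from Section 8.1 and the grid intensity is the US West value from Section 6; the function name is ours.

```python
# Duration-based fallback (Section 8.3): seconds of generation -> Wh -> gCO2e.
# grid_g_per_kwh is grid intensity in gCO2e/kWh; pue defaults to 1.1.

def duration_carbon(generation_seconds, watts, grid_g_per_kwh, pue=1.1):
    energy_wh = generation_seconds * watts / 3_600
    return (energy_wh / 1_000) * grid_g_per_kwh * pue

# 20 s on a ~400 W mid-tier GPU, US West grid (210 gCO2e/kWh):
print(round(duration_carbon(20, 400, 210), 3))  # 0.513
```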

9. Uncertainty & Confidence

We try to be upfront about what we don't know. Every estimate comes with an uncertainty range, calculated from five independent factors combined using root-sum-square (RSS):

| Factor | Uncertainty range | Notes |
| --- | --- | --- |
| Model energy data | ±10–25% | Depends on whether peer-reviewed measurements exist |
| Token counting | ±2–15% | Best with tiktoken, worst with character heuristics |
| Grid carbon intensity | ±5–12% | Annual averages; real-time intensity varies by hour |
| Data centre PUE | ±8% | We assume 1.1; actual value varies by facility |
| Methodology | ±10% | Inherent limitations: batch sizes, KV cache, routing |
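Root-sum-square combination treats the five factors as independent, so the composite is smaller than their straight sum. A sketch, with illustrative mid-range inputs drawn from the table above:

```python
# RSS combination of independent fractional uncertainties.
import math

def composite_uncertainty(factors):
    return math.sqrt(sum(f * f for f in factors))

# Mid-range picks: model energy, token counting, grid, PUE, methodology.
factors = [0.18, 0.08, 0.08, 0.08, 0.10]
print(f"±{composite_uncertainty(factors):.0%}")  # ±25%
```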

The composite uncertainty is then capped at a maximum value per modality.

What this means in practice: If we estimate a query at 0.50 gCO₂e with ±30% uncertainty, the actual value is likely between 0.35 and 0.65 gCO₂e. The relative comparisons between models and regions are more reliable than the absolute numbers — a model that shows 2× the carbon of another really is significantly more energy-intensive.

10. Financial Cost

Alongside carbon, we calculate the financial cost of each query using the provider's published API pricing:

Financial cost:

cost = (input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price

We also calculate a carbon offset cost — the theoretical cost of offsetting the emissions through voluntary carbon credits, at approximately $12 per tonne of CO₂e (Gold Standard market average, 2024). For most individual queries, this is a tiny fraction of a cent, which helps put the scale of AI carbon emissions in context.
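Both calculations can be sketched together. The GPT-4o rates used below ($2.50 per million input tokens, $10.00 per million output tokens) are illustrative values; check the provider's current pricing page.

```python
# Financial cost (Section 10) and offset cost at ~$12/tonne CO2e.

def query_cost(input_tokens, output_tokens, input_price, output_price):
    """API cost in USD; prices are per million tokens."""
    return ((input_tokens / 1_000_000) * input_price
            + (output_tokens / 1_000_000) * output_price)

def offset_cost(carbon_g, usd_per_tonne=12.0):
    """Voluntary offset cost in USD: grams -> tonnes, then price per tonne."""
    return (carbon_g / 1_000_000) * usd_per_tonne

# A short GPT-4o query (133 in / 400 out) at the illustrative rates above:
print(f"${query_cost(133, 400, 2.50, 10.00):.5f}")  # $0.00433
print(f"${offset_cost(0.112):.9f}")                 # $0.000001344
```

The offset cost of a typical query is on the order of a millionth of a dollar, which is the scale contrast the page describes.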

11. Carbon Savings Methodology

TokenTrace doesn't just estimate your carbon footprint — it tells you how much carbon you saved (or added) compared to a reference baseline. This section explains how we define that baseline, what counts as a "saving", and why we think the approach is defensible.

11.1 The baseline: GPT-4o at 300 words

Every carbon comparison in TokenTrace is measured against a single reference scenario: the same prompt sent to GPT-4o, generating a medium-length (300-word) response, in the same geographic region. In formula terms:

Baseline carbon (v2.0):

baseline (gCO₂e) = ((input_tokens + 400) / 1,000) × 0.21

Where input_tokens is your actual prompt re-tokenised at GPT-4o's rate (~4 characters per token), 400 is the estimated output tokens for a 300-word response, and 0.21 is GPT-4o's CEAF-adjusted carbon per 1K tokens.

Carbon saving:

saving (gCO₂e) = baseline − actual

A positive saving means you chose a lower-carbon option. A negative saving means your choice used more carbon than the baseline.
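Putting the two formulas together, here is a worked example of a saving, using the Section 3 values (the function names are ours):

```python
# Baseline: same prompt sent to GPT-4o (0.21 gCO2e/1K tokens, CEAF-adjusted)
# with a 400-token (300-word) response.

def baseline_carbon(input_tokens):
    return ((input_tokens + 400) / 1_000) * 0.21

def carbon_saving(input_tokens, actual_carbon):
    return baseline_carbon(input_tokens) - actual_carbon

# A 50-token prompt answered instead by Gemini 2.5 Flash-Lite (0.09 g/1K):
actual = ((50 + 400) / 1_000) * 0.09
print(round(carbon_saving(50, actual), 4))  # 0.054 (positive: lower-carbon choice)
```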

11.2 Why GPT-4o?

We chose GPT-4o as the baseline model for three reasons:

  1. Market share: ChatGPT holds approximately 68% of the AI chatbot market (Similarweb, January 2026), and GPT-4o is its default model for both free and paid users. It is the single most widely used AI model in the world.
  2. Industry standard: GPT-4o is the de facto reference model used by researchers (Epoch AI, 2025), benchmarking platforms, and competing providers (Google, Anthropic, Meta) for performance and energy comparisons.
  3. Mid-range carbon intensity: At 0.21 gCO₂e per 1,000 tokens (CEAF-adjusted), GPT-4o sits in the middle of our model range (0.09–7.00). It is neither the cleanest nor the most carbon-intensive option, making it a fair yardstick rather than a cherry-picked extreme.

11.3 Why 300 words?

We use a medium-length (300-word, ~400-token) response as the baseline output length. This is supported by published research on real-world AI conversations:

  - Zheng et al. (2024), the LMSYS-Chat-1M dataset, found an average response length of 214.5 tokens across one million real conversations.
  - Epoch AI (2025) uses 500 output tokens as a typical estimate in its GPT-4o energy calculations.

Our 300-word (400-token) baseline falls midway between these two reference points. We believe this is the fairest choice: it avoids inflating savings (which a shorter baseline would do) and avoids understating savings (which a longer baseline would do).

11.4 What counts as a saving?

Carbon savings come from four user decisions, each of which changes the actual carbon footprint relative to the baseline:

| Decision | How it creates a saving | Example |
| --- | --- | --- |
| Model selection | Choosing a model with a lower carbon_per_1k_ceaf_adj value | Gemini 2.5 Flash-Lite (0.09) vs GPT-4o (0.21) = 57% saving |
| Response length | Requesting a shorter response reduces output tokens | Short (100 words) vs baseline (300 words) = fewer tokens processed |
| Prompt optimisation | Writing concise prompts reduces input tokens | A 20-word prompt vs a 200-word prompt saves input-side computation |
| Region awareness | Not a direct saving in comparison (same region used for both baseline and actual), but understanding regional impact informs provider choices | Sweden (0.2×) vs Singapore (2.2×) = 11× difference |

Important: The baseline uses your actual prompt and your selected region. This means the comparison isolates the effect of your model choice and response length — the two decisions most directly within your control.

11.5 Honest limitations

We want to be transparent about the boundaries of this approach.

11.6 Cumulative savings

On your dashboard, we sum individual query savings over time to show your total carbon impact. This uses the same per-query formula applied to each tracked query (from the browser extension, calculator, or Try It Live feature). Cumulative savings are most meaningful when viewed as a trend rather than an absolute number.

A note on "savings" vs "reductions": TokenTrace shows how your AI usage compares to a reference scenario. We deliberately use the word "saving" (vs baseline) rather than "reduction" (vs your own past behaviour) because we are comparing against a hypothetical, not tracking a personal journey over time. Both framings have value; ours is designed to be immediately actionable for every query.

12. Reference Papers

The v2.0 methodology is based on the TokenTrace reference papers (March 2026), which provide the provider-specific carbon intensity estimates used throughout this page.

13. Sources & References

Our methodology draws on the following published research:

  1. Luccioni, A.S., Viguier, S., & Ligozat, A.-L. (2023). "Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model." Journal of Machine Learning Research, 24(253), 1–15.
    Direct measurements of model energy per token — our primary baseline source.
  2. Luccioni, A.S. (2024). "Measuring the Energy Consumption of AI Image Generation."
    Direct energy measurements of Stable Diffusion at various resolutions and step counts.
  3. Patterson, D., et al. (2021, 2022). "Carbon Emissions and Large Neural Network Training." arXiv / IEEE Computer.
    Energy scaling laws, GPU utilisation patterns, and efficiency trends.
  4. Dodge, J., et al. (2022). "Measuring the Carbon Intensity of AI in Cloud Instances." FAccT 2022.
    Uncertainty quantification methods for ML carbon estimation.
  5. Patel, A., et al. (2024). "Splitwise: Efficient GPU Inference with Model Parallelism." ASPLOS 2024.
    GPU power draw at 60–80% TDP during inference workloads.
  6. Jegham, I., et al. (2025). "Towards Sustainable AI: A Comprehensive Framework for Green Large Language Models." arXiv:2505.09598v6.
    Independent energy measurements per token across major LLM providers — the primary anchor data source for all v2 carbon estimates.
  7. Google (2025). "Measuring Environmental Impact of AI Inference." arXiv:2508.15734.
    Google’s own measurement of 0.24 Wh per median Gemini prompt — cross-validates the Jegham framework.
  8. Altman, S. (June 2025). "The Gentle Singularity."
    CEO disclosure of 0.34 Wh average ChatGPT query — the anchor data point for all OpenAI estimates.
  9. Mistral AI / Carbone 4 (2025). "Mistral Large 2 Life Cycle Assessment."
    Provider-published, peer-reviewed LCA — the only independently audited per-token carbon figure in the industry (1.14 gCO₂e/400 tokens).
  10. Samsi, S., et al. (2023). "Scaling Large Language Models on Edge Accelerators." IEEE HPEC 2023.
    Energy per token estimates for large models.
  11. International Energy Agency (2023, 2025). "Emission Factors" / "Electricity Mid-Year Update 2025."
    Regional and national grid carbon intensity data (gCO₂e/kWh), including Chinese grid CIF forecasts.
  12. EPA eGRID (2024). "Emissions & Generation Resource Integrated Database."
    US grid carbon intensity data (370 gCO₂e/kWh) used for OpenAI and Anthropic gross estimates.
  13. Ember (2024). "Global Electricity Review."
    Chinese grid carbon intensity (581 gCO₂e/kWh) used for DeepSeek domestic estimates.
  14. GHG Protocol (2025). "Scope 2 Guidance: Location-Based and Market-Based Accounting."
    Framework for our dual-reporting approach (gross location-based vs. CEAF market-adjusted).
  15. Artificial Analysis (2026). "LLM Throughput Benchmarks."
    Third-party throughput measurements (tokens/second) used for scaling estimates across all providers.
  16. DeepSeek (2024, 2025). "DeepSeek-V3 Technical Report" / "DeepSeek-R1 Paper."
    Architecture details (MoE, 671B/37B active parameters) and training methodology.
  17. Zheng, L., Chiang, W., et al. (2024). "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." ICLR 2024.
    Average response length of 214.5 tokens across 1M real conversations — used to validate our 300-word baseline.
  18. Epoch AI (2025). "How Much Energy Does ChatGPT Use?"
    Uses 500 output tokens as a typical estimate for GPT-4o energy calculations — upper bound for our baseline validation.

Feedback

If you spot an error in our methodology, have access to better data, or want to suggest improvements, we'd genuinely love to hear from you. Getting this right matters, and we know we don't have all the answers.