Last updated: 14 March 2026 · Methodology v2.0.0
TokenTrace estimates the inference carbon footprint of AI queries using a combination of published research, publicly available data, and reasonable assumptions. This page explains exactly how we do it, where the numbers come from, and where we think the limitations are.
For text-based AI queries, the carbon footprint is calculated as:

carbon (gCO₂e) = (input_tokens + output_tokens) / 1,000 × carbon_per_1k_ceaf_adj

Where:

- `input_tokens` and `output_tokens` are the estimated token counts for your prompt and the model's response.
- `carbon_per_1k_ceaf_adj` is the model's CEAF-adjusted carbon intensity per 1,000 tokens (see the per-model table below).
- The region multiplier, which scales the carbon estimate based on grid location, is no longer applied to text models. In v2.0, provider-specific grid intensity (CIF) and clean energy adjustments (CEAF) are baked directly into the per-model values from the TokenTrace reference papers, so the region selector is removed for text models. Multimodal models (image, audio, video) continue to use the region multiplier.
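As a worked illustration (a minimal sketch, not the production implementation), here is the same formula in Python, using GPT-4o's CEAF-adjusted value of 0.21 gCO₂e per 1K tokens from the model table below; the function name and token counts are chosen for the example.

```python
# Minimal sketch of the per-query text formula. 0.21 is GPT-4o's CEAF-adjusted
# gCO2e per 1,000 tokens from the model table below; the function name and
# structure are illustrative, not TokenTrace's actual code.
def text_query_carbon_g(input_tokens: int, output_tokens: int,
                        carbon_per_1k_ceaf_adj: float) -> float:
    return (input_tokens + output_tokens) / 1000 * carbon_per_1k_ceaf_adj

# A 150-token prompt with a medium (~400-token) GPT-4o response:
# (150 + 400) / 1000 * 0.21 ≈ 0.12 gCO2e
print(text_query_carbon_g(150, 400, 0.21))
```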
The idea is straightforward: more tokens means more computation, which means more energy, which means more carbon. The amount of carbon per unit of computation varies by model (larger models use more energy) and by provider (clean energy procurement and data centre location affect emissions).
AI models process text as tokens — roughly, sub-word units that average about 0.75 words each (or about 4 characters). The more tokens in your prompt and the model's response, the more computation is required.
We estimate input tokens from your prompt text. Where possible, we use the model's actual tokenizer for accuracy; otherwise, we fall back to character-based heuristics.
| Method | Used for | Accuracy |
|---|---|---|
| tiktoken (BPE) | GPT-4o, GPT-4.1, GPT-5, GPT o3 | ±2% |
| 3.8 chars/token | Claude models | ±10% |
| 4.0 chars/token | Gemini models | ±12% |
| 4.0 chars/token | All other models | ±15% |
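For illustration, a sketch of how these fallbacks might be wired together. The helper function and the model-to-heuristic mapping below are our own assumptions for the example, not TokenTrace's actual code; only the tiktoken call and the chars-per-token ratios come from the table above.

```python
import tiktoken

# Character-per-token heuristics from the table above; mapping keys and the
# helper function are illustrative assumptions.
CHARS_PER_TOKEN = {"claude": 3.8, "gemini": 4.0}
DEFAULT_CHARS_PER_TOKEN = 4.0

def estimate_input_tokens(prompt: str, model: str) -> int:
    try:
        # Exact BPE count for models tiktoken recognises (±2%)
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(prompt))
    except KeyError:
        # Character-based fallback for other providers (±10–15%)
        ratio = next((v for k, v in CHARS_PER_TOKEN.items() if k in model.lower()),
                     DEFAULT_CHARS_PER_TOKEN)
        return max(1, round(len(prompt) / ratio))
```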
Since we can't know in advance how long a model's response will be, we use preset estimates based on the response length you select:
| Response length | Approximate words | Estimated tokens |
|---|---|---|
| Short | ~100 | 133 |
| Medium | ~300 | 400 |
| Long | ~1,000 | 1,333 |
| Very long | ~5,000 | 6,667 |
When you use the "Try It Live" feature, we replace these estimates with the actual token counts from the real API response.
Each text AI model has two key carbon values: a gross (location-based) estimate and a CEAF-adjusted (market-based) estimate that accounts for the provider's clean energy procurement. The CEAF-adjusted value is the primary metric shown throughout TokenTrace.
Our v2.0 values are derived from the TokenTrace reference papers (March 2026), which estimate per-token carbon intensity for 25 text models across 6 providers using a combination of independent measurements, scaling from measured models (architecture, pricing, and parameter count), and public information, as reflected in the confidence tiers below. All carbon values in the table are gCO₂e per 1,000 tokens.
| Model | Provider | Gross | CEAF-adj | Range | CEAF % | Confidence |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | Google | 0.25 | 0.09 | 0.06–0.12 | 66% | Medium |
| Gemini 3.1 Flash-Lite | Google | 0.25 | 0.09 | 0.06–0.12 | 66% | Medium |
| Mistral Small 3 | Mistral | 0.18 | 0.09 | 0.05–0.15 | 50% | Low |
| Gemini 3 Flash | Google | 0.30 | 0.10 | 0.07–0.14 | 66% | Medium |
| GPT-4o Mini | OpenAI | 0.30 | 0.15 | 0.13–0.28 | 50% | Medium |
| Claude Haiku 4.5 | Anthropic | 0.31 | 0.16 | 0.10–0.25 | 50% | Low |
| GPT-5 Mini | OpenAI | 0.37 | 0.19 | 0.13–0.25 | 50% | Medium-low |
| Gemini 3 Pro | Google | 0.60 | 0.20 | 0.14–0.28 | 66% | Low |
| Gemini 3.1 Pro | Google | 0.60 | 0.20 | 0.14–0.28 | 66% | Low |
| GPT-4o | OpenAI | 0.42 | 0.21 | 0.13–0.32 | 50% | Medium |
| GPT-4.1 | OpenAI | 0.45 | 0.23 | 0.15–0.35 | 50% | Medium-low |
| Claude Sonnet 4.6 | Anthropic | 0.51 | 0.26 | 0.18–0.38 | 50% | Low |
| GPT-5 (standard) | OpenAI | 0.55 | 0.28 | 0.20–0.43 | 50% | Low |
| Mistral Medium 3 | Mistral | 0.55 | 0.28 | 0.18–0.40 | 50% | Low |
| DeepSeek V3 (Azure) | DeepSeek | 0.60 | 0.30 | 0.20–0.45 | 50% | Medium-low |
| Mistral Large 3 | Mistral | 0.75 | 0.38 | 0.25–0.55 | 50% | Medium-low |
| Claude Opus 4.6 | Anthropic | 0.78 | 0.39 | 0.28–0.55 | 50% | Low |
| Gemini 3 Flash (Thinking) | Google | 1.80 | 0.61 | 0.15–1.12 | 66% | Low |
| Claude Sonnet 4.6 (Thinking) | Anthropic | 1.50 | 0.75 | 0.38–2.25 | 50% | Very low |
| DeepSeek R1 (Azure) | DeepSeek | 1.80 | 0.90 | 0.50–1.75 | 50% | Low |
| Claude Opus 4.6 (Thinking) | Anthropic | 2.20 | 1.10 | 0.50–3.50 | 50% | Very low |
| GPT-5 (thinking) | OpenAI | 2.80 | 1.40 | 0.43–4.30 | 50% | Low |
| GPT-5.4 | OpenAI | 3.50 | 1.75 | 1.20–10.00 | 50% | Very low |
| DeepSeek V3 (App) | DeepSeek | 2.30 | 2.30 | 1.50–3.50 | 0% | Low |
| GPT o3 | OpenAI | 6.00 | 3.00 | 0.90–20.00 | 50% | Low |
| DeepSeek R1 (App) | DeepSeek | 7.00 | 7.00 | 4.00–12.00 | 0% | Very low |
Not all carbon estimates are created equal. The confidence of our per-model values depends on how much data the provider discloses. We classify each model into one of four tiers:
| Tier | Description | Providers |
|---|---|---|
| Provider-disclosed | Provider publishes per-query or per-token energy data | None currently |
| Independently measured | Third-party researchers have directly measured energy consumption | Google (some models), OpenAI (GPT-4o) |
| Scaled estimate | Estimated from measured models using architecture, pricing, and parameter scaling | Most OpenAI, Google, Mistral models |
| No provider data | No direct measurements; estimated from public information only | Anthropic, DeepSeek |
The clean energy adjustment factor (CEAF) adjusts gross (location-based) emissions to account for a provider's verified clean energy procurement. A provider that purchases renewable energy certificates (RECs) or holds long-term power purchase agreements (PPAs) will have lower market-based emissions.
| Provider | Grid CIF tier | CEAF % | Basis |
|---|---|---|---|
| Google | 1.0 (lowest) | 66% | 24/7 CFE matching, published sustainability reports |
| OpenAI (Azure) | 1.5 | 50% | Microsoft renewable energy investments |
| Anthropic (multi-cloud) | 2.0 | 50% | AWS + GCP mix, estimated from cloud provider data |
| Mistral (Azure/AWS) | 1.5 | 50% | Hosted on major cloud providers with REC programmes |
| DeepSeek (China) | 3.0 | 0% | Chinese grid, no verified clean energy procurement |
| DeepSeek (Azure) | 1.5 | 50% | Hosted on Microsoft Azure |
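As a consistency check (an observation drawn from the two tables, not an official formula), the CEAF-adjusted values in the model table follow from multiplying the gross value by (1 − CEAF):

```python
# Market-based (CEAF-adjusted) value from the gross, location-based value.
# This reproduces the model table, e.g. GPT-4o: 0.42 * (1 - 0.50) = 0.21,
# Gemini 3 Flash: 0.30 * (1 - 0.66) ≈ 0.10, DeepSeek R1 (App): 7.00 * (1 - 0) = 7.00.
def ceaf_adjusted(gross_gco2e_per_1k: float, ceaf_fraction: float) -> float:
    return gross_gco2e_per_1k * (1 - ceaf_fraction)
```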
The same computation produces very different carbon emissions depending on where the data centre is located. A data centre powered by hydroelectric energy in Sweden produces a fraction of the carbon of one running on a coal-heavy grid.
We use grid carbon intensity data from the IEA (International Energy Agency, 2023) to assign a multiplier to each major cloud region. The baseline is US West (Oregon) at 1.0×.
| Region | Multiplier | gCO₂e / kWh | Source |
|---|---|---|---|
| EU North (Sweden/Norway) | 0.2× | 29 | IEA 2023 |
| EU West (France) | 0.48× | 56 | IEA 2023 |
| US West (Oregon) | 1.0× | 210 | IEA 2023 |
| United Kingdom | 1.2× | 207 | IEA 2023 |
| EU West (Ireland) | 1.4× | 296 | IEA 2023 |
| US East (Virginia) | 1.6× | 310 | IEA 2023 |
| US Midwest (Iowa) | 1.68× | 430 | IEA 2023 |
| EU Central (Frankfurt) | 1.8× | 350 | IEA 2023 |
| Asia Pacific (Tokyo) | 2.0× | 460 | IEA 2023 |
| Asia Pacific (Mumbai) | 2.08× | 630 | IEA 2023 |
| China East (Shanghai) | 2.12× | 550 | IEA 2023 |
| Asia Pacific (Singapore) | 2.2× | 490 | IEA 2023 |
In practice, most users don't choose which data centre processes their query — the provider routes it automatically. We auto-detect a likely region from your timezone, but you can change it manually. Most major AI providers (OpenAI, Google, Anthropic) route traffic primarily through US data centres for users outside Asia.
AI isn't just text. Image generation, audio processing, and video creation have very different energy profiles. We estimate these separately using per-unit factors.
The baseline is a 1024×1024 image at 25 diffusion steps, derived from Luccioni (2024)'s direct energy measurements of Stable Diffusion: approximately 2,282 joules (about 0.63 Wh) per image. Resolution scaling is non-linear: doubling the pixel count doesn't double the energy because of how diffusion models process images.
| Model | gCO₂e / image | Confidence |
|---|---|---|
| Stable Diffusion 3 | 1.8 | Medium — direct measurement |
| Gemini Image | 2.2 | Low |
| GPT-4o Image | 2.5 | Low |
| DALL-E 3 | 2.9 | Low |
| Midjourney v6 | 3.5 | Low |
| DALL-E 3 HD | 4.5 | Low |
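To combine these per-image values with the regional grid data above, the adjustment is a straight multiplication by the region multiplier. The sketch below assumes the per-image values in the table are referenced to the 1.0× US West baseline; that referencing is our assumption, not something the table states explicitly.

```python
# Illustrative sketch: applying the region multiplier to a multimodal estimate.
# Assumes the per-image values are referenced to the 1.0x US West baseline,
# which is our assumption rather than a documented fact.
def image_carbon_g(base_gco2e_per_image: float, region_multiplier: float) -> float:
    return base_gco2e_per_image * region_multiplier

# DALL-E 3 (2.9 gCO2e/image) generated in EU North (0.2x) vs Singapore (2.2x):
print(image_carbon_g(2.9, 0.2))  # ~0.58 gCO2e
print(image_carbon_g(2.9, 2.2))  # ~6.38 gCO2e
```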
For speech-to-text (ASR) models like Whisper, we derive per-minute energy from published benchmarks: Whisper Large v3 processes 22 hours of audio using approximately 0.35 kWh (roughly 0.27 Wh per minute of audio), giving a baseline of about 0.014 gCO₂e per minute. Text-to-speech (TTS) is estimated at roughly 2–3× the energy of transcription.
Video is by far the most carbon-intensive AI modality. A single second of AI-generated video at high quality can produce 55–58 gCO₂e. Our estimates are derived from direct measurements of CogVideoX (2025), with other models estimated relative to this.
AI inference runs on GPUs (or TPUs) that draw significant power. We estimate power draw based on model size tier, informed by Patel et al. (2024) and Samsi et al. (2023):
| Model tier | Estimated power draw | Examples |
|---|---|---|
| Small (<20B params) | 150–250 W | Claude Haiku, GPT-4o Mini, Gemini Flash |
| Medium (20–100B) | 300–500 W | GPT-4o, Claude Sonnet, Gemini Pro |
| Large (>100B) | 500–800 W | Claude Opus, GPT o3, DeepSeek R1 |
Data centres use additional energy for cooling, networking, and other overhead. We assume a PUE of 1.1, typical of hyperscale operators like Google, Microsoft, and AWS. This means for every 1 kWh of compute, the data centre uses 1.1 kWh in total. Older or smaller data centres can have PUEs of 1.4–1.6, which would increase the footprint proportionally.
For scenarios where token counting is unreliable (such as code generation inside artifacts, or tool-use responses), the browser extension can fall back to a duration-based estimate, sketched below.
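We don't reproduce the exact fallback formula here; the following is a minimal sketch assuming the tiered power draws and the 1.1 PUE described above. The function, the tier midpoints, and the example numbers are illustrative.

```python
# Duration-based fallback: energy from power draw and elapsed time, scaled by
# data-centre overhead (PUE) and grid carbon intensity. Illustrative only.
TIER_POWER_WATTS = {"small": 200, "medium": 400, "large": 650}  # midpoints of the tier ranges

def duration_based_carbon_g(duration_s: float, tier: str,
                            grid_gco2e_per_kwh: float, pue: float = 1.1) -> float:
    energy_kwh = TIER_POWER_WATTS[tier] * duration_s / 3_600_000  # W·s -> kWh
    return energy_kwh * pue * grid_gco2e_per_kwh

# A 20-second response from a medium-tier model on the US West grid (210 gCO2e/kWh):
print(duration_based_carbon_g(20, "medium", 210))  # ≈ 0.5 gCO2e
```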
We try to be upfront about what we don't know. Every estimate comes with an uncertainty range, calculated from five independent factors combined using root-sum-square (RSS):
| Factor | Uncertainty range | Notes |
|---|---|---|
| Model energy data | ±10–25% | Depends on whether peer-reviewed measurements exist |
| Token counting | ±2–15% | Best with tiktoken, worst with character heuristics |
| Grid carbon intensity | ±5–12% | Annual averages; real-time intensity varies by hour |
| Data centre PUE | ±8% | We assume 1.1; actual value varies by facility |
| Methodology | ±10% | Inherent limitations: batch sizes, KV cache, routing |
The composite uncertainty is then capped at a per-modality maximum.
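For illustration, the RSS combination looks like the following sketch. The factor values are the upper ends of the ranges in the table; the cap argument is a placeholder, since the per-modality caps aren't listed on this page.

```python
import math

# Root-sum-square combination of the independent uncertainty factors above.
def composite_uncertainty(factors: list[float], cap: float | None = None) -> float:
    rss = math.sqrt(sum(f ** 2 for f in factors))
    return min(rss, cap) if cap is not None else rss

# e.g. ±25% model energy, ±15% tokens, ±12% grid, ±8% PUE, ±10% methodology:
print(composite_uncertainty([0.25, 0.15, 0.12, 0.08, 0.10]))  # ≈ 0.34 (±34%)
```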
Alongside carbon, we calculate the financial cost of each query using the provider's published per-token API pricing for input and output.
We also calculate a carbon offset cost — the theoretical cost of offsetting the emissions through voluntary carbon credits, at approximately $12 per tonne of CO₂e (Gold Standard market average, 2024). For most individual queries, this is a tiny fraction of a cent, which helps put the scale of AI carbon emissions in context.
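A sketch of both calculations follows. The per-million-token rates in the example call are placeholders, not actual provider pricing; the $12-per-tonne offset figure is the one quoted above.

```python
# Query cost from per-token API pricing, plus a theoretical offset cost at
# ~$12 per tonne of CO2e. Example prices are placeholders, not real pricing.
OFFSET_USD_PER_TONNE = 12.0

def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_usd_per_m: float, output_usd_per_m: float) -> float:
    return (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000

def offset_cost_usd(carbon_g: float) -> float:
    return carbon_g / 1_000_000 * OFFSET_USD_PER_TONNE

print(query_cost_usd(150, 400, 2.50, 10.00))  # placeholder rates: ≈ $0.0044
print(offset_cost_usd(0.21))                  # ≈ $0.0000025 (a tiny fraction of a cent)
```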
TokenTrace doesn't just estimate your carbon footprint — it tells you how much carbon you saved (or added) compared to a reference baseline. This section explains how we define that baseline, what counts as a "saving", and why we think the approach is defensible.
Every carbon comparison in TokenTrace is measured against a single reference scenario: the same prompt sent to GPT-4o, generating a medium-length (300-word) response, in the same geographic region. In formula terms:

baseline_carbon (gCO₂e) = (input_tokens + 400) / 1,000 × 0.21

Where `input_tokens` is your actual prompt re-tokenised at GPT-4o's rate (~4 characters per token), 400 is the estimated output tokens for a 300-word response, and 0.21 is GPT-4o's CEAF-adjusted carbon per 1K tokens.
A positive saving means you chose a lower-carbon option. A negative saving means your choice used more carbon than the baseline.
We chose GPT-4o as the baseline model for three reasons:
We use a medium-length (300-word, ~400-token) response as the baseline output length. This is supported by published research on real-world AI conversations:
Our 300-word (400-token) baseline falls midway between these two reference points. We believe this is the fairest choice: it avoids inflating savings (which a shorter baseline would do) and avoids understating savings (which a longer baseline would do).
Carbon savings come from four user decisions, each of which changes the actual carbon footprint relative to the baseline:
| Decision | How it creates a saving | Example |
|---|---|---|
| Model selection | Choosing a model with a lower `carbon_per_1k_ceaf_adj` value | Gemini 2.5 Flash-Lite (0.09) vs GPT-4o (0.21) = 57% saving |
| Response length | Requesting a shorter response reduces output tokens | Short (100 words) vs baseline (300 words) = fewer tokens processed |
| Prompt optimisation | Writing concise prompts reduces input tokens | A 20-word prompt vs a 200-word prompt saves input-side computation |
| Region awareness | Not a direct saving in comparison (same region used for both baseline and actual), but understanding regional impact informs provider choices | Sweden (0.2×) vs Singapore (2.2×) = 11× difference |
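As a worked example of the model-selection row above, here is a sketch of the comparison in code. The token counts are chosen for illustration; the 0.21 and 0.09 values are the CEAF-adjusted figures from the model table.

```python
# Saving relative to the GPT-4o baseline for the same prompt and a medium response.
def carbon_g(input_tokens: int, output_tokens: int, per_1k: float) -> float:
    return (input_tokens + output_tokens) / 1000 * per_1k

baseline = carbon_g(150, 400, 0.21)   # GPT-4o, medium-length response
actual = carbon_g(150, 400, 0.09)     # Gemini 2.5 Flash-Lite, same scenario
saving = baseline - actual            # positive = lower-carbon choice
print(f"{saving / baseline:.0%} saving")  # 57%
```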
We want to be transparent about the boundaries of this approach:
On your dashboard, we sum individual query savings over time to show your total carbon impact. This uses the same per-query formula applied to each tracked query (from the browser extension, calculator, or Try It Live feature). Cumulative savings are most meaningful when viewed as a trend rather than an absolute number.
The v2.0 methodology is based on the following TokenTrace reference papers (March 2026), which provide provider-specific carbon intensity estimates:
Our methodology draws on the following published research:
If you spot an error in our methodology, have access to better data, or want to suggest improvements, we'd genuinely love to hear from you. Getting this right matters, and we know we don't have all the answers.