[mirror] fix(block_cost_config): audit + correct stale LLM/block rates + migrate generic ReplicateModelBlock to COST_USD #5

Open
yashwant86 wants to merge 8 commits into mm-base-12912 from mm-pr-12912

Conversation


@yashwant86 yashwant86 commented Apr 26, 2026

Mirror of upstream Significant-Gravitas#12912 for benchmark. Do not merge.


Summary by MergeMonkey

  • Docs Updates:
    • Updated docstrings and comments for Replicate block cost tracking and LLM/block rate audits.
  • Fresh Additions:
    • Replicate block now emits provider_cost via predict_time metrics for accurate per-second billing instead of flat RUN charge.
  • Fixes & Patches:
    • Corrected stale LLM token rates (Grok, DeepSeek, Mistral, Kimi, Perplexity) to match current pricing.
    • Fixed Unreal Speech block to bill per-character USD instead of flat 5 credits.
    • Adjusted FAL video generator rate from 3 to 15 credits/second to match actual pricing.
    • Normalized COST_USD block margins (Jina, ZeroBounce, Unreal) to consistent 150 cr/$ baseline.
  • Maintenance:
    • Updated .secrets.baseline line numbers and timestamp.

majdyz added 8 commits April 24, 2026 22:35
PR Significant-Gravitas#12909 refresh set GPT-5 to 94/1500 cr/1M, which corresponds to a
$0.625/$5 provider rate — that's OpenAI's Batch API tier (50% off the
Sync rate). Most block calls go through the Sync API; the correct Standard
rate is $1.25/$10 per 1M, which at our 1.5x margin = 188/1500 cr/1M.

This was under-billing every GPT-5 call by 2x on input.

Source: https://openai.com/api/pricing — GPT-5 Standard pricing.
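The margin arithmetic above generalizes across the TOKEN_COST corrections in this PR: provider $/1M-token rate times a 150 cr/$ multiplier (100 cr/$ at cost with the 1.5x margin folded in). A minimal sketch; the half-up rounding convention is inferred from the corrected rates, not stated anywhere in the source:

```python
import math

CREDITS_PER_USD = 150  # assumed baseline: 100 cr/$ at cost x 1.5 margin

def credits_per_million(usd_per_million_tokens: float) -> int:
    """Convert a provider $/1M-token rate into platform credits per 1M tokens.

    Half-up rounding is an inference from the corrected rates in this PR
    (e.g. $1.25 -> 188, $0.15 -> 23), not confirmed by the source.
    """
    return math.floor(usd_per_million_tokens * CREDITS_PER_USD + 0.5)

# GPT-5 Standard (Sync): $1.25 in / $10.00 out per 1M -> 188 / 1500 cr
gpt5 = (credits_per_million(1.25), credits_per_million(10.0))

# DeepSeek v4-flash: $0.14 in / $0.28 out per 1M -> 21 / 42 cr
deepseek = (credits_per_million(0.14), credits_per_million(0.28))
```

The same formula reproduces the Mistral ($0.50/$1.50 -> 75/225) and Kimi K2-0905 ($0.60/$2.50 -> 90/375) corrections listed below.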
Beyond the GPT-5 Standard-vs-Batch fix, verified all TOKEN_COST entries
against each provider's current pricing page. Additional corrections:

- DEEPSEEK_CHAT: 42/63 -> 21/42 (provider unified deepseek-chat +
  deepseek-reasoner to deepseek-v4-flash $0.14/$0.28 in Sept 2025)
- DEEPSEEK_R1_0528: 82/329 -> 21/42 (same v4-flash routing)
- MISTRAL_LARGE_3: 300/900 -> 75/225 (Mistral dropped to $0.50/$1.50)
- MISTRAL_NEMO: 3/6 -> 23/23 (was severely under-billing; provider is
  $0.15 flat for both input and output)
- KIMI_K2_0905: 82/330 -> 90/375 (matches current K2-0905 $0.60/$2.50)
- META_LLAMA_4_MAVERICK: 30/90 -> 75/116 (Groq prices $0.50/$0.77;
  note Groq deprecated this 2026-02-20 — consider retiring enum)

Provider sources: openai.com/api/pricing, api-docs.deepseek.com,
mistral.ai/pricing, platform.kimi.ai/docs/pricing, groq.com/pricing.
Cross-verified via agent-browser for JS-rendered docs.x.ai + DeepSeek.

All 40 cost-pipeline unit tests pass.
Full audit against provider pricing pages uncovered 10 more stale
entries beyond the LLM token rates:

Under-billing (was losing money):
- AIVideoGeneratorBlock (FAL): SECOND 3 -> 15 cr/s
  (provider is $0.05-$0.30/s depending on tier; 3 cr only covered
  $0.02/s models)
- CreateTalkingAvatarVideoBlock (D-ID): RUN 15 -> 100 cr
  (D-ID charges $5.90/min; 15 cr was ~10x under for a median 10-sec
  clip at $0.98 real cost)
- Nano Banana Pro / Nano Banana 2 (3 blocks each): RUN 14 -> 21 cr
  (provider $0.14/image, 14 cr was under cost-of-goods)

Over-billing (normalizing margin to 1.5x baseline):
- IdeogramModelBlock default: RUN 16 -> 12 cr
- IdeogramModelBlock V_3: RUN 18 -> 14 cr
- AIImageEditorBlock FLUX_KONTEXT_MAX: RUN 20 -> 12 cr
- ValidateEmailsBlock (ZeroBounce): COST_USD 250 -> 150 cr/$
- SearchTheWebBlock (Jina): COST_USD 100 -> 150 cr/$
- GetLinkedinProfilePictureBlock: RUN 3 -> 1 cr

Tests updated to match new FAL 15 cr/s rate (was 3 cr/s in 2 tests).

Sources: replicate.com, fal.ai, d-id.com, ideogram.ai, zerobounce.net,
jina.ai. Cross-verified via agent-browser for JS-rendered docs.x.ai
(Grok prices already correct at 300/900 for Grok 4.20 @ $2/$6).
…k legacy doc

- KIMI_K2_5: 90/450 -> 66/300 (OpenRouter pass-through $0.44/$2)
- KIMI_K2_6: 143/600 -> 112/698 (OpenRouter pass-through $0.7448/$4.655)
- UnrealTextToSpeechBlock: RUN 5 cr -> COST_USD 150 cr/$. Block now
  computes USD from len(text) * $0.000016 (Unreal Speech $16/1M chars)
  and emits cost_usd via merge_stats. Long narrations no longer under-bill.
- Grok legacy (grok-3, grok-4-0709, grok-4-fast, grok-code-fast-1):
  rates were already correct at their launch pricing; added inline
  comment noting the docs.x.ai page no longer lists them publicly but
  the API + historical rates remain valid.
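The per-character Unreal Speech computation described above can be sketched as follows. The $16/1M-character rate and the 150 cr/$ conversion are from the commit; the function name and the ceil() billing step are assumptions:

```python
import math

UNREAL_USD_PER_1M_CHARS = 16  # Unreal Speech list price: $16 per 1M characters
CREDITS_PER_USD = 150         # platform COST_USD conversion (margin baked in)

def unreal_speech_cost(text: str) -> tuple[float, int]:
    """Return (provider_cost_usd, credits_billed) for one TTS request.

    Mirrors the migration described above; hypothetical helper, not the
    actual block code.
    """
    provider_cost = len(text) * UNREAL_USD_PER_1M_CHARS / 1_000_000
    credits = math.ceil(provider_cost * CREDITS_PER_USD)
    return provider_cost, credits

# A 12,345-char narration: ~$0.198 provider cost -> 30 cr,
# vs. the old flat 5 cr (a ~6x under-bill at this length).
cost, credits = unreal_speech_cost("x" * 12_345)
```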
…enRouter floor

ReplicateModelBlock takes ANY model ref as input. Flat 10 cr/run
was 10-500x under-billing long video/LLM runs ($1-$50+) and 20x
over-billing tiny SDXL. Block now uses predictions.async_create +
async_wait to read prediction.metrics.predict_time after completion,
emits (predict_time * $0.0014/s) as provider_cost, billed at
COST_USD 150 cr/$. $0.0014/s is the Nvidia L40S mid-tier rate where
most popular public models run.

Also: MISTRAL_LARGE_3 and MISTRAL_NEMO in TOKEN_COST are the safety
floor for OpenRouter-routed calls (ModelMetadata.provider =
'open_router'). Rates now match OpenRouter's pass-through pricing
instead of Mistral-direct's /v1/chat rates, which we never call.
Addresses Sentry bug prediction on MISTRAL_NEMO being 'higher than
actual cost from OpenRouter'.

- ReplicateModelBlock: RUN 10 -> COST_USD 150 cr/$ (dynamic billing)
- ReplicateFluxAdvancedModelBlock: unchanged (bounded to Flux models
  $0.04-$0.08, flat 10 cr stays within 1.25-2.5x margin)
- MISTRAL_LARGE_3: 75/225 -> 300/900 (OpenRouter $2/$6)
- MISTRAL_NEMO: 23/23 -> 5/5 (OpenRouter $0.035/$0.035)
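The post-completion handling described above (status check, then metered billing) can be sketched with a stand-in prediction object. The $0.0014/s rate, the status guard, and the skip-billing-on-empty-metrics behavior are from the commits; the dataclass and helper are hypothetical, not the actual block code:

```python
from dataclasses import dataclass
from typing import Any, Optional

_REPLICATE_USD_PER_SEC = 0.0014  # Nvidia L40S mid-tier rate (from the commit)

@dataclass
class FakePrediction:
    """Minimal stand-in for replicate's Prediction object (fields per the PR)."""
    status: str
    output: Any = None
    metrics: Optional[dict] = None

def settle(prediction: FakePrediction) -> tuple[Any, Optional[float]]:
    """Post-async_wait handling sketched from the commit messages.

    async_wait() returns normally even on a failed prediction, so status
    must be checked explicitly before reading output or billing. Returns
    (output, provider_cost_usd), with None cost when billing is skipped
    because metrics or predict_time are missing or zero.
    """
    if prediction.status in ("failed", "canceled"):
        raise RuntimeError(f"Replicate prediction {prediction.status}")
    predict_time = (prediction.metrics or {}).get("predict_time") or 0
    provider_cost = predict_time * _REPLICATE_USD_PER_SEC if predict_time else None
    return prediction.output, provider_cost
```

A 30-second run yields 30 * $0.0014 = $0.042 provider cost, which the COST_USD resolver then bills at 150 cr/$; a failed run raises before any billing instead of charging partial compute.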
…ling path

Adds 7 unit tests for the refactored run_model:
- Uses version= keyword when model_ref has ':' (pinned version)
- Uses model= keyword otherwise (unpinned 'owner/name')
- Emits provider_cost = predict_time * $0.0014/s via merge_stats
- async_wait is awaited before reading metrics
- Skips merge_stats when metrics missing OR predict_time is 0
  (avoids silent wallet-free leak if SDK quirks return empty metrics)
- Sanity-checks _REPLICATE_USD_PER_SEC is in the Replicate hardware
  tier range ($0.0005-$0.002/s)
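The first two tests above can be sketched against a hypothetical helper that picks the keyword argument for predictions.async_create. The ':'-means-pinned rule is from the test list; the helper name and the exact value passed for version= are assumptions:

```python
def create_kwargs(model_ref: str, inputs: dict) -> dict:
    """Pick version= for pinned 'owner/name:hash' refs, model= otherwise.

    Splitting the hash out of the ref is an assumption about the real
    implementation, not confirmed by the source.
    """
    if ":" in model_ref:
        return {"version": model_ref.split(":", 1)[1], "input": inputs}
    return {"model": model_ref, "input": inputs}

# Pinned ref -> version= keyword
pinned = create_kwargs("stability-ai/sdxl:abc123", {"prompt": "cat"})

# Unpinned ref -> model= keyword
unpinned = create_kwargs("stability-ai/sdxl", {"prompt": "cat"})
```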

SDK surface confirmed against installed replicate==* package:
- Predictions.async_create(model=, version=, input=) — matches
- Prediction.metrics is Optional[Dict] — matches
- Prediction.async_wait exists — matches
…lling

async_wait() returns normally regardless of prediction terminal status
— only async_run raises ModelError on 'failed'. Without an explicit
status check we'd bill partial compute time on a failed run, yield
empty output via extract_result(None), and hardcode 'status: succeeded'
hiding the failure.

Check prediction.status after async_wait and raise before merge_stats
so failures surface as exceptions (caught by run() and re-raised as
BlockExecutionError). Also guard against output=None on succeeded
predictions (type-narrowing for extract_result).

Addresses CodeRabbit critical on Significant-Gravitas#12912.
…ates

Three tests on CI were still asserting old values:
- UnrealTextToSpeech tests assumed provider_cost=len(text) with type
  'characters'. Updated to assert provider_cost=len(text)*$0.000016
  with type 'cost_usd' per the text_to_speech_block migration.
- ZeroBounce ValidateEmailsBlock cost_amount test assumed 250, now 150
  after the margin alignment in this PR.

bot-mergemonkey Bot commented Apr 26, 2026

Risk Assessment: CRITICAL · ~45 min review

Focus areas: Replicate block status/output validation order · LLM rate accuracy (DeepSeek, Kimi, Mistral, Grok) · COST_USD margin normalization (150 cr/$ baseline justification) · FAL video 5× rate increase verification

Assessment: Refactors billing logic for Replicate block and audits/corrects LLM+block rates across 10+ models.

Walkthrough

User calls ReplicateModelBlock.run_model() with a model reference and inputs. The block now parses the reference to determine if it's version-pinned (contains ':') and calls predictions.async_create with either version= or model= keyword. After awaiting async_wait(), it checks prediction.status for 'failed' or 'canceled' and raises if found. If metrics.predict_time exists and is non-zero, it emits provider_cost = predict_time * $0.0014/sec as cost_usd via merge_stats. Finally, it extracts and returns the output. The COST_USD resolver then bills ceil(provider_cost * 150) credits.
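The final billing step in the walkthrough, ceil(provider_cost * 150), can be written out as a one-line sketch (function name hypothetical):

```python
import math

def cost_usd_to_credits(provider_cost_usd: float, credits_per_usd: int = 150) -> int:
    """COST_USD resolver step from the walkthrough: ceil(cost_usd * 150 cr/$)."""
    return math.ceil(provider_cost_usd * credits_per_usd)

# A 30 s prediction at $0.0014/s -> $0.042 -> 7 credits
replicate_credits = cost_usd_to_credits(30 * 0.0014)
```

Note the asymmetry with the rate table: per-call USD costs round up (the platform never under-bills a metered call), while the static cr/1M token rates appear to round half-up.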

Changes

  • Replicate Block Cost Tracking Refactor (autogpt_platform/backend/backend/blocks/replicate/replicate_block.py, replicate_block_cost_test.py):
    Migrated ReplicateModelBlock from flat RUN billing to dynamic COST_USD via predictions.async_create + predict_time metrics. Emits provider_cost = predict_time * $0.0014/sec; handles version-pinned vs unpinned refs, status validation, and gracefully skips billing when metrics unavailable.
  • Unreal Speech Block Cost Migration (autogpt_platform/backend/backend/blocks/text_to_speech_block.py):
    Changed Unreal Speech billing from flat 5 credits to per-character USD ($0.000016/char). Block now emits provider_cost = len(text) * 0.000016 with cost_usd type for proportional billing.
  • Block Cost Configuration Audit (autogpt_platform/backend/backend/data/block_cost_config.py):
    Audited and corrected stale LLM rates (Grok, DeepSeek, Mistral, Kimi, Perplexity). Normalized COST_USD block margins to 150 cr/$ baseline (Jina 100→150, ZeroBounce 250→150, Unreal 5→150). Updated FAL video rate from 3 to 15 credits/second. Added detailed pricing comments for transparency.
  • Test Updates for Cost Changes (autogpt_platform/backend/backend/blocks/block_cost_tracking_test.py, autogpt_platform/backend/backend/data/block_cost_config_test.py, autogpt_platform/backend/backend/executor/block_usage_cost_test.py, autogpt_platform/backend/backend/copilot/tools/helpers_test.py):
    Updated test assertions to reflect new cost models: Unreal Speech character-based USD billing, FAL video 15 cr/s rate, ZeroBounce 150 cr/$ margin, and Replicate provider_cost emissions.
  • Secrets Baseline Maintenance (.secrets.baseline):
    Updated line number reference and timestamp due to code additions in replicate_block.py.

Sequence Diagram

sequenceDiagram
    participant User
    participant ReplicateBlock
    participant ReplicateClient
    participant Prediction
    participant StatsResolver
    User->>ReplicateBlock: run_model(model_ref, inputs, api_key)
    ReplicateBlock->>ReplicateBlock: Parse model_ref for ':'
    alt version-pinned
        ReplicateBlock->>ReplicateClient: predictions.async_create(version=...)
    else unpinned
        ReplicateBlock->>ReplicateClient: predictions.async_create(model=...)
    end
    ReplicateClient-->>Prediction: return Prediction object
    ReplicateBlock->>Prediction: async_wait()
    Prediction-->>ReplicateBlock: metrics populated
    ReplicateBlock->>ReplicateBlock: Check status
    alt status == 'failed' or 'canceled'
        ReplicateBlock-->>User: raise RuntimeError
    else status == 'succeeded'
        alt metrics.predict_time exists and > 0
            ReplicateBlock->>ReplicateBlock: merge_stats(provider_cost=predict_time*0.0014, cost_usd)
            ReplicateBlock->>StatsResolver: emit NodeExecutionStats
        end
        ReplicateBlock->>ReplicateBlock: extract_result(prediction.output)
        ReplicateBlock-->>User: return result
    end

Dig Deeper With Commands

  • /review <file-path> <function-optional>
  • /chat <file-path> "<question>"
  • /roast <file-path>

Runs only when explicitly triggered.
