[mirror] fix(block_cost_config): audit + correct stale LLM/block rates + migrate generic ReplicateModelBlock to COST_USD #5
Conversation
PR Significant-Gravitas#12909's refresh set GPT-5 to 94/1500 cr/1M, which corresponds to a $0.625/$5 provider rate; that is OpenAI's Batch API tier (50% off Sync). Most block calls go through the Sync API, where the correct Standard rate is $1.25/$10 per 1M, which at our 1.5x margin works out to 188/1500 cr/1M. This was under-billing every GPT-5 call by 2x on input. Source: https://openai.com/api/pricing (GPT-5 Standard pricing).
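For reference, a minimal sketch of the derivation, assuming the 1 credit = $0.01 cost basis implied by the 150 cr/$ figure above; the helper name and rounding convention are illustrative, not an existing function in the repo:

```python
# Illustrative: derive billed credits per 1M tokens from a provider's $/1M rate,
# assuming 100 credits == $1.00 at cost and the platform's 1.5x margin (150 cr/$).
CREDITS_PER_USD_AT_COST = 100
MARGIN = 1.5

def usd_per_million_to_credits(usd_per_million_tokens: float) -> int:
    """Convert a provider $/1M-token rate into billed credits per 1M tokens."""
    return round(usd_per_million_tokens * CREDITS_PER_USD_AT_COST * MARGIN)

# GPT-5 Standard (Sync) pricing:
assert usd_per_million_to_credits(1.25) == 188    # input:  $1.25/1M  -> 188 cr/1M
assert usd_per_million_to_credits(10.0) == 1500   # output: $10/1M    -> 1500 cr/1M

# The stale entry was derived from the Batch tier instead:
assert usd_per_million_to_credits(0.625) == 94    # input:  $0.625/1M -> 94 cr/1M
```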
Beyond the GPT-5 Standard-vs-Batch fix, verified all TOKEN_COST entries against each provider's current pricing page. Additional corrections:

- DEEPSEEK_CHAT: 42/63 -> 21/42 (provider unified deepseek-chat + deepseek-reasoner into deepseek-v4-flash at $0.14/$0.28 in Sept 2025)
- DEEPSEEK_R1_0528: 82/329 -> 21/42 (same v4-flash routing)
- MISTRAL_LARGE_3: 300/900 -> 75/225 (Mistral dropped to $0.50/$1.50)
- MISTRAL_NEMO: 3/6 -> 23/23 (was severely under-billing; provider charges a flat $0.15 for both input and output)
- KIMI_K2_0905: 82/330 -> 90/375 (matches current K2-0905 $0.60/$2.50)
- META_LLAMA_4_MAVERICK: 30/90 -> 75/116 (Groq prices $0.50/$0.77; Groq deprecated this model on 2026-02-20, so consider retiring the enum)

Provider sources: openai.com/api/pricing, api-docs.deepseek.com, mistral.ai/pricing, platform.kimi.ai/docs/pricing, groq.com/pricing. Cross-verified via agent-browser for JS-rendered docs.x.ai + DeepSeek pages. All 40 cost-pipeline unit tests pass.
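A quick spot-check sketch of the audit math (the dict shape is illustrative, not the actual TOKEN_COST structure in block_cost_config.py; USD prices are the ones cited above):

```python
# Spot-check of corrected credit rates against provider $/1M prices at 150 cr/$.
# Fractional results (e.g. MISTRAL_NEMO's $0.15 -> 22.5) were rounded up by hand
# in the config, so only the exact cases are asserted here.
PROVIDER_USD_PER_1M = {          # model: (input $/1M, output $/1M)
    "DEEPSEEK_CHAT": (0.14, 0.28),
    "MISTRAL_LARGE_3": (0.50, 1.50),
    "KIMI_K2_0905": (0.60, 2.50),
}
CORRECTED_CREDITS_PER_1M = {     # model: (input cr/1M, output cr/1M)
    "DEEPSEEK_CHAT": (21, 42),
    "MISTRAL_LARGE_3": (75, 225),
    "KIMI_K2_0905": (90, 375),
}

for model, (usd_in, usd_out) in PROVIDER_USD_PER_1M.items():
    expected = (round(usd_in * 150), round(usd_out * 150))
    assert CORRECTED_CREDITS_PER_1M[model] == expected, (model, expected)
```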
Full audit against provider pricing pages uncovered 10 more stale entries beyond the LLM token rates.

Under-billing (was losing money):

- AIVideoGeneratorBlock (FAL): SECOND 3 -> 15 cr/s (provider is $0.05-$0.30/s depending on tier; 3 cr only covered $0.02/s models)
- CreateTalkingAvatarVideoBlock (D-ID): RUN 15 -> 100 cr (D-ID charges $5.90/min; 15 cr was ~10x under for a median 10-sec clip at $0.98 real cost)
- Nano Banana Pro / Nano Banana 2 (3 blocks each): RUN 14 -> 21 cr (provider $0.14/image; 14 cr was under cost of goods)

Over-billing (normalizing margin to the 1.5x baseline):

- IdeogramModelBlock default: RUN 16 -> 12 cr
- IdeogramModelBlock V_3: RUN 18 -> 14 cr
- AIImageEditorBlock FLUX_KONTEXT_MAX: RUN 20 -> 12 cr
- ValidateEmailsBlock (ZeroBounce): COST_USD 250 -> 150 cr/$
- SearchTheWebBlock (Jina): COST_USD 100 -> 150 cr/$
- GetLinkedinProfilePictureBlock: RUN 3 -> 1 cr

Tests updated to match the new FAL 15 cr/s rate (was 3 cr/s in 2 tests). Sources: replicate.com, fal.ai, d-id.com, ideogram.ai, zerobounce.net, jina.ai. Cross-verified via agent-browser for JS-rendered docs.x.ai (Grok prices already correct at 300/900 for Grok 4.20 at $2/$6).
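For context, a minimal sketch of how the three cost types referenced above translate into billed credits. The enum and function are standalone illustrations mirroring the RUN / SECOND / COST_USD terminology in this PR, not the platform's actual BlockCostType or billing code:

```python
from enum import Enum

class CostType(Enum):
    RUN = "run"            # flat credits per block execution
    SECOND = "second"      # credits per second of runtime
    COST_USD = "cost_usd"  # credits per USD of reported provider cost

def billed_credits(cost_type: CostType, cost_amount: int,
                   seconds: float = 0.0, provider_cost_usd: float = 0.0) -> int:
    """Illustrative: how a configured cost_amount applies for each cost type."""
    if cost_type is CostType.RUN:
        return cost_amount                          # e.g. D-ID avatar: 100 cr/run
    if cost_type is CostType.SECOND:
        return round(cost_amount * seconds)         # e.g. FAL video: 15 cr/s
    return round(cost_amount * provider_cost_usd)   # e.g. ZeroBounce: 150 cr/$

# A 10-second FAL video at the corrected rate:
assert billed_credits(CostType.SECOND, 15, seconds=10) == 150
# $0.40 of ZeroBounce usage at the normalized 150 cr/$ margin:
assert billed_credits(CostType.COST_USD, 150, provider_cost_usd=0.40) == 60
```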
…k legacy doc

- KIMI_K2_5: 90/450 -> 66/300 (OpenRouter pass-through $0.44/$2)
- KIMI_K2_6: 143/600 -> 112/698 (OpenRouter pass-through $0.7448/$4.655)
- UnrealTextToSpeechBlock: RUN 5 cr -> COST_USD 150 cr/$. The block now computes USD from len(text) * $0.000016 (Unreal Speech $16/1M chars) and emits cost_usd via merge_stats, so long narrations no longer under-bill.
- Grok legacy (grok-3, grok-4-0709, grok-4-fast, grok-code-fast-1): rates were already correct at their launch pricing; added an inline comment noting that the docs.x.ai page no longer lists them publicly but the API and historical rates remain valid.
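A minimal sketch of the UnrealTextToSpeechBlock computation described above (the constant matches Unreal Speech's $16 per 1M characters; the function name is illustrative and the real block reports this figure through its stats hook):

```python
UNREAL_SPEECH_USD_PER_CHAR = 16 / 1_000_000  # $16 per 1M characters

def unreal_speech_provider_cost(text: str) -> float:
    """Illustrative: USD cost the block reports for one synthesis call."""
    return len(text) * UNREAL_SPEECH_USD_PER_CHAR

# A ~2,500-character narration costs ~$0.04 at the provider, i.e. roughly
# 6 credits at 150 cr/$, instead of the old flat 5 cr regardless of length.
narration = "x" * 2_500
assert abs(unreal_speech_provider_cost(narration) - 0.04) < 1e-9
```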
…enRouter floor

ReplicateModelBlock takes ANY model ref as input. A flat 10 cr/run was 10-500x under-billing long video/LLM runs ($1-$50+) and 20x over-billing tiny SDXL runs. The block now uses predictions.async_create + async_wait to read prediction.metrics.predict_time after completion, emits (predict_time * $0.0014/s) as provider_cost, and is billed at COST_USD 150 cr/$. $0.0014/s is the Nvidia L40S mid-tier rate where most popular public models run.

Also: MISTRAL_LARGE_3 and MISTRAL_NEMO in TOKEN_COST are the safety floor for OpenRouter-routed calls (ModelMetadata.provider = 'open_router'). Their rates now match OpenRouter's pass-through pricing instead of Mistral-direct's /v1/chat rates, which we never call. This addresses the Sentry bug prediction about MISTRAL_NEMO being 'higher than actual cost from OpenRouter'.

- ReplicateModelBlock: RUN 10 -> COST_USD 150 cr/$ (dynamic billing)
- ReplicateFluxAdvancedModelBlock: unchanged (bounded to Flux models at $0.04-$0.08, so flat 10 cr stays within a 1.25-2.5x margin)
- MISTRAL_LARGE_3: 75/225 -> 300/900 (OpenRouter $2/$6)
- MISTRAL_NEMO: 23/23 -> 5/5 (OpenRouter $0.035/$0.035)
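A condensed sketch of the new flow, based on the SDK surface confirmed later in this thread (predictions.async_create, async_wait, Prediction.metrics); error handling and the real merge_stats call are simplified, so treat this as an outline rather than the block's actual implementation:

```python
import replicate

_REPLICATE_USD_PER_SEC = 0.0014  # Nvidia L40S mid-tier rate

async def run_model_sketch(client: replicate.Client, model_ref: str, inputs: dict):
    # Version-pinned refs look like "owner/name:version-hash"; unpinned are "owner/name".
    if ":" in model_ref:
        version = model_ref.split(":", maxsplit=1)[1]
        prediction = await client.predictions.async_create(version=version, input=inputs)
    else:
        prediction = await client.predictions.async_create(model=model_ref, input=inputs)

    await prediction.async_wait()  # poll until the prediction reaches a terminal state

    provider_cost_usd = 0.0
    metrics = prediction.metrics or {}
    predict_time = metrics.get("predict_time") or 0
    if predict_time > 0:
        # In the real block this figure is emitted via merge_stats so the executor
        # bills it at COST_USD 150 cr/$; zero/missing metrics skip billing entirely.
        provider_cost_usd = predict_time * _REPLICATE_USD_PER_SEC

    return prediction.output, provider_cost_usd
```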
…ling path

Adds 7 unit tests for the refactored run_model:

- Uses the version= keyword when model_ref contains ':' (pinned version)
- Uses the model= keyword otherwise (unpinned 'owner/name')
- Emits provider_cost = predict_time * $0.0014/s via merge_stats
- async_wait is awaited before reading metrics
- Skips merge_stats when metrics are missing OR predict_time is 0 (avoids a silent wallet-free leak if SDK quirks return empty metrics)
- Sanity-checks that _REPLICATE_USD_PER_SEC is within the Replicate hardware tier range ($0.0005-$0.002/s)

SDK surface confirmed against the installed replicate==* package:

- Predictions.async_create(model=, version=, input=): matches
- Prediction.metrics is Optional[Dict]: matches
- Prediction.async_wait exists: matches
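As a self-contained illustration of the skip condition above (the helper mirrors the described logic; the real tests exercise run_model with a mocked Replicate client, and the names here are not the repo's):

```python
import pytest

_REPLICATE_USD_PER_SEC = 0.0014

def provider_cost_from_metrics(metrics: dict | None) -> float | None:
    """Illustrative skip logic: bill only when predict_time is present and > 0."""
    predict_time = (metrics or {}).get("predict_time") or 0
    if predict_time <= 0:
        return None  # no merge_stats call -> no billing
    return predict_time * _REPLICATE_USD_PER_SEC

@pytest.mark.parametrize(
    "metrics, expected",
    [
        ({"predict_time": 10.0}, 0.014),  # normal run: 10s * $0.0014/s
        ({"predict_time": 0}, None),      # zero predict_time: skip billing
        ({}, None),                       # empty metrics: skip billing
        (None, None),                     # metrics missing entirely: skip billing
    ],
)
def test_provider_cost_from_metrics(metrics, expected):
    result = provider_cost_from_metrics(metrics)
    if expected is None:
        assert result is None
    else:
        assert result == pytest.approx(expected)
```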
…lling

async_wait() returns normally regardless of the prediction's terminal status; only async_run raises ModelError on 'failed'. Without an explicit status check we would bill partial compute time on a failed run, yield empty output via extract_result(None), and hardcode 'status: succeeded', hiding the failure. Check prediction.status after async_wait and raise before merge_stats so failures surface as exceptions (caught by run() and re-raised as BlockExecutionError). Also guard against output=None on succeeded predictions (type-narrowing for extract_result). Addresses the CodeRabbit critical on Significant-Gravitas#12912.
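Roughly, the guard looks like this (exception type and message are illustrative; the real block lets run() convert the failure into a BlockExecutionError):

```python
def ensure_prediction_succeeded(prediction) -> None:
    """Illustrative: call right after async_wait(), before any billing."""
    # async_wait() returns normally on 'failed'/'canceled', so the terminal
    # status must be checked explicitly before merge_stats or output handling.
    if prediction.status != "succeeded":
        raise RuntimeError(
            f"Replicate prediction {prediction.id} ended with status "
            f"'{prediction.status}': {prediction.error}"
        )
    # Succeeded predictions can still report output=None; narrow the type
    # before handing it to extract_result.
    if prediction.output is None:
        raise RuntimeError(
            f"Replicate prediction {prediction.id} succeeded but returned no output"
        )
```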
…ates

Three tests on CI were still asserting old values:

- UnrealTextToSpeech tests assumed provider_cost=len(text) with type 'characters'. Updated to assert provider_cost=len(text)*$0.000016 with type 'cost_usd' per the text_to_speech_block migration.
- The ZeroBounce ValidateEmailsBlock cost_amount test assumed 250; now 150 after the margin alignment in this PR.
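Worked example of the Unreal Speech assertion change for a 1,000-character input (values follow the migration above; attribute and variable names in the real tests may differ):

```python
text = "x" * 1_000

old_expected = len(text)             # 1000, previously typed as 'characters'
new_expected = len(text) * 0.000016  # $0.016 of provider cost, typed as 'cost_usd'

assert old_expected == 1_000
assert abs(new_expected - 0.016) < 1e-12
```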
⚡ Risk Assessment —
| Files | Summary |
|---|---|
| Replicate Block Cost Tracking Refactor<br>autogpt_platform/backend/backend/blocks/replicate/replicate_block.py, replicate_block_cost_test.py | Migrated ReplicateModelBlock from flat RUN billing to dynamic COST_USD via predictions.async_create + predict_time metrics. Emits provider_cost = predict_time * $0.0014/sec; handles version-pinned vs unpinned refs, status validation, and gracefully skips billing when metrics are unavailable. |
| Unreal Speech Block Cost Migration<br>autogpt_platform/backend/backend/blocks/text_to_speech_block.py | Changed Unreal Speech billing from flat 5 credits to per-character USD ($0.000016/char). The block now emits provider_cost = len(text) * 0.000016 with the cost_usd type for proportional billing. |
| Block Cost Configuration Audit<br>autogpt_platform/backend/backend/data/block_cost_config.py | Audited and corrected stale LLM rates (Grok, DeepSeek, Mistral, Kimi, Perplexity). Normalized COST_USD block margins to the 150 cr/$ baseline (Jina 100→150, ZeroBounce 250→150, Unreal 5→150). Updated the FAL video rate from 3 to 15 credits/second. Added detailed pricing comments for transparency. |
| Test Updates for Cost Changes<br>autogpt_platform/backend/backend/blocks/block_cost_tracking_test.py, autogpt_platform/backend/backend/data/block_cost_config_test.py, autogpt_platform/backend/backend/executor/block_usage_cost_test.py, autogpt_platform/backend/backend/copilot/tools/helpers_test.py | Updated test assertions to reflect the new cost models: Unreal Speech character-based USD billing, the FAL video 15 cr/s rate, the ZeroBounce 150 cr/$ margin, and Replicate provider_cost emissions. |
| Secrets Baseline Maintenance<br>.secrets.baseline | Updated line number reference and timestamp due to code additions in replicate_block.py. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant ReplicateBlock
    participant ReplicateClient
    participant Prediction
    participant StatsResolver

    User->>ReplicateBlock: run_model(model_ref, inputs, api_key)
    ReplicateBlock->>ReplicateBlock: Parse model_ref for ':'
    alt version-pinned
        ReplicateBlock->>ReplicateClient: predictions.async_create(version=...)
    else unpinned
        ReplicateBlock->>ReplicateClient: predictions.async_create(model=...)
    end
    ReplicateClient-->>Prediction: return Prediction object
    ReplicateBlock->>Prediction: async_wait()
    Prediction-->>ReplicateBlock: metrics populated
    ReplicateBlock->>ReplicateBlock: Check status
    alt status == 'failed' or 'canceled'
        ReplicateBlock-->>User: raise RuntimeError
    else status == 'succeeded'
        alt metrics.predict_time exists and > 0
            ReplicateBlock->>ReplicateBlock: merge_stats(provider_cost=predict_time*0.0014, cost_usd)
            ReplicateBlock->>StatsResolver: emit NodeExecutionStats
        end
        ReplicateBlock->>ReplicateBlock: extract_result(prediction.output)
        ReplicateBlock-->>User: return result
    end
```
Dig Deeper With Commands
- /review <file-path> <function-optional>
- /chat <file-path> "<question>"
- /roast <file-path>
Runs only when explicitly triggered.
Mirror of upstream Significant-Gravitas#12912 for benchmark. Do not merge.
Summary by MergeMonkey