DeepSeek launched V4 Pro and V4 Flash with thinking mode controls
DeepSeek added deepseek-v4-pro and deepseek-v4-flash to its API model catalog on April 24, 2026. Both models support thinking and non-thinking modes, 1M token context, 384K max output, JSON output, tool calls, and FIM in non-thinking mode. Official pricing lists V4 Flash at $0.14 cache-miss input / $0.028 cache-hit input / $0.28 output per 1M tokens, and V4 Pro at $1.74 cache-miss input / $0.145 cache-hit input / $3.48 output per 1M tokens.
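The thinking controls described above can be sketched as an OpenAI-format request. The model ID and the `thinking` / `reasoning_effort` fields follow the announcement, but the exact request schema and endpoint path are assumptions, not confirmed API details.

```python
# Sketch of a thinking-mode request against an OpenAI-compatible DeepSeek
# endpoint. The `thinking` and `reasoning_effort` fields follow the launch
# notes; treat the precise body shape as an assumption.
import json
import os
import urllib.request

API_URL = "https://api.deepseek.com/chat/completions"  # assumed path

def build_payload(prompt: str, *, thinking: bool = True, effort: str = "high") -> dict:
    """Assemble the request body with explicit thinking-mode controls."""
    payload = {
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": prompt}],
        # OpenAI-format toggle: thinking.type is "enabled" or "disabled".
        "thinking": {"type": "enabled" if thinking else "disabled"},
    }
    if thinking:
        payload["reasoning_effort"] = effort  # "high" or "max"
    return payload

def ask(prompt: str, **kwargs) -> dict:
    """POST the payload; requires DEEPSEEK_API_KEY in the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```

Keeping the effort field out of the body when thinking is disabled avoids sending a reasoning control the non-thinking mode cannot honor.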
Impact summary
- New API models: deepseek-v4-flash and deepseek-v4-pro
- Both support thinking and non-thinking modes; thinking defaults to enabled
- OpenAI-format controls: thinking.type enabled/disabled and reasoning_effort high/max
- Anthropic-format effort control: output_config.effort high/max
- Context window: 1M tokens; max output: 384K tokens
- Features: JSON output, tool calls, chat prefix completion, and FIM in non-thinking mode
- Pricing: V4 Flash $0.14 cache-miss input / $0.028 cache-hit input / $0.28 output per 1M tokens
- Pricing: V4 Pro $1.74 cache-miss input / $0.145 cache-hit input / $3.48 output per 1M tokens
- deepseek-chat and deepseek-reasoner remain compatibility aliases for V4 Flash modes and are marked for future deprecation
- Add explicit model IDs instead of relying on deepseek-chat or deepseek-reasoner aliases.
- For agentic coding and complex reasoning, benchmark V4 Pro with reasoning_effort=max against your current frontier route.
- For high-volume and latency-sensitive workloads, test V4 Flash with thinking disabled and enabled separately.
- If using tool calls in thinking mode, persist and pass back reasoning_content on subsequent requests to avoid 400 responses.
- Update budget rules to account for cache-hit and cache-miss input pricing separately.
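The reasoning_content requirement for thinking-mode tool calls can be sketched as a small helper that persists the assistant's reasoning alongside its tool calls before appending tool results. The field names follow the announcement; the message shapes are assumptions based on OpenAI-format tool calling.

```python
# Sketch of the reasoning_content round-trip for thinking-mode tool calls.
# Message shapes are assumed from OpenAI-format tool calling conventions.

def append_tool_turn(messages: list, assistant_msg: dict, tool_results: list) -> list:
    """Persist the assistant turn (including reasoning_content) plus its
    tool results, so the follow-up request is not rejected."""
    turn = {
        "role": "assistant",
        "content": assistant_msg.get("content"),
        "tool_calls": assistant_msg["tool_calls"],
    }
    # Dropping reasoning_content here is what triggers the 400 responses
    # called out above.
    if "reasoning_content" in assistant_msg:
        turn["reasoning_content"] = assistant_msg["reasoning_content"]
    messages.append(turn)
    for call, result in zip(assistant_msg["tool_calls"], tool_results):
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": result,
        })
    return messages
```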
Recommended actions
- Add explicit routing entries for deepseek-v4-pro and deepseek-v4-flash instead of relying only on the deepseek-chat or deepseek-reasoner aliases.
- Benchmark V4 Pro with reasoning_effort=max for agentic coding and complex reasoning before promoting it to a frontier route.
- For thinking-mode tool calls, store the assistant's reasoning_content alongside its tool_calls and pass it back in subsequent requests.
- Update cost dashboards to distinguish cache-hit input, cache-miss input, and output tokens for both V4 tiers.
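The cache-aware cost split can be sketched as a minimal helper using the per-1M-token prices listed above:

```python
# USD per 1M tokens, per the launch pricing:
# (cache-miss input, cache-hit input, output)
PRICES = {
    "deepseek-v4-flash": (0.14, 0.028, 0.28),
    "deepseek-v4-pro": (1.74, 0.145, 3.48),
}

def request_cost(model: str, miss_tokens: int, hit_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, splitting input by cache status."""
    miss, hit, out = PRICES[model]
    return (miss_tokens * miss + hit_tokens * hit + output_tokens * out) / 1_000_000
```

Tracking the three token counts separately matters because cache-hit input on V4 Pro is twelve times cheaper than cache-miss input, so a blended input rate would badly misprice cache-heavy workloads.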
Sources
- Official DeepSeek social announcement of the V4 Pro and V4 Flash launch.
- DeepSeek model catalog listing deepseek-v4-flash and deepseek-v4-pro with 1M context, 384K max output, and cache-hit input, cache-miss input, and output token pricing.
- DeepSeek documentation of thinking mode toggles, reasoning effort controls, reasoning_content behavior, and tool-call handling for deepseek-v4-pro.
- DeepSeek description of V4 Pro and V4 Flash as MoE models with 1M token context and configurable non-think, high, and max reasoning modes.