API Best Practices

Token Estimation

A rough rule of thumb: 1 token equals ~4 characters of English text, or ~1-2 Chinese characters. For code, 1 token is typically 2-4 characters. Most chat messages of a few sentences consume 100-500 tokens depending on length and language.

If you're building a cost-sensitive application, count tokens before sending requests using tiktoken or a similar library. OpenAI's tokenizer also works well for DeepSeek since both use similar tokenization schemes.

System Prompts and Token Efficiency

System prompts are powerful but they consume input tokens on every request. Keep them concise. If you have a complex, reusable instruction set, consider:

Using a short system prompt with specific task instructions in the user message
Summarizing long conversations before sending to the API
Using cache-hit tokens for repeated patterns in system prompts

Example efficient system prompt:

You are a concise technical assistant. Respond directly. No preamble.

Handling Rate Limits

DeepSeek imposes rate limits per token. When you hit HTTP 429, implement exponential backoff:

import time, random
for attempt in range(5):
response = call_api(...)
if response.status_code != 429:
break
wait = (2 ** attempt) + random.uniform(0, 1)
time.sleep(wait)

Never spin-wait in a loop. Always use exponential backoff with jitter.

Choosing Between deepseek-chat and deepseek-reasoner

deepseek-chat (v4-flash): General-purpose chat. Fast, cost-effective. Use for: code generation, content writing, Q&A, summarization, translation, most tasks.

deepseek-reasoner (v4-flash reasoning): Extended chain-of-thought reasoning. Slower but more thorough. Use for: complex multi-step problems, mathematical proofs, strategic analysis, debugging tricky issues.

For most production applications, deepseek-chat is the better default choice.

Context Management for Long Conversations

DeepSeek supports up to 1M token context windows, but longer contexts cost more and can reduce response quality. For ongoing conversations:

Periodically summarize and truncate conversation history
Use sliding window approaches (keep last N messages)
For retrieval-augmented tasks, fetch relevant context on-demand rather than preloading everything

This keeps costs predictable and improves model focus.

Error Handling

Common error codes and how to handle them:

HTTP 400 Bad Request: Malformed JSON or invalid parameters. Check your request body.
HTTP 401 Unauthorized: Invalid or expired token. Check your API key.
HTTP 429 Too Many Requests: Rate limit hit. Back off and retry.
HTTP 500/502/503 Server Error: DeepSeek-side issue. Retry after a delay.

Always log errors with full context (request parameters, error body) for debugging.

FAQ

What's the difference between deepseek-chat and deepseek-reasoner?

deepseek-chat is the standard fast model. deepseek-reasoner uses extended chain-of-thought for complex reasoning tasks but costs the same.

How do I monitor my usage?

Use the /v4/usage endpoint or visit the Order Lookup page. Each API call consumes tokens from your balance.

Can I use streaming responses?

Yes. Pass 'stream': true in your request body and handle the SSE stream in your client.

Is there a way to cap maximum spend?

Currently not built-in. We recommend monitoring your usage regularly and topping up as needed.