---
license: apache-2.0
base_model:
- PaddlePaddle/PaddleOCR-VL
base_model_relation: quantized
---

# deepseek-ocr.rs πŸš€

Rust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages multiple OCR backends, prompt tooling, and a serving layer so you can build document-understanding pipelines that run locally on CPU, Apple Metal, or (alpha) NVIDIA CUDA GPUs.

> For documentation in Chinese, see [README_CN.md](README_CN.md).

> Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the [build-binaries workflow artifacts](https://github.com/TimmyOVO/deepseek-ocr.rs/actions/workflows/build-binaries.yml). Grab them from the newest green run.

## Choosing a Model πŸ”¬

| Model | Memory footprint* | Best on | When to pick it |
| --- | --- | --- | --- |
| **DeepSeek‑OCR** | **β‰ˆ6.3β€―GB** FP16 weights, **β‰ˆ13β€―GB** RAM/VRAM with cache & activations (512-token budget) | Apple Silicon + Metal (FP16), high-VRAM NVIDIA GPUs, 32β€―GB+ RAM desktops | Highest accuracy, SAM+CLIP global/local context, MoE DeepSeek‑V2 decoder (3β€―B params, ~570β€―M active per token). Use when latency is secondary to quality. |
| **PaddleOCR‑VL** | **β‰ˆ4.7β€―GB** FP16 weights, **β‰ˆ9β€―GB** RAM/VRAM with cache & activations | 16β€―GB laptops, CPU-only boxes, mid-range GPUs | Dense 0.9β€―B Ernie decoder with SigLIP vision tower. Faster startup, lower memory, great for batch jobs or lightweight deployments. |

\*Measured from the default FP16 safetensors. Runtime footprint varies with sequence length.

Guidance:

- **Need maximum fidelity, multi-region reasoning, or already have 16–24β€―GB of VRAM?** Use **DeepSeek‑OCR**. The hybrid SAM+CLIP tower plus DeepSeek‑V2 MoE decoder handles complex layouts best, but expect higher memory use and latency.
- **Deploying to CPU-only nodes, 16β€―GB laptops, or latency-sensitive services?** Choose **PaddleOCR‑VL**. Its dense Ernie decoder (18 layers, hidden size 1024) activates fewer parameters per token and keeps memory under 10β€―GB while staying close in quality on most documents.

## Why Rust? πŸ’‘

The original DeepSeek-OCR ships as a Python + Transformers stackβ€”powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:

- Smaller deployable artifacts with zero Python runtime or conda baggage.
- Memory-safe, thread-friendly infrastructure that blends into native Rust backends.
- Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.
- Drop-in compatibility with OpenAI-style clients, tuned for single-turn OCR prompts.

## Technical Stack βš™οΈ

- **Candle** for tensor compute, with Metal and CUDA backends and FlashAttention support.
- **Rocket** + async streaming for OpenAI-compatible `/v1/responses` and `/v1/chat/completions`.
- **tokenizers** (upstream DeepSeek release) wrapped by `crates/assets` for deterministic caching via Hugging Face and ModelScope mirrors.
- **Pure Rust vision/prompt pipeline** shared by the CLI and server to avoid duplicated logic.

## Advantages over the Python Release πŸ₯·

- Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.
- Deterministic dual-source (Hugging Face + ModelScope) asset download and verification built into the workspace.
- Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.
- Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.

## Highlights ✨

- **One repo, two entrypoints** – a batteries-included CLI for batch jobs and a Rocket-based server that speaks `/v1/responses` and `/v1/chat/completions`.
- **Works out of the box** – pulls model weights, configs, and tokenizer from whichever of Hugging Face or ModelScope responds fastest on first run.
- **Optimised for Apple Silicon** – optional Metal backend with FP16 execution for real-time OCR on laptops.
- **CUDA (alpha)** – experimental support via `--features cuda` + `--device cuda --dtype f16`; expect rough edges while we finish kernel coverage.
- **Intel MKL (preview)** – faster BLAS on x86 via `--features mkl` (install Intel oneMKL beforehand).
- **OpenAI client compatibility** – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.

## Quick Start 🏁

### Prerequisites

- Rust 1.78+ (edition 2024 support); quick version checks are sketched after this list
- Git
- Optional: Apple Silicon running macOS 13+ for Metal acceleration
- Optional: CUDA 12.2+ toolkit + driver for experimental NVIDIA GPU acceleration on Linux/Windows
- Optional: Intel oneAPI MKL for preview x86 acceleration (see below)
- (Recommended) Hugging Face account with `HF_TOKEN` when pulling from the `deepseek-ai/DeepSeek-OCR` repo (ModelScope is used automatically when it’s faster/reachable).

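The checks below are optional and purely illustrative; they use standard tooling rather than anything shipped in this repo, so skip any backend you do not plan to build:

```bash
# Confirm the toolchain matches the prerequisites above
rustc --version    # expect 1.78 or newer
cargo --version
nvcc --version     # only relevant if you plan to build with --features cuda (needs CUDA 12.2+)
```
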
### Clone the Workspace

```bash
git clone https://github.com/TimmyOVO/deepseek-ocr.rs.git
cd deepseek-ocr.rs
cargo fetch
```

### Model Assets

The first invocation of the CLI or server downloads the config, tokenizer, and `model-00001-of-000001.safetensors` (~6.3GB) into `DeepSeek-OCR/`. To prefetch manually:

```bash
cargo run -p deepseek-ocr-cli --release -- --help  # dev profile is extremely slow; always prefer --release
```

> Always include `--release` when running from source; debug builds of this model are extremely slow.

Set `HF_HOME` if you keep Hugging Face caches elsewhere and `HF_TOKEN` if the repository requires authentication (ModelScope downloads land alongside the same asset tree). The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).

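For example, to authenticate and relocate the cache before the first download (the path and token below are placeholders, not values from this repo):

```bash
# Placeholders: substitute your own cache path and token
export HF_HOME=/mnt/models/hf-cache    # optional: relocate the Hugging Face cache
export HF_TOKEN=hf_xxxxxxxxxxxx        # optional: only needed if the repo requires authentication
cargo run -p deepseek-ocr-cli --release -- --help   # triggers the asset prefetch
```
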
## Configuration & Overrides πŸ—‚οΈ

The CLI and server share the same configuration. On first launch we create a `config.toml` populated with defaults; later runs reuse it so both entrypoints stay in sync.

| Platform | Config file (default) | Model cache root |
| --- | --- | --- |
| Linux | `~/.config/deepseek-ocr/config.toml` | `~/.cache/deepseek-ocr/models/<id>/…` |
| macOS | `~/Library/Application Support/deepseek-ocr/config.toml` | `~/Library/Caches/deepseek-ocr/models/<id>/…` |
| Windows | `%APPDATA%\deepseek-ocr\config.toml` | `%LOCALAPPDATA%\deepseek-ocr\models\<id>\…` |

- Override the location with `--config /path/to/config.toml` (available on both the CLI and server). Missing files are created automatically (see the example after this list).
- Each `[models.entries."<id>"]` record can point to custom `config`, `tokenizer`, or `weights` files. When omitted we fall back to the cache directory above and download/update assets as required.
- Runtime values resolve in this order: command-line flags β†’ values stored in `config.toml` β†’ built-in defaults. The HTTP API adds a final layer where request payload fields (for example `max_tokens`) override everything else for that call.

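As a sketch of that resolution order, the run below pins a project-local config file (the path and prompt are placeholders) and then overrides one of its stored values on the command line:

```bash
# A missing config file is created with defaults on first use
deepseek-ocr-cli --config ./ocr-config.toml \
  --prompt "<image> Summarise" --image page.png \
  --max-new-tokens 256   # the CLI flag wins over the value stored in ocr-config.toml
```
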
The generated file starts with the defaults below; adjust them to persistently change behaviour:

```toml
[models]
active = "deepseek-ocr"

[models.entries.deepseek-ocr]

[inference]
device = "cpu"
template = "plain"
base_size = 1024
image_size = 640
crop_mode = true
max_new_tokens = 512
use_cache = true

[server]
host = "0.0.0.0"
port = 8000
```

- `[models]` picks the active model and lets you add more entries (each entry can point to its own config/tokenizer/weights).
- `[inference]` controls the defaults shared by the CLI and server (device, template, vision sizing, decoding budget, cache usage).
- `[server]` sets the network binding and the model identifier reported by `/v1/models`.

See `crates/cli/README.md` and `crates/server/README.md` for concise override tables.

## Benchmark Snapshot πŸ“Š

Single-request Rust CLI (Accelerate backend on macOS) compared with the reference Python pipeline on the same prompt and image:

| Stage | ref total (ms) | ref avg (ms) | python total (ms) | python/ref |
|---------------------------------------------------|----------------|--------------|-------------------|------------|
| Decode – Overall (`decode.generate`) | 30077.840 | 30077.840 | 56554.873 | 1.88x |
| Decode – Token Loop (`decode.iterative`) | 26930.216 | 26930.216 | 39227.974 | 1.46x |
| Decode – Prompt Prefill (`decode.prefill`) | 3147.337 | 3147.337 | 5759.684 | 1.83x |
| Prompt – Build Tokens (`prompt.build_tokens`) | 0.466 | 0.466 | 45.434 | 97.42x |
| Prompt – Render Template (`prompt.render`) | 0.005 | 0.005 | 0.019 | 3.52x |
| Vision – Embed Images (`vision.compute_embeddings`)| 6391.435 | 6391.435 | 3953.459 | 0.62x |
| Vision – Prepare Inputs (`vision.prepare_inputs`) | 62.524 | 62.524 | 45.438 | 0.73x |

## Command-Line Interface πŸ–₯️

Build and run directly from the workspace:

```bash
cargo run -p deepseek-ocr-cli --release -- \
    --prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
    --image baselines/sample/images/test.png \
    --device cpu --max-new-tokens 512
```

> Tip: `--release` is required for reasonable throughput; debug builds can be 10x slower.

> macOS tip: append `--features metal` to the `cargo run`/`cargo build` commands to compile with the Accelerate + Metal backends.
>
> CUDA tip (Linux/Windows): append `--features cuda` and run with `--device cuda --dtype f16` to target NVIDIA GPUsβ€”the feature is still alpha, so be ready for quirks.
>
> Intel MKL preview: install Intel oneMKL, then build with `--features mkl` for faster CPU matmuls on x86.

Install the CLI as a binary:

```bash
cargo install --path crates/cli
deepseek-ocr-cli --help
```

Key flags:

- `--prompt` / `--prompt-file`: text with `<image>` slots
- `--image`: path(s) matching the `<image>` placeholders
- `--device` and `--dtype`: choose `metal` + `f16` on Apple Silicon or `cuda` + `f16` on NVIDIA GPUs
- `--max-new-tokens`: decoding budget
- Sampling controls: `--do-sample`, `--temperature`, `--top-p`, `--top-k`, `--repetition-penalty`, `--no-repeat-ngram-size`, `--seed`
  - By default decoding stays deterministic (`do_sample=false`, `temperature=0.0`, `no_repeat_ngram_size=20`)
  - To use stochastic sampling, set `--do-sample true --temperature 0.8` and optionally adjust the other knobs (see the sketch after this list)

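A quick sketch of both decoding modes, using the flags listed above (the image path is a placeholder; the sampling values are just illustrative):

```bash
# Deterministic decoding (the default): nothing to set beyond prompt, image, and budget
deepseek-ocr-cli --prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
  --image receipt.png --max-new-tokens 512

# Stochastic sampling: opt in explicitly and pin a seed for reproducibility
deepseek-ocr-cli --prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
  --image receipt.png --max-new-tokens 512 \
  --do-sample true --temperature 0.8 --top-p 0.9 --seed 42
```
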
### Switching Models

The autogenerated `config.toml` now contains two model entries:

- `deepseek-ocr` (default) – the original DeepSeek vision-language stack.
- `paddleocr-vl` – the PaddleOCR-VL 0.9B SigLIP + Ernie release.

Pick which one to load via `--model`:

```bash
deepseek-ocr-cli --model paddleocr-vl --prompt "<image> Summarise"
```

The CLI (and server) will download the matching config/tokenizer/weights from the appropriate repository (`deepseek-ai/DeepSeek-OCR` or `PaddlePaddle/PaddleOCR-VL`) into your cache on first use. You can still override paths with `--model-config`, `--tokenizer`, or `--weights` if you maintain local fine-tunes, as sketched below.

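For instance, a hypothetical local fine-tune could be wired in like this (every path and file name below is made up; use whatever your fine-tune actually produced):

```bash
deepseek-ocr-cli --model paddleocr-vl \
  --model-config ./finetunes/paddleocr-vl/config.json \
  --tokenizer ./finetunes/paddleocr-vl/tokenizer.json \
  --weights ./finetunes/paddleocr-vl/model.safetensors \
  --prompt "<image> Summarise" --image invoice.png
```
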
## HTTP Server ☁️

Launch an OpenAI-compatible endpoint:

```bash
cargo run -p deepseek-ocr-server --release -- \
    --host 0.0.0.0 --port 8000 \
    --device cpu --max-new-tokens 512
```

> Keep `--release` on the server as well; the debug profile is far too slow for inference workloads.
>
> macOS tip: add `--features metal` to the `cargo run -p deepseek-ocr-server` command when you want the server binary to link against Accelerate + Metal (and pair it with `--device metal` at runtime).
>
> CUDA tip: add `--features cuda` and start the server with `--device cuda --dtype f16` to offload inference to NVIDIA GPUs (alpha-quality support).
>
> Intel MKL preview: install Intel oneMKL before building with `--features mkl` to accelerate CPU workloads on x86.

Notes:

- Use `data:` URLs or remote `http(s)` links for images; local paths are rejected.
- The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.
- Works out of the box with tools such as [Open WebUI](https://github.com/open-webui/open-webui) or any OpenAI-compatible clientβ€”just point the base URL to your server (`http://localhost:8000/v1`) and select either the `deepseek-ocr` or `paddleocr-vl` model ID exposed in `/v1/models`. A raw `curl` request is sketched after this list.
- Adjust the request body limit with Rocket config if you routinely send large images.

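The sketch below sends a single-turn OCR request with a `data:` URL image. The field names follow the standard OpenAI chat-completions schema; whether the `<image>` placeholder is needed in the text depends on the server's prompt templating (it is included here to mirror the CLI prompts), and `receipt.png` is a placeholder input.

```bash
# Encode a local image as a data: URL, then post it to the local server.
# Reading base64 from stdin avoids platform-specific flags.
IMG=$(base64 < receipt.png | tr -d '\n')
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "deepseek-ocr",
  "max_tokens": 512,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "<image>\nConvert this receipt to markdown."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG}"}}
    ]
  }]
}
EOF
```
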
![Open WebUI connected to deepseek-ocr.rs](./baselines/sample_1.png)

## GPU Acceleration ⚑

- **Metal (macOS 13+ Apple Silicon)** – pass `--device metal --dtype f16` and build binaries with `--features metal` so Candle links against Accelerate + Metal.
- **CUDA (alpha, NVIDIA GPUs)** – install the CUDA 12.2+ toolkit, build with `--features cuda`, and launch the CLI/server with `--device cuda --dtype f16`; still experimental.
- **Intel MKL (preview)** – install Intel oneMKL and build with `--features mkl` to speed up CPU workloads on x86.
- For any backend, prefer release builds (e.g. `cargo build --release -p deepseek-ocr-cli --features metal|cuda`) to maximise throughput; see the sketch after this list.
- Combine GPU runs with `--max-new-tokens` and crop tuning flags to balance latency vs. quality.

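Concretely, the release builds per backend look like this (pick the feature that matches your hardware):

```bash
# Apple Silicon (Metal)
cargo build --release -p deepseek-ocr-cli --features metal
# NVIDIA GPUs (alpha CUDA backend)
cargo build --release -p deepseek-ocr-cli --features cuda
# x86 CPUs with Intel oneMKL installed (preview)
cargo build --release -p deepseek-ocr-cli --features mkl
# The server package builds the same way, e.g.
cargo build --release -p deepseek-ocr-server --features metal
```
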
## Repository Layout πŸ—‚οΈ

- `crates/core` – shared inference pipeline, model loaders, conversation templates.
- `crates/cli` – command-line frontend (`deepseek-ocr-cli`).
- `crates/server` – Rocket server exposing OpenAI-compatible endpoints.
- `crates/assets` – asset management (configuration, tokenizer, Hugging Face + ModelScope download helpers).
- `baselines/` – reference inputs and outputs for regression testing.

Detailed CLI usage lives in [`crates/cli/README.md`](crates/cli/README.md). The server’s OpenAI-compatible interface is covered in [`crates/server/README.md`](crates/server/README.md).

## Troubleshooting πŸ› οΈ

- **Where do assets come from?** – downloads automatically pick between Hugging Face and ModelScope based on latency; the CLI prints the chosen source for each file.
- **Slow first response** – model load and GPU warm-up (Metal/CUDA alpha) happen on the initial request; later runs are faster.
- **Large image rejection** – increase the Rocket JSON limits in `crates/server/src/main.rs` or downscale the input; a configuration-based sketch follows this list.

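If the server keeps Rocket's default configuration provider (an assumption; check how the Rocket instance is built in `crates/server/src/main.rs`), the JSON body limit can also be raised from the environment instead of editing code:

```bash
# Assumes the default Rocket figment, which reads Rocket.toml and ROCKET_* env vars
ROCKET_LIMITS='{json="20MiB"}' \
  cargo run -p deepseek-ocr-server --release -- --host 0.0.0.0 --port 8000
```
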
## Roadmap πŸ—ΊοΈ

- βœ… Apple Metal backend with FP16 support and CLI/server parity on macOS.
- βœ… NVIDIA CUDA backend (alpha) – build with `--features cuda`, run with `--device cuda --dtype f16` for Linux/Windows GPUs; polishing in progress.
- πŸ”„ **Parity polish** – finish projector normalisation + crop tiling alignment; extend the intermediate-tensor diff suite beyond the current sample baseline.
- πŸ”„ **Grounding & streaming** – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.
- πŸ”„ **Cross-platform acceleration** – continue tuning CUDA kernels, add automatic device detection across CPU/Metal/CUDA, and publish opt-in GPU benchmarks.
- πŸ”„ **Packaging & Ops** – ship binary releases with deterministic asset checksums, richer logging/metrics, and Helm/Docker references for server deploys.
- πŸ”œ **Structured outputs** – optional JSON schema tools for downstream automation once parity gaps close.

## License πŸ“„

This repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. Refer to `DeepSeek-OCR/LICENSE` for the model terms and apply the same restrictions to downstream use.