You can use a VLM if you really want an API. That being said, there are a lot of OCR models that can run on CPU (albeit slowly).
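For example, a minimal sketch of CPU-only OCR with Hugging Face transformers; the TrOCR checkpoint and image path here are just illustrative picks, any CPU-friendly OCR model works the same way:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Loads on CPU by default; no GPU required (just slow on large batches).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("scan.png").convert("RGB")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```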
~50B active params per token isn't very efficient... Wonder if we could set this to 4: https://huggingface.co/inclusionAI/Ling-1T/blob/main/config.json#L22
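Back-of-envelope on what that would buy, assuming the linked config line is the top-k routing field (e.g. `num_experts_per_tok`) and that some fixed share of the active params is dense (attention, shared experts) rather than routed; the split below is a made-up placeholder, not Ling-1T's real breakdown:

```python
# Rough MoE arithmetic: active params per token scale with the number of
# routed experts, plus a fixed dense remainder. All numbers are assumptions.
total_active = 50e9   # ~50B active per token (from the comment above)
top_k = 8             # assumed current num_experts_per_tok
dense_share = 10e9    # assumed non-routed (attention/shared) portion

per_expert = (total_active - dense_share) / top_k  # active params per routed expert

new_top_k = 4
new_active = dense_share + per_expert * new_top_k
print(f"~{new_active / 1e9:.0f}B active per token with top-{new_top_k}")
# ~30B under these made-up numbers; the routed portion halves, the dense part doesn't.
```

The catch, of course, is that the model was trained with its original top-k, so just editing the config would likely cost quality without finetuning.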