StepFun has been focused on multimodal AI from the very beginning. Their latest release is a new foundational model: STEP3-VL🔥
https://huggingface.co/collections/stepfun-ai/step3-vl-10b
✨ 10B - Apache 2.0
✨ Leads in the 10B class and competes with models 10–20× larger
✨ Hybrid Architecture: combined autoregressive + diffusion design delivers strong semantic alignment with high-fidelity details
✨ Strong performance in long, dense, and multilingual text rendering
✨ MIT licensed (VQ tokenizer & ViT weights under Apache 2.0)
✨ Now live on Hugging Face inference provider 🤗