SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7, 2025 • 205
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation Paper • 2501.09755 • Published Jan 16, 2025 • 35
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision Paper • 2407.06189 • Published Jul 8, 2024 • 27
Open World Object Detection in the Era of Foundation Models Paper • 2312.05745 • Published Dec 10, 2023 • 1
PROB: Probabilistic Objectness for Open World Object Detection Paper • 2212.01424 • Published Dec 2, 2022
VideoAgent: Long-form Video Understanding with Large Language Model as Agent Paper • 2403.10517 • Published Mar 15, 2024 • 37