CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Abstract
Multimodal large language models can effectively understand source code when represented as compressed images, achieving significant token reduction while maintaining or improving performance on code comprehension tasks.
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, and they point toward image-modality code representation as a pathway to more efficient inference.
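To make the core idea concrete, here is a minimal sketch of rendering source code to an image and downscaling it so that the pixel count (and thus, roughly, the vision-token cost) drops by a chosen factor. This is an illustrative assumption of the pipeline, not the paper's implementation: it uses Pillow, a default monospace font, and a hypothetical 4x compression ratio.

```python
# Illustrative sketch (not the paper's code): render code as an image,
# then downscale it to approximate image-modality compression.
from PIL import Image, ImageDraw, ImageFont


def render_code_image(code: str, padding: int = 10) -> Image.Image:
    """Render code text onto a white canvas using Pillow's default font."""
    lines = code.splitlines() or [""]
    # Rough per-character sizing; a real renderer would measure text exactly.
    char_w, line_h = 8, 16
    width = padding * 2 + char_w * max(len(line) for line in lines)
    height = padding * 2 + line_h * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.multiline_text((padding, padding), code, fill="black",
                        font=ImageFont.load_default(), spacing=4)
    return img


def compress(img: Image.Image, ratio: float = 4.0) -> Image.Image:
    """Shrink each side by sqrt(ratio) so total pixels drop by ~ratio."""
    scale = ratio ** 0.5
    new_size = (max(1, int(img.width / scale)), max(1, int(img.height / scale)))
    return img.resize(new_size, Image.LANCZOS)


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    original = render_code_image(snippet)
    compressed = compress(original, ratio=4.0)  # assumed 4x target
    print(original.size, "->", compressed.size)
```

The compressed image would then be passed to a vision-capable model in place of the raw code tokens; how many vision tokens it actually consumes depends on the specific MLLM's image tokenizer.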
Community
Try compressing your code input to LLMs with CodeOCR, at up to an 8x compression ratio!
arXiv Explained breakdown of this paper: https://arxivexplained.com/papers/codeocr-on-the-effectiveness-of-vision-language-models-in-code-understanding
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression (2026)
- VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning (2026)
- Not All Tokens Matter: Data-Centric Optimization for Efficient Code Summarization (2026)
- VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration (2026)
- Global Context Compression with Interleaved Vision-Text Transformation (2026)
- Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR (2026)
- SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend