---
datasets:
- kakaobrain/coyo-700m
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: transformers
---
# MVT-1.5 RICE: Region-based Cluster Discrimination for Visual Representation Learning

## Model Description
This model, RICE (Region-based Cluster Discrimination), introduces a novel method to enhance region-level visual and OCR capabilities in visual representation learning. While recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks such as grounding, OCR, and segmentation.
RICE addresses this gap by first constructing a billion-scale candidate region dataset and proposing a Region Transformer layer to extract rich regional semantics. It further designs a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data.
Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). RICE efficiently processes diverse semantic regions within an image in a single forward pass, jointly capturing both general visual semantics (objects) and OCR semantics (text) and seamlessly integrating them into a unified representation.

This model was trained on the COYO-700M dataset.
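For intuition, below is a minimal, hypothetical sketch of the region cluster discrimination idea described above: region features are scored against a large bank of cluster centers and optimized with a classification-style loss. The tensor shapes, the temperature, the `cluster_centers` bank, and the plain cross-entropy objective are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: region features are discriminated against a large bank
# of cluster centers with a classification-style loss. Shapes, names, and the
# plain cross-entropy objective are illustrative assumptions only.
num_regions, dim, num_clusters = 32, 1024, 100_000

region_features = F.normalize(torch.randn(num_regions, dim), dim=-1)   # region embeddings
cluster_centers = F.normalize(torch.randn(num_clusters, dim), dim=-1)  # cluster prototypes
cluster_labels = torch.randint(0, num_clusters, (num_regions,))        # assigned cluster ids

logits = region_features @ cluster_centers.T / 0.07  # cosine logits with a temperature
loss = F.cross_entropy(logits, cluster_labels)       # one loss for object and OCR regions
```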
## Evaluation Results
The results below are from linear probe evaluations, demonstrating the model's performance on various benchmarks.
| Dataset | CLIP | RICE |
|---|---|---|
| Food101 | 88.8 | 90.2 |
| CIFAR10 | 95.1 | 96.9 |
| CIFAR100 | 80.5 | 86.8 |
| Birdsnap | 58.5 | 72.1 |
| SUN397 | 76.6 | 77.4 |
| Cars | 81.8 | 93.5 |
| Aircraft | 52.0 | 74.7 |
| VOC2007 | 87.7 | 90.4 |
| DTD | 76.5 | 83.5 |
| Pets | 90.0 | 93.6 |
| Cal101 | 93.0 | 97.7 |
| Flowers | 96.9 | 98.8 |
| ImageNet | 76.1 | 79.1 |
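As a rough illustration of the linear-probe protocol, the sketch below fits a logistic-regression classifier on frozen image features. The mean-pooled features, the scikit-learn classifier, and its hyperparameters are assumptions for illustration, not the exact setup behind the numbers above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume features were extracted with the frozen model (see "How to use") and
# pooled to one vector per image; the arrays and classifier settings below are
# illustrative assumptions, not the exact protocol used for the table above.
X_train = np.random.randn(1000, 1024)           # [num_images, feature_dim]
y_train = np.random.randint(0, 101, size=1000)  # e.g. Food101 labels
X_test = np.random.randn(200, 1024)
y_test = np.random.randint(0, 101, size=200)

probe = LogisticRegression(max_iter=1000, C=1.0)  # linear classifier on frozen features
probe.fit(X_train, y_train)
print(f"Linear probe top-1 accuracy: {probe.score(X_test, y_test):.3f}")
```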
## How to use

### 1. Standard Usage (using the unicom library)
```python
# Install dependencies:
#   pip install torch transformers
#   git clone https://github.com/deepglint/unicom
#   cd unicom/mlcd

from vit_rope2d_hf import MLCDVisionModel
from transformers import CLIPImageProcessor
from PIL import Image
import requests
import torch

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")
```
### 2. Using HuggingFace Transformers (>= 4.51.3)
```python
# pip install torch "transformers>=4.51.3"

from transformers import AutoProcessor, AutoModel
from PIL import Image
import requests
import torch

# Load model and processor
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[0]

print(f"Extracted features shape: {features.shape}")
```
## Visualize Semantic Features

Using 2048-resolution images as input to a ViT-B/16 model, we project token features onto RGB channels via PCA to visualize the semantic structure. Sequential frames (arranged vertically) illustrate the evolution of model attention, consistently highlighting salient objects across time. The visualization reveals stable color patterns for tracked entities such as ice skaters, deer, motorcyclists, and cyclists, demonstrating the model's ability to maintain semantic focus throughout the sequence.
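A minimal sketch of this kind of PCA-to-RGB visualization is shown below: patch-token features are reduced to their top three principal components and rendered as an RGB map. The removal of a single class token, the SVD-based projection, and the min-max scaling are assumptions for illustration and may differ from the exact procedure used for the figures.

```python
import numpy as np

# Hypothetical sketch of the PCA-to-RGB visualization described above.
# `features` is assumed to be token features from a forward pass, shaped
# [1, 1 + num_patches, hidden_dim] with a single leading class token.
def pca_to_rgb(features: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    tokens = features[0, 1:, :]                       # drop the class token (assumption)
    tokens = tokens - tokens.mean(axis=0, keepdims=True)
    # Project onto the top-3 principal components via SVD.
    _, _, vt = np.linalg.svd(tokens, full_matrices=False)
    rgb = tokens @ vt[:3].T                           # [num_patches, 3]
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-6)  # scale to [0, 1]
    return rgb.reshape(grid_h, grid_w, 3)

# Example: a 2048x2048 input to a ViT-B/16 yields a 128x128 patch grid.
# image_rgb = pca_to_rgb(features.numpy(), 128, 128)
```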
## ModelZoo
| Model | Download |
|---|---|
| RICE-ViT-L-14-560px | huggingface |
| MLCD-ViT-bigG-14-448px | huggingface |
| MLCD-ViT-L-14-336px | huggingface |
| MLCD-ViT-B-32-224px | huggingface |
The authors are from the DeepGlint team and the Huawei London Research Institute.
## Citation
If you find our work helpful or inspiring, please feel free to cite it.
```bibtex
@inproceedings{yinxie_2025_rice,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}
```