
Improve model card: Add pipeline tag, library, abstract & usage

#1 by nielsr (HF Staff) - opened

Files changed (1)
  1. README.md +109 -5
README.md CHANGED
@@ -1,14 +1,30 @@
 ---
- license: apache-2.0
 datasets:
 - kakaobrain/coyo-700m
 ---

- [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

- This model is trained using the COYO700M dataset. The results below are from linear probe evaluations, demonstrating the model's performance on various benchmarks.

- | Dataset | CLIP | MLCD |
 |-----------|------|------|
 | Food101 | 88.8 | <span style="color:red">90.2</span> |
 | CIFAR10 | 95.1 | <span style="color:red">96.9</span> |
@@ -22,4 +38,92 @@ This model is trained using the COYO700M dataset. The results below are from lin
 | Pets | 90.0 | <span style="color:red">93.6</span> |
 | Cal101 | 93.0 | <span style="color:red">97.7</span> |
 | Flowers | 96.9 | <span style="color:red">98.8</span> |
- | ImageNet | 76.1 | <span style="color:red">79.1</span> |

 ---
 datasets:
 - kakaobrain/coyo-700m
+ license: apache-2.0
+ pipeline_tag: image-feature-extraction
+ library_name: transformers
 ---

+ # MVT-1.5 RICE: Region-based Cluster Discrimination for Visual Representation Learning
+
+ [[Paper]](https://arxiv.org/abs/2507.20025) [[GitHub]](https://github.com/deepglint/unicom)
+
+ ![RICE Model Highlights](https://github.com/user-attachments/assets/e0de38b3-b20a-491e-9382-1839e9968481)
+
+ ## Model Description
+ This model, **RICE (Region-Aware Cluster Discrimination)**, introduces a novel method to enhance region-level visual and OCR capabilities in visual representation learning. While recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks such as grounding, OCR, and segmentation.
+
+ RICE addresses this gap by first constructing a billion-scale candidate region dataset and proposing a Region Transformer layer to extract rich regional semantics. It further designs a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data.
+
+ Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). RICE processes the diverse semantic regions of an image in a single forward pass, jointly capturing general visual semantics (objects) and OCR semantics (text) and integrating them into a unified representation.
+
+ This model is trained on the COYO-700M dataset.
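For intuition only, here is a minimal PyTorch sketch of a cluster-discrimination-style objective over region embeddings: each region feature is classified against a bank of learnable cluster centers with a cosine-margin softmax. This is an illustrative sketch, not the RICE implementation; the actual loss, margins, and distributed classifier are defined in the paper and the GitHub repository, and the names `RegionClusterDiscrimination`, `margin`, and `scale` below are hypothetical.

```python
# Illustrative sketch (NOT the official RICE loss): cosine-margin classification of
# region embeddings against a bank of cluster centers, one way to realize a
# "region cluster discrimination" objective in a single classification framework.
import torch
import torch.nn.functional as F

class RegionClusterDiscrimination(torch.nn.Module):
    def __init__(self, embed_dim: int, num_clusters: int, margin: float = 0.3, scale: float = 32.0):
        super().__init__()
        # One learnable center per cluster (object or OCR cluster alike).
        self.centers = torch.nn.Parameter(torch.randn(num_clusters, embed_dim) * 0.02)
        self.margin, self.scale = margin, scale

    def forward(self, region_feats: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, embed_dim); cluster_ids: (num_regions,)
        logits = F.normalize(region_feats, dim=-1) @ F.normalize(self.centers, dim=-1).t()
        # Subtract a margin from the target-cluster cosine before the softmax.
        target_mask = F.one_hot(cluster_ids, logits.shape[1]).bool()
        logits = torch.where(target_mask, logits - self.margin, logits)
        return F.cross_entropy(self.scale * logits, cluster_ids)

# Toy usage: 8 region embeddings of width 1024 assigned to 1000 hypothetical clusters.
loss_fn = RegionClusterDiscrimination(embed_dim=1024, num_clusters=1000)
loss = loss_fn(torch.randn(8, 1024), torch.randint(0, 1000, (8,)))
print(loss.item())
```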
 
+ ## Evaluation Results
+ The results below are from linear probe evaluations, demonstrating the model's performance on various benchmarks.
 
+ | Dataset | CLIP | RICE |
 |-----------|------|------|
 | Food101 | 88.8 | <span style="color:red">90.2</span> |
 | CIFAR10 | 95.1 | <span style="color:red">96.9</span> |
 
 | Pets | 90.0 | <span style="color:red">93.6</span> |
 | Cal101 | 93.0 | <span style="color:red">97.7</span> |
 | Flowers | 96.9 | <span style="color:red">98.8</span> |
+ | ImageNet | 76.1 | <span style="color:red">79.1</span> |
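A linear probe trains only a linear classifier on frozen features. The sketch below illustrates that setup on a small CIFAR-10 subset, using the Transformers loading shown in the next section; the mean-pooling choice, subset size, and scikit-learn classifier are assumptions, and the official protocol behind the numbers above may differ.

```python
# Hedged sketch of a linear probe on frozen features (not the exact protocol behind
# the table): mean-pool patch tokens, then fit a logistic-regression classifier.
import torch
from torchvision.datasets import CIFAR10
from transformers import AutoModel, AutoProcessor
from sklearn.linear_model import LogisticRegression

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560").eval()
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

@torch.no_grad()
def extract_features(train: bool, limit: int = 2000):
    ds = CIFAR10(root="./data", train=train, download=True)
    feats, labels = [], []
    for idx in range(min(limit, len(ds))):
        image, label = ds[idx]                       # PIL image, int label
        inputs = processor(images=image, return_tensors="pt")
        tokens = model(**inputs).last_hidden_state   # (1, num_tokens, dim)
        feats.append(tokens.mean(dim=1).squeeze(0))  # mean-pool into one vector
        labels.append(label)
    return torch.stack(feats).numpy(), labels

x_train, y_train = extract_features(train=True)
x_test, y_test = extract_features(train=False)
probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print("Linear-probe accuracy (subset):", probe.score(x_test, y_test))
```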
+
+ ## How to use
+
+ ### 1. Standard Usage (using the `unicom` library)
+
+ ```python
+ # Install dependencies:
+ #   pip install torch transformers
+ #   git clone https://github.com/deepglint/unicom
+ #   cd unicom/mlcd
+
+ from vit_rope2d_hf import MLCDVisionModel
+ from transformers import CLIPImageProcessor
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+ processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+
+ # Load and process an image
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ # Extract visual features
+ with torch.no_grad():
+     outputs = model(**inputs)
+     features = outputs.last_hidden_state
+
+ print(f"Extracted features shape: {features.shape}")
+ ```
+
+ ### 2. Using HuggingFace Transformers (>= 4.51.3)
+
+ ```python
+ # pip install torch transformers>=4.51.3
+
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+ processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+
+ # Load and process an image
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ # Extract visual features
+ with torch.no_grad():
+     outputs = model(**inputs)
+     features = outputs.last_hidden_state[0]
+
+ print(f"Extracted features shape: {features.shape}")
+ ```
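The two snippets above return per-token features. If a single embedding per image is needed, e.g. for retrieval, one simple option (not prescribed by the model card, the pooling strategy is an assumption) is to mean-pool the tokens and compare images by cosine similarity, as sketched below.

```python
# Hypothetical follow-up: turn token features into one embedding per image and
# compare two images with cosine similarity (mean pooling is an assumption).
import torch
from PIL import Image
import requests
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560").eval()
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

def embed(url: str) -> torch.Tensor:
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state   # (1, num_tokens, dim)
    return torch.nn.functional.normalize(tokens.mean(dim=1), dim=-1)

a = embed("http://images.cocodataset.org/val2017/000000039769.jpg")
b = embed("http://images.cocodataset.org/val2017/000000000139.jpg")
print("cosine similarity:", (a @ b.t()).item())
```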
+
+ ## Visualize Semantic Features
+
+ ![Semantic Features Visualization](https://github.com/user-attachments/assets/0ff3b764-c5b6-4a10-a63c-89ccbc99d06b)
+
+ Using 2048-resolution images as input to a ViT-B/16 model, we project token features onto RGB channels via PCA to visualize the semantic structure. Sequential frames (arranged vertically) illustrate the evolution of model attention, consistently highlighting salient objects across time. The visualization reveals stable color patterns for tracked entities such as ice skaters, deer, motorcyclists, and cyclists, demonstrating the model's ability to maintain semantic focus throughout the sequence.
+
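For a rough do-it-yourself version of this visualization, the sketch below projects the patch tokens of the checkpoint in this repository onto three principal components and renders them as an RGB image. It does not reproduce the figure's exact setup (ViT-B/16 at 2048-px input), and the assumption that the first token is a class token may need adjusting.

```python
# Hedged sketch of PCA-to-RGB visualization of patch features (not the script used
# for the figure above; backbone, resolution, and token layout are assumptions).
import numpy as np
import requests
import torch
from PIL import Image
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560").eval()
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    tokens = model(**inputs).last_hidden_state[0]         # (num_tokens, dim)

patches = tokens[1:].numpy()                              # assume the first token is a class token
side = int(round(np.sqrt(patches.shape[0])))              # e.g. 560 / 14 = 40 patches per side
rgb = PCA(n_components=3).fit_transform(patches)          # project features onto 3 components
rgb = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0) + 1e-8)
canvas = (rgb.reshape(side, side, 3) * 255).astype(np.uint8)
Image.fromarray(canvas).resize((560, 560), Image.NEAREST).save("pca_rgb.png")
```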
+ ## ModelZoo
+
+ | Model | Download |
+ |-------|-------------|
+ | RICE-ViT-L-14-560px | [huggingface](https://huggingface.co/DeepGlint-AI/rice-vit-large-patch14-560) |
+ | MLCD-ViT-bigG-14-448px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448) |
+ | MLCD-ViT-L-14-336px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) |
+ | MLCD-ViT-B-32-224px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224) |
+
+ The authors are from the DeepGlint team and the Huawei London Research Institute.
+
+ ## Citation
+ If you find our work helpful or inspiring, please feel free to cite it.
+ ```latex
+ @inproceedings{yinxie_2025_rice,
+   title={Region-based Cluster Discrimination for Visual Representation Learning},
+   author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
+   booktitle={ICCV},
+   year={2025}
+ }
+ ```