
Improve model card: Add pipeline tag, library, abstract & usage

#1 by nielsr (HF Staff) - opened

Files changed (1)
  1. README.md +109 -5
README.md CHANGED
@@ -1,14 +1,30 @@
 ---
- license: apache-2.0
 datasets:
 - kakaobrain/coyo-700m
 ---

- [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

- This model is trained using the COYO700M dataset. The results below are from linear probe evaluations, demonstrating the model's performance on various benchmarks.

- | Dataset | CLIP | MLCD |
 |-----------|------|------|
 | Food101 | 88.8 | <span style="color:red">90.2</span> |
 | CIFAR10 | 95.1 | <span style="color:red">96.9</span> |
@@ -22,4 +38,92 @@ This model is trained using the COYO700M dataset. The results below are from lin
 | Pets | 90.0 | <span style="color:red">93.6</span> |
 | Cal101 | 93.0 | <span style="color:red">97.7</span> |
 | Flowers | 96.9 | <span style="color:red">98.8</span> |
- | ImageNet | 76.1 | <span style="color:red">79.1</span> |

 ---
 datasets:
 - kakaobrain/coyo-700m
+ license: apache-2.0
+ pipeline_tag: image-feature-extraction
+ library_name: transformers
 ---

+ # MVT-1.5 RICE: Region-based Cluster Discrimination for Visual Representation Learning
+
+ [[Paper]](https://arxiv.org/abs/2507.20025) [[GitHub]](https://github.com/deepglint/unicom)
+
+ ![RICE Model Highlights](https://github.com/user-attachments/assets/e0de38b3-b20a-491e-9382-1839e9968481)
+
+ ## Model Description
+ This model, **RICE (Region-Aware Cluster Discrimination)**, introduces a novel method to enhance region-level visual and OCR capabilities in visual representation learning. While recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks such as grounding, OCR, and segmentation.
+
+ RICE addresses this gap by first constructing a billion-scale candidate region dataset and proposing a Region Transformer layer to extract rich regional semantics. It further designs a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data.
+
+ Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). RICE processes the diverse semantic regions of an image in a single forward pass, jointly capturing general visual semantics (objects) and OCR semantics (text) and integrating them into a unified representation.
+
+ This model is trained on the COYO-700M dataset.
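For intuition only, here is a minimal PyTorch sketch of a cluster-discrimination-style objective over region embeddings: each region feature is classified against a bank of learnable cluster centers with a cosine-margin softmax. This is an illustrative sketch, not the RICE implementation; the actual loss, margins, and distributed classifier are defined in the paper and the GitHub repository, and the names `RegionClusterDiscrimination`, `margin`, and `scale` below are hypothetical.

```python
# Illustrative sketch (NOT the official RICE loss): cosine-margin classification of
# region embeddings against a bank of cluster centers, one way to realize a
# "region cluster discrimination" objective in a single classification framework.
import torch
import torch.nn.functional as F

class RegionClusterDiscrimination(torch.nn.Module):
    def __init__(self, embed_dim: int, num_clusters: int, margin: float = 0.3, scale: float = 32.0):
        super().__init__()
        # One learnable center per cluster (object or OCR cluster alike).
        self.centers = torch.nn.Parameter(torch.randn(num_clusters, embed_dim) * 0.02)
        self.margin, self.scale = margin, scale

    def forward(self, region_feats: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, embed_dim); cluster_ids: (num_regions,)
        logits = F.normalize(region_feats, dim=-1) @ F.normalize(self.centers, dim=-1).t()
        # Subtract a margin from the target-cluster cosine before the softmax.
        target_mask = F.one_hot(cluster_ids, logits.shape[1]).bool()
        logits = torch.where(target_mask, logits - self.margin, logits)
        return F.cross_entropy(self.scale * logits, cluster_ids)

# Toy usage: 8 region embeddings of width 1024 assigned to 1000 hypothetical clusters.
loss_fn = RegionClusterDiscrimination(embed_dim=1024, num_clusters=1000)
loss = loss_fn(torch.randn(8, 1024), torch.randint(0, 1000, (8,)))
print(loss.item())
```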
 
+ ## Evaluation Results
+ The results below are from linear probe evaluations, demonstrating the model's performance on various benchmarks.
 
+ | Dataset | CLIP | RICE |
 |-----------|------|------|
 | Food101 | 88.8 | <span style="color:red">90.2</span> |
 | CIFAR10 | 95.1 | <span style="color:red">96.9</span> |
 
 | Pets | 90.0 | <span style="color:red">93.6</span> |
 | Cal101 | 93.0 | <span style="color:red">97.7</span> |
 | Flowers | 96.9 | <span style="color:red">98.8</span> |
+ | ImageNet | 76.1 | <span style="color:red">79.1</span> |
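A linear probe trains only a linear classifier on frozen features. The sketch below illustrates that setup on a small CIFAR-10 subset, using the Transformers loading shown in the next section; the mean-pooling choice, subset size, and scikit-learn classifier are assumptions, and the official protocol behind the numbers above may differ.

```python
# Hedged sketch of a linear probe on frozen features (not the exact protocol behind
# the table): mean-pool patch tokens, then fit a logistic-regression classifier.
import torch
from torchvision.datasets import CIFAR10
from transformers import AutoModel, AutoProcessor
from sklearn.linear_model import LogisticRegression

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560").eval()
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

@torch.no_grad()
def extract_features(train: bool, limit: int = 2000):
    ds = CIFAR10(root="./data", train=train, download=True)
    feats, labels = [], []
    for idx in range(min(limit, len(ds))):
        image, label = ds[idx]                       # PIL image, int label
        inputs = processor(images=image, return_tensors="pt")
        tokens = model(**inputs).last_hidden_state   # (1, num_tokens, dim)
        feats.append(tokens.mean(dim=1).squeeze(0))  # mean-pool into one vector
        labels.append(label)
    return torch.stack(feats).numpy(), labels

x_train, y_train = extract_features(train=True)
x_test, y_test = extract_features(train=False)
probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print("Linear-probe accuracy (subset):", probe.score(x_test, y_test))
```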
+
+ ## How to use
+
+ ### 1. Standard Usage (using the `unicom` library)
+
+ ```python
+ # Install dependencies:
+ #   pip install torch transformers
+ #   git clone https://github.com/deepglint/unicom
+ #   cd unicom/mlcd
+
+ from vit_rope2d_hf import MLCDVisionModel
+ from transformers import CLIPImageProcessor
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+ processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+
+ # Load and process an image
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ # Extract visual features
+ with torch.no_grad():
+     outputs = model(**inputs)
+     features = outputs.last_hidden_state
+
+ print(f"Extracted features shape: {features.shape}")
+ ```
+
+ ### 2. Using HuggingFace Transformers (>= 4.51.3)
+
+ ```python
+ # pip install torch transformers>=4.51.3
+
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+ processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+
+ # Load and process an image
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ # Extract visual features
+ with torch.no_grad():
+     outputs = model(**inputs)
+     features = outputs.last_hidden_state[0]
+
+ print(f"Extracted features shape: {features.shape}")
+ ```
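The two snippets above return per-token features. If a single embedding per image is needed, e.g. for retrieval, one simple option (not prescribed by the model card, the pooling strategy is an assumption) is to mean-pool the tokens and compare images by cosine similarity, as sketched below.

```python
# Hypothetical follow-up: turn token features into one embedding per image and
# compare two images with cosine similarity (mean pooling is an assumption).
import torch
from PIL import Image
import requests
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560").eval()
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

def embed(url: str) -> torch.Tensor:
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state   # (1, num_tokens, dim)
    return torch.nn.functional.normalize(tokens.mean(dim=1), dim=-1)

a = embed("http://images.cocodataset.org/val2017/000000039769.jpg")
b = embed("http://images.cocodataset.org/val2017/000000000139.jpg")
print("cosine similarity:", (a @ b.t()).item())
```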
+
+ ## Visualize Semantic Features
+
+ ![Semantic Features Visualization](https://github.com/user-attachments/assets/0ff3b764-c5b6-4a10-a63c-89ccbc99d06b)
+
+ Using 2048-resolution images as input to a ViT-B/16 model, we project token features onto RGB channels via PCA to visualize the semantic structure. Sequential frames (arranged vertically) illustrate the evolution of model attention, consistently highlighting salient objects across time. The visualization reveals stable color patterns for tracked entities such as ice skaters, deer, motorcyclists, and cyclists, demonstrating the model's ability to maintain semantic focus throughout the sequence.
+
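For a rough do-it-yourself version of this visualization, the sketch below projects the patch tokens of the checkpoint in this repository onto three principal components and renders them as an RGB image. It does not reproduce the figure's exact setup (ViT-B/16 at 2048-px input), and the assumption that the first token is a class token may need adjusting.

```python
# Hedged sketch of PCA-to-RGB visualization of patch features (not the script used
# for the figure above; backbone, resolution, and token layout are assumptions).
import numpy as np
import requests
import torch
from PIL import Image
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560").eval()
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    tokens = model(**inputs).last_hidden_state[0]         # (num_tokens, dim)

patches = tokens[1:].numpy()                              # assume the first token is a class token
side = int(round(np.sqrt(patches.shape[0])))              # e.g. 560 / 14 = 40 patches per side
rgb = PCA(n_components=3).fit_transform(patches)          # project features onto 3 components
rgb = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0) + 1e-8)
canvas = (rgb.reshape(side, side, 3) * 255).astype(np.uint8)
Image.fromarray(canvas).resize((560, 560), Image.NEAREST).save("pca_rgb.png")
```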
+ ## ModelZoo
+
+ | Model | Download |
+ |-------|-------------|
+ | RICE-ViT-L-14-560px | [huggingface](https://huggingface.co/DeepGlint-AI/rice-vit-large-patch14-560) |
+ | MLCD-ViT-bigG-14-448px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448) |
+ | MLCD-ViT-L-14-336px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) |
+ | MLCD-ViT-B-32-224px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224) |
+
+ The authors are from the DeepGlint team and the Huawei London Research Institute.
+
+ ## Citation
+ If you find our work helpful or inspiring, please feel free to cite it.
+ ```latex
+ @inproceedings{yinxie_2025_rice,
+   title={Region-based Cluster Discrimination for Visual Representation Learning},
+   author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
+   booktitle={ICCV},
+   year={2025}
+ }
+ ```