---
license: apache-2.0
datasets:
- Bingsu/Human_Action_Recognition
library_name: transformers
language:
- en
base_model:
- google/siglip2-base-patch16-224
pipeline_tag: image-classification
tags:
- Human-Action-Recognition
---

![zfdggzdrg.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/DPx-7s4BmG_XocnPQ4TR9.png)
# **Human-Action-Recognition**

> **Human-Action-Recognition** is an image-classification model fine-tuned from the vision-language encoder **google/siglip2-base-patch16-224** for multi-class human action recognition. It uses the **SiglipForImageClassification** architecture to predict human activities from still images.

```py
Classification Report:
                    precision    recall  f1-score   support

           calling     0.8525    0.7571    0.8020       840
          clapping     0.8679    0.7119    0.7822       840
           cycling     0.9662    0.9857    0.9758       840
           dancing     0.8302    0.8381    0.8341       840
          drinking     0.9093    0.8714    0.8900       840
            eating     0.9377    0.9131    0.9252       840
          fighting     0.9034    0.7905    0.8432       840
           hugging     0.9065    0.9000    0.9032       840
          laughing     0.7854    0.8583    0.8203       840
listening_to_music     0.8494    0.7988    0.8233       840
           running     0.8888    0.9321    0.9099       840
           sitting     0.5945    0.7226    0.6523       840
          sleeping     0.8593    0.8214    0.8399       840
           texting     0.8195    0.6702    0.7374       840
      using_laptop     0.6610    0.9190    0.7689       840

          accuracy                         0.8327     12600
         macro avg     0.8421    0.8327    0.8339     12600
      weighted avg     0.8421    0.8327    0.8339     12600
```
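
The numbers above can be regenerated with scikit-learn's `classification_report` once predictions have been collected over a labeled evaluation split. The snippet below is a minimal sketch; `y_true` and `y_pred` are placeholder lists standing in for the real ground-truth and predicted class IDs, which are not reproduced here.

```python
# Minimal sketch: producing a report in the format shown above.
# y_true / y_pred are placeholders for class IDs gathered by running the
# model over a labeled evaluation split (not included in this card).
from sklearn.metrics import classification_report

labels = [
    "calling", "clapping", "cycling", "dancing", "drinking", "eating",
    "fighting", "hugging", "laughing", "listening_to_music", "running",
    "sitting", "sleeping", "texting", "using_laptop",
]

y_true = [0, 2, 5, 11]   # placeholder ground-truth class IDs
y_pred = [0, 2, 5, 14]   # placeholder predicted class IDs

print(classification_report(
    y_true, y_pred,
    labels=list(range(len(labels))),
    target_names=labels,
    digits=4,
    zero_division=0,
))
```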

![download.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/O9ir2VwHirB-T75ABCP7m.png)

The model categorizes images into 15 action classes (a minimal label-lookup sketch follows the list):

- **0:** calling  
- **1:** clapping  
- **2:** cycling  
- **3:** dancing  
- **4:** drinking  
- **5:** eating  
- **6:** fighting  
- **7:** hugging  
- **8:** laughing  
- **9:** listening_to_music  
- **10:** running  
- **11:** sitting  
- **12:** sleeping  
- **13:** texting  
- **14:** using_laptop  
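
For a quick top-1 prediction without a UI, the short sketch below maps the argmax logit back to a class name via `model.config.id2label`; if the config only stores generic `LABEL_i` names, fall back to the table above. The image path is a placeholder.

```python
# Minimal top-1 sketch (no Gradio). "example.jpg" is a placeholder path.
from transformers import AutoImageProcessor, SiglipForImageClassification
from PIL import Image
import torch

model_name = "prithivMLmods/Human-Action-Recognition"
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```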

---

# **Run with Transformers 🤗**

```python
!pip install -q transformers torch pillow gradio
```

```python
import gradio as gr
from transformers import AutoImageProcessor, SiglipForImageClassification
from PIL import Image
import torch

# Load model and processor
model_name = "prithivMLmods/Human-Action-Recognition"  # Model repo on the Hugging Face Hub
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

# ID to Label mapping
id2label = {
    0: "calling",
    1: "clapping",
    2: "cycling",
    3: "dancing",
    4: "drinking",
    5: "eating",
    6: "fighting",
    7: "hugging",
    8: "laughing",
    9: "listening_to_music",
    10: "running",
    11: "sitting",
    12: "sleeping",
    13: "texting",
    14: "using_laptop"
}

def classify_action(image):
    """Predicts the human action in the image."""
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    predictions = {id2label[i]: round(probs[i], 3) for i in range(len(probs))}
    return predictions

# Gradio interface
iface = gr.Interface(
    fn=classify_action,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(label="Action Prediction Scores"),
    title="Human Action Recognition",
    description="Upload an image to recognize the human action (e.g., dancing, calling, sitting, etc.)."
)

# Launch the app
if __name__ == "__main__":
    iface.launch()
```
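
If a GPU is available, inference can be moved off the CPU with a small, generic PyTorch change; nothing below is specific to this model.

```python
# Optional: generic PyTorch device handling for the snippet above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Inside classify_action, move the inputs to the same device before the forward pass:
# inputs = {k: v.to(device) for k, v in inputs.items()}
```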

---

# **Intended Use**

The **Human-Action-Recognition** model is designed to detect and classify human actions from images. Example applications:

- **Surveillance & Monitoring:** Recognizing suspicious or specific activities in public spaces.  
- **Sports Analytics:** Identifying player activities or movements.  
- **Social Media Insights:** Understanding trends in user-posted visuals.  
- **Healthcare:** Monitoring elderly people or patients for activity patterns.  
- **Robotics & Automation:** Enabling context-aware AI systems with visual understanding.