---
license: apache-2.0
datasets:
- Bingsu/Human_Action_Recognition
library_name: transformers
language:
- en
base_model:
- google/siglip2-base-patch16-224
pipeline_tag: image-classification
tags:
- Human-Action-Recognition
---

# **Human-Action-Recognition**
> **Human-Action-Recognition** is an image-classification model fine-tuned from the vision-language encoder **google/siglip2-base-patch16-224** for multi-class human action recognition. It uses the **SiglipForImageClassification** architecture to predict human activities from still images.
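
For a quick end-to-end check, the checkpoint also works with the generic Transformers `pipeline` API (a minimal sketch; `"test.jpg"` is a placeholder path for a local image):

```python
from transformers import pipeline

# Image-classification pipeline; downloads the checkpoint on first use
classifier = pipeline("image-classification", model="prithivMLmods/Human-Action-Recognition")

# "test.jpg" is a placeholder; a local path, URL, or PIL image all work here
print(classifier("test.jpg"))
```

Per-class results on the evaluation set (840 images per class, 12,600 total):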
```
Classification Report:

                    precision    recall  f1-score   support

           calling     0.8525    0.7571    0.8020       840
          clapping     0.8679    0.7119    0.7822       840
           cycling     0.9662    0.9857    0.9758       840
           dancing     0.8302    0.8381    0.8341       840
          drinking     0.9093    0.8714    0.8900       840
            eating     0.9377    0.9131    0.9252       840
          fighting     0.9034    0.7905    0.8432       840
           hugging     0.9065    0.9000    0.9032       840
          laughing     0.7854    0.8583    0.8203       840
listening_to_music     0.8494    0.7988    0.8233       840
           running     0.8888    0.9321    0.9099       840
           sitting     0.5945    0.7226    0.6523       840
          sleeping     0.8593    0.8214    0.8399       840
           texting     0.8195    0.6702    0.7374       840
      using_laptop     0.6610    0.9190    0.7689       840

          accuracy                         0.8327     12600
         macro avg     0.8421    0.8327    0.8339     12600
      weighted avg     0.8421    0.8327    0.8339     12600
```
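
A report in this format can be reproduced with scikit-learn's `classification_report` (a minimal sketch; how the evaluation labels were gathered is not specified in this card, so the `y_true`/`y_pred` ids below are dummies just to make the snippet runnable):

```python
from sklearn.metrics import classification_report

label_names = [
    "calling", "clapping", "cycling", "dancing", "drinking", "eating",
    "fighting", "hugging", "laughing", "listening_to_music", "running",
    "sitting", "sleeping", "texting", "using_laptop",
]

# In a real evaluation, y_true / y_pred would be the integer class ids
# collected by running the model over the test split; dummies shown here.
y_true = list(range(15))
y_pred = list(range(15))

print(classification_report(y_true, y_pred, target_names=label_names, digits=4))
```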

The model categorizes images into 15 action classes:
- **0:** calling
- **1:** clapping
- **2:** cycling
- **3:** dancing
- **4:** drinking
- **5:** eating
- **6:** fighting
- **7:** hugging
- **8:** laughing
- **9:** listening_to_music
- **10:** running
- **11:** sitting
- **12:** sleeping
- **13:** texting
- **14:** using_laptop
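
The same mapping is usually stored in the checkpoint configuration, so it can be read programmatically instead of hard-coded (a minimal sketch, assuming this fine-tuned config carries `id2label`, as Transformers classification heads normally do):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("prithivMLmods/Human-Action-Recognition")
# JSON round-tripping can turn the keys into strings, so normalize to int
id2label = {int(i): name for i, name in config.id2label.items()}
print(id2label)
```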
---
# **Run with Transformers 🤗**
```python
!pip install -q transformers torch pillow gradio
```
```python
import gradio as gr
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipForImageClassification

# Load model and processor
model_name = "prithivMLmods/Human-Action-Recognition"  # Change to your updated model path
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

# ID to label mapping
id2label = {
    0: "calling",
    1: "clapping",
    2: "cycling",
    3: "dancing",
    4: "drinking",
    5: "eating",
    6: "fighting",
    7: "hugging",
    8: "laughing",
    9: "listening_to_music",
    10: "running",
    11: "sitting",
    12: "sleeping",
    13: "texting",
    14: "using_laptop",
}

def classify_action(image):
    """Predicts the human action in the image."""
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    predictions = {id2label[i]: round(probs[i], 3) for i in range(len(probs))}
    return predictions

# Gradio interface
iface = gr.Interface(
    fn=classify_action,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(label="Action Prediction Scores"),
    title="Human Action Recognition",
    description="Upload an image to recognize the human action (e.g., dancing, calling, sitting, etc.).",
)

# Launch the app
if __name__ == "__main__":
    iface.launch()
```
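
For scripted, non-interactive use, the same `model`, `processor`, and `id2label` objects from the block above can return just the top prediction (a minimal sketch; `"photo.jpg"` is a placeholder path):

```python
from PIL import Image
import torch

image = Image.open("photo.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(f"Predicted action: {id2label[pred_id]}")
```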
---
# **Intended Use**
The **Human-Action-Recognition** model is designed to detect and classify human actions from images. Example applications:
- **Surveillance & Monitoring:** Recognizing suspicious or specific activities in public spaces.
- **Sports Analytics:** Identifying player activities or movements.
- **Social Media Insights:** Understanding trends in user-posted visuals.
- **Healthcare:** Monitoring elderly people or patients for activity patterns.
- **Robotics & Automation:** Enabling context-aware AI systems with visual understanding. |