---
license: mit
base_model: MCG-NJU/videomae-base
tags:
- video-classification
- crime-detection
- violence-detection
- videomae
- computer-vision
- security
- surveillance
- generated_from_trainer
language:
- en
datasets:
- jinmang2/ucf_crime
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: video-classification
model-index:
- name: test-upload-model
  results:
  - task:
      name: Violence Detection
      type: video-classification
    dataset:
      name: UCF Crime Dataset (Subset)
      type: jinmang2/ucf_crime
      args: violence_detection
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.5000
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.5000
    - name: F1
      type: f1
      value: 0.3333
---
|
|
|
|
|
# Nikeytas/Test Upload Model |
|
|
|
|
|
This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the UCF Crime dataset with **event-based binary classification**. It achieves the following results on the evaluation set: |
|
|
|
|
|
- **Loss**: 0.5847 |
|
|
- **Accuracy**: 0.5000 |
|
|
- **Precision**: 0.2500 |
|
|
- **Recall**: 0.5000 |
|
|
- **F1 Score**: 0.3333 |
|
|
|
|
|
## Model Overview
|
|
|
|
|
This VideoMAE model has been fine-tuned for **binary violence detection** in video content. The model classifies videos into two categories: |
|
|
- **Violent Crime** (1): Videos containing violent criminal activities |
|
|
- **Non-Violent Incident** (0): Videos with non-violent or normal activities |
|
|
|
|
|
The model is based on the **VideoMAE architecture** and has been specifically trained on a curated subset of the UCF Crime dataset with event-based categorization for realistic crime detection scenarios. |
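The 0/1 indices above should match the repository's `id2label` config. Before relying on class indices, it can be worth confirming the mapping directly (a minimal sketch; the exact label strings are whatever the uploaded config defines):

```python
from transformers import AutoConfig

# Fetch only the model config and print the label mapping; verify the
# 0 = non-violent / 1 = violent convention rather than assuming it.
config = AutoConfig.from_pretrained("Nikeytas/test-upload-model")
print(config.id2label)
print(config.label2id)
```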
|
|
|
|
|
## Dataset & Training
|
|
|
|
|
### Dataset Composition |
|
|
|
|
|
**Total Videos**: 20 |
|
|
- **Violent Crime Videos**: 10 |
|
|
- **Non-Violent Incident Videos**: 10 |
|
|
|
|
|
**Class Balance**: 50.0% violent crimes |
|
|
|
|
|
**Event Distribution**: |
|
|
- **Arrest**: 10 videos


- **Arson**: 10 videos
|
|
|
|
|
**Data Splits** (a reconstruction sketch follows this list):
|
|
- **Training**: 12 videos |
|
|
- **Validation**: 4 videos |
|
|
- **Test**: 4 videos |
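For reference, here is a hedged sketch of how a stratified 12/4/4 split over 20 clips could be reconstructed. The file names, labels, and `random_state` are placeholders, not the original split:

```python
from sklearn.model_selection import train_test_split

# Placeholder paths and labels: 10 violent (1) and 10 non-violent (0) clips.
video_paths = [f"clip_{i:02d}.mp4" for i in range(20)]
labels = [1] * 10 + [0] * 10

# Hold out 8 clips, then divide them evenly into validation and test.
train_paths, held_paths, train_labels, held_labels = train_test_split(
    video_paths, labels, test_size=8, stratify=labels, random_state=0
)
val_paths, test_paths, val_labels, test_labels = train_test_split(
    held_paths, held_labels, test_size=4, stratify=held_labels, random_state=0
)
print(len(train_paths), len(val_paths), len(test_paths))  # 12 4 4
```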
|
|
|
|
|
## Performance
|
|
|
|
|
### Performance Metrics |
|
|
|
|
|
**Validation Performance**: |
|
|
- **eval_loss**: 0.5847 |
|
|
- **eval_accuracy**: 0.5000 |
|
|
- **eval_precision**: 0.2500 |
|
|
- **eval_recall**: 0.5000 |
|
|
- **eval_f1**: 0.3333 |
|
|
- **eval_runtime**: 0.6636 |
|
|
- **eval_samples_per_second**: 6.0270 |
|
|
- **eval_steps_per_second**: 3.0140 |
|
|
- **epoch**: 1.0000 |
|
|
|
|
|
**Test Performance**: |
|
|
- **eval_loss**: 0.6700 |
|
|
- **eval_accuracy**: 0.5000 |
|
|
- **eval_precision**: 0.2500 |
|
|
- **eval_recall**: 0.5000 |
|
|
- **eval_f1**: 0.3333 |
|
|
- **eval_runtime**: 0.4271 |
|
|
- **eval_samples_per_second**: 9.3660 |
|
|
- **eval_steps_per_second**: 4.6830 |
|
|
- **epoch**: 1.0000 |
|
|
|
|
|
**Training Information**: |
|
|
- **Training Time**: 0.1 minutes |
|
|
- **Best Accuracy Achieved**: 0.5000 |
|
|
- **Model Architecture**: VideoMAE Base (fine-tuned) |
|
|
- **Fine-tuning Approach**: Event-based binary classification |
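The accuracy, precision, recall, and F1 values reported above can be produced with a standard `compute_metrics` hook for the HF `Trainer`. A minimal sketch follows; macro averaging is an assumption (it is consistent with the reported numbers, but the original run may have configured this differently):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) for the evaluation set.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # zero_division=0 avoids warnings when a class is never predicted.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```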
|
|
|
|
|
## Training Procedure
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
|
|
- **Learning Rate**: 5e-05 |
|
|
- **Train Batch Size**: 2 |
|
|
- **Eval Batch Size**: 2 |
|
|
- **Optimizer**: AdamW with betas=(0.9,0.999) and epsilon=1e-08 |
|
|
- **LR Scheduler Type**: Linear |
|
|
- **Training Epochs**: 1 |
|
|
- **Weight Decay**: 0.01 |
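Expressed as HF `TrainingArguments`, the configuration above looks roughly like this; `output_dir` and any setting not listed in the card are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./videomae-crime-detection",  # assumed, not from the card
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    adam_beta1=0.9,    # AdamW settings as listed above
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```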
|
|
|
|
|
### Training Results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | |
|
|
|---------------|-------|------|-----------------|----------| |
|
|
| 0.5 | 1.00 | N/A | 0.5847 | 0.5000 | |
|
|
|
|
|
### Framework Versions |
|
|
|
|
|
- **Transformers**: 4.30.2+ |
|
|
- **PyTorch**: 2.0.1+ |
|
|
- **Datasets**: latest release at training time
|
|
- **Device**: Apple Silicon MPS / CUDA / CPU (Auto-detected) |
|
|
|
|
|
## Quick Start
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch torchvision opencv-python pillow |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForVideoClassification, AutoProcessor |
|
|
import cv2 |
|
|
import numpy as np |
|
|
|
|
|
# Load model and processor |
|
|
model = AutoModelForVideoClassification.from_pretrained("Nikeytas/test-upload-model") |
|
|
processor = AutoProcessor.from_pretrained("Nikeytas/test-upload-model") |
|
|
|
|
|
# Process video |
|
|
def classify_video(video_path, num_frames=16): |
|
|
# Extract frames |
|
|
cap = cv2.VideoCapture(video_path) |
|
|
frames = [] |
|
|
|
|
|
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) |
|
|
indices = np.linspace(0, total_frames - 1, num_frames, dtype=int) |
|
|
|
|
|
for idx in indices: |
|
|
cap.set(cv2.CAP_PROP_POS_FRAMES, idx) |
|
|
ret, frame = cap.read() |
|
|
if ret: |
|
|
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) |
|
|
frames.append(frame_rgb) |
|
|
|
|
|
    cap.release()

    # Pad with the last frame if decoding dropped any frames, so the clip
    # always contains exactly num_frames frames for the processor.
    while frames and len(frames) < num_frames:
        frames.append(frames[-1])
|
|
|
|
|
# Process with model |
|
|
inputs = processor(frames, return_tensors="pt") |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) |
|
|
predicted_class = torch.argmax(predictions, dim=-1).item() |
|
|
confidence = predictions[0][predicted_class].item() |
|
|
|
|
|
label = "Violent Crime" if predicted_class == 1 else "Non-Violent" |
|
|
return label, confidence |
|
|
|
|
|
# Example usage |
|
|
video_path = "path/to/your/video.mp4" |
|
|
prediction, confidence = classify_video(video_path) |
|
|
print(f"Prediction: {prediction} (Confidence: {confidence:.3f})") |
|
|
``` |
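If `decord` is available for video decoding, the high-level `pipeline` API is a shorter alternative to the manual frame loop above (a sketch; the pipeline's default frame sampling may differ from the 16-frame logic shown here):

```python
from transformers import pipeline

# Builds a video-classification pipeline around the same checkpoint.
classifier = pipeline("video-classification", model="Nikeytas/test-upload-model")

# Returns the top class labels with confidence scores for the clip.
print(classifier("path/to/your/video.mp4"))
```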
|
|
|
|
|
### Batch Processing |
|
|
|
|
|
```python |
|
|
import os |
|
|
from pathlib import Path |
|
|
|
|
|
def process_video_directory(video_dir, output_file="results.txt"): |
|
|
results = [] |
|
|
|
|
|
for video_file in Path(video_dir).glob("*.mp4"): |
|
|
try: |
|
|
prediction, confidence = classify_video(str(video_file)) |
|
|
results.append({ |
|
|
"file": video_file.name, |
|
|
"prediction": prediction, |
|
|
"confidence": confidence |
|
|
}) |
|
|
print(f"β
{video_file.name}: {prediction} ({confidence:.3f})") |
|
|
except Exception as e: |
|
|
print(f"β Error processing {video_file.name}: {e}") |
|
|
|
|
|
# Save results |
|
|
with open(output_file, "w") as f: |
|
|
for result in results: |
|
|
f.write(f"{result['file']}: {result['prediction']} ({result['confidence']:.3f})\n") |
|
|
|
|
|
return results |
|
|
|
|
|
# Process all videos in a directory |
|
|
results = process_video_directory("./videos/") |
|
|
``` |
|
|
|
|
|
## Technical Specifications
|
|
|
|
|
- **Base Model**: MCG-NJU/videomae-base |
|
|
- **Architecture**: Vision Transformer (ViT) adapted for video |
|
|
- **Input Resolution**: 224x224 pixels per frame |
|
|
- **Temporal Resolution**: 16 frames per video clip (see the shape-check sketch after this list)
|
|
- **Output Classes**: 2 (Binary classification) |
|
|
- **Training Framework**: HuggingFace Transformers |
|
|
- **Optimization**: AdamW optimizer with learning rate 5e-5 |
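A quick sanity check of the expected input layout (a minimal sketch with dummy frames; VideoMAE's `pixel_values` are shaped `(batch, frames, channels, height, width)`):

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Nikeytas/test-upload-model")

# 16 dummy RGB frames at 224x224, matching the spec above.
dummy_video = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
inputs = processor(dummy_video, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 16, 3, 224, 224])
```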
|
|
|
|
|
## Limitations
|
|
|
|
|
1. **Dataset Scope**: Trained on a subset of UCF Crime dataset - may not generalize to all types of violence |
|
|
2. **Temporal Context**: Uses 16-frame clips which may miss context in longer sequences |
|
|
3. **Environmental Bias**: Performance may vary with different lighting, camera angles, and video quality |
|
|
4. **False Positives**: May misclassify intense but non-violent activities (sports, action movies) |
|
|
5. **Real-time Performance**: Processing time depends on hardware capabilities |
|
|
|
|
|
## Ethical Considerations
|
|
|
|
|
### Intended Use |
|
|
- **Primary**: Research and development in video analysis |
|
|
- **Secondary**: Security system enhancement with human oversight |
|
|
- **Educational**: Computer vision and AI safety research |
|
|
|
|
|
### Prohibited Uses |
|
|
- **Surveillance without consent**: Do not use for unauthorized monitoring |
|
|
- **Discriminatory profiling**: Avoid bias against specific groups or communities |
|
|
- **Automated punishment**: Never use for automated legal or disciplinary actions |
|
|
- **Privacy violation**: Respect privacy laws and individual rights |
|
|
|
|
|
### Bias and Fairness |
|
|
- The model was trained on a small, curated subset of UCF Crime that may not represent all populations
|
|
- Regular evaluation needed for bias detection and mitigation |
|
|
- Human oversight required for critical applications |
|
|
- Consider demographic representation in deployment scenarios |
|
|
|
|
|
## Model Card Information
|
|
|
|
|
- **Developed by**: Research Team |
|
|
- **Model Type**: Video Classification (Binary) |
|
|
- **Training Data**: UCF Crime Dataset (Subset) |
|
|
- **Training Date**: 2025-06-08 15:19:08 UTC |
|
|
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score |
|
|
- **Intended Users**: Researchers, Security Professionals, Developers |
|
|
|
|
|
## Citation
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{Nikeytas_test_upload_model, |
|
|
title={VideoMAE Fine-tuned for Crime Detection}, |
|
|
author={Research Team}, |
|
|
  year={2025},
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/Nikeytas/test-upload-model} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contributing
|
|
|
|
|
We welcome contributions to improve the model! Please: |
|
|
1. Report issues with specific examples |
|
|
2. Suggest improvements for bias reduction |
|
|
3. Share evaluation results on new datasets |
|
|
4. Contribute to documentation and examples |
|
|
|
|
|
## Contact
|
|
|
|
|
For questions, issues, or collaboration opportunities, please open an issue in the model repository or contact the development team. |
|
|
|
|
|
--- |
|
|
|
|
|
*Last updated: 2025-06-08 15:19:08 UTC* |
|
|
*Model version: 1.0* |
|
|
*Framework: HuggingFace Transformers* |
|
|
|