Quick Start

This repository contains only LoRA adapter weights, so it cannot be loaded directly with AutoModel. Load the base model first, then attach the adapter:

from transformers import Qwen2VLForConditionalGeneration
from peft import PeftModel

# Load base model
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Amirhossein75/qwen2-vl-2b-mmhs150k-lora")

Qwen2-VL LoRA adapter for MMHS150K hateful content classification

This repository contains a LoRA adapter fine-tuned on MMHS150K (Multi-Modal Hate Speech) for multi-label hateful content detection from paired text + image inputs.

The approach follows the project at https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm: instead of training a classification head, the model is prompted to generate a strict JSON array of labels, which is then parsed and scored as multi-label predictions.

Model Details

  • Developed by: Amirhossein Yousefi
  • Model type: LoRA adapter (PEFT) for Qwen2-VL
  • Base model: Qwen/Qwen2-VL-2B-Instruct
  • Task: Multi-label classification via JSON generation (text + image → label list)
  • Labels: racist, sexist, homophobe, religion, otherhate
  • Repository (training code + methodology): https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm

Intended Use

Direct use

  • Hateful content classification for research/experimentation on MMHS150K-like data.
  • Produces a JSON array of zero or more labels from the fixed label set above.

Out-of-scope use

  • Moderation decisions without human review.
  • Domains/languages far from MMHS150K without further validation.

Bias, Risks, and Limitations

  • This model is trained on hate-speech related data; outputs can be sensitive and may reflect dataset/model biases.
  • Generative classification can fail to follow the requested format (non-JSON output, extra text around the array); downstream code should parse the output defensively.
  • The label set is fixed; predicting categories outside this taxonomy is unsupported.
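
The formatting failures noted above can be handled with a small defensive parser. The following is a minimal sketch, not the upstream repository's parser: the function name parse_labels and the choice to return an empty list on failure are assumptions made for illustration.

```python
import json
import re

LABELS = ["racist", "sexist", "homophobe", "religion", "otherhate"]


def parse_labels(generated, allowed=LABELS):
    """Return the allowed labels found in the first JSON array in `generated`.

    Hypothetical helper: returns [] when no parseable array is found.
    """
    # Take the first bracketed span, so text before or after the array is ignored.
    match = re.search(r"\[.*?\]", generated, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Normalize case and whitespace, and drop anything outside the fixed label set.
    found = {item.strip().lower() for item in items if isinstance(item, str)}
    return [label for label in allowed if label in found]
```

Note that this sketch maps a parse failure to the same result as "no labels"; an evaluation pipeline may prefer to count failures separately so they are not silently scored as non-hateful predictions.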

How to Use

Load the adapter (PEFT)

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

base_id = "Qwen/Qwen2-VL-2B-Instruct"
adapter_id = "Amirhossein75/qwen2-vl-2b-mmhs150k-lora"  # this repo

processor = AutoProcessor.from_pretrained(base_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

image = Image.open("path/to/image.jpg").convert("RGB")
text = "Some text to analyze"

labels = ["racist", "sexist", "homophobe", "religion", "otherhate"]
system = "Return JSON only."
user = (
    "Given the image and text, return a JSON array containing zero or more of these labels: "
    + ", ".join([f"\"{l}\"" for l in labels])
)

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": [
        {"type": "text", "text": user + "\n\nText: " + text},
        {"type": "image", "image": image},
    ]},
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True).to(model.device)
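out = model.generate(**inputs, max_new_tokens=64)

# out[0] contains the prompt followed by the completion; decode only the new tokens.
new_tokens = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])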

Training Data

  • Dataset: MMHS150K (Multi-Modal Hate Speech)
  • Expected format (from the associated code repo): a CSV file with text, image_path, and labels columns, plus an images/ directory.
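
As an illustration of the format above, the sketch below reads such a CSV and builds the JSON-array target string the model is trained to generate. The column names come from the description; the encoding of the labels column is not specified here, so the "|" separator, the sample rows, and the helper names row_to_example and load_examples are all assumptions.

```python
import csv
import io
import json

LABELS = ["racist", "sexist", "homophobe", "religion", "otherhate"]


def row_to_example(row, allowed=LABELS, sep="|"):
    """Convert one CSV row into a dict with a JSON-array target string."""
    # ASSUMPTION: the labels column holds label names joined by `sep`.
    raw = row.get("labels") or ""
    names = {part.strip().lower() for part in raw.split(sep) if part.strip()}
    labels = [label for label in allowed if label in names]
    return {
        "text": row["text"],
        "image_path": row["image_path"],
        "target": json.dumps(labels),
    }


def load_examples(csv_file):
    """Read every row from an open CSV file object."""
    return [row_to_example(row) for row in csv.DictReader(csv_file)]


# Made-up sample rows standing in for a real CSV file.
sample = io.StringIO(
    "text,image_path,labels\n"
    "some tweet text,images/0001.jpg,racist|sexist\n"
    "another tweet,images/0002.jpg,\n"
)
examples = load_examples(sample)
```

For a real file, pass an open file handle (opened with newline="") instead of the in-memory sample.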

Training Procedure

  • Method: LoRA (PEFT)
  • LoRA config (from adapter config): rank r=4, lora_alpha=32, lora_dropout=0.05, target modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
  • Objective: Causal LM with instruction prompting; classification is obtained by prompting the model to generate a JSON array of labels, which is then parsed.
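
To make the LoRA hyperparameters above concrete, the sketch below computes the update scaling factor and the number of added parameters for one adapted linear layer. It assumes the standard LoRA scaling of lora_alpha / r (not the rank-stabilized variant); the 1536 x 1536 layer shape is a hypothetical example and is not read from the base model's config.

```python
def lora_scaling(r, lora_alpha):
    """Scaling applied to the low-rank update B @ A under standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in, d_out, r):
    """Parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Values from the adapter config listed above.
scaling = lora_scaling(r=4, lora_alpha=32)

# Hypothetical layer shape, chosen only to illustrate the arithmetic.
extra_params = lora_param_count(d_in=1536, d_out=1536, r=4)
full_params = 1536 * 1536
fraction = extra_params / full_params
```

With r=4 the low-rank update is multiplied by 8, and each adapted layer gains only a small fraction of the parameters of its full weight matrix.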

Hardware Used

As reported in the associated training/evaluation repository (see link above), the Qwen2-VL LoRA/QLoRA runs used the following hardware:

  • GPU: NVIDIA GeForce RTX 3080 Laptop GPU (16GB)
  • Platform: Local Windows
  • Notes: NVIDIA driver 581.57, CUDA 13.0 (per nvidia-smi)

Evaluation

Metrics follow the associated code repo: multi-label scores are computed from the labels parsed out of the generated JSON.

  • Validation (this adapter): micro F1 0.6172, macro F1 0.5077, subset accuracy 0.4366, hamming loss 0.14276
  • Test (this adapter): micro F1 0.6110, macro F1 0.4992
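
For reference, the four metrics reported above can be computed from lists of true and predicted labels as sketched below. This is a plain-Python illustration, not the upstream evaluation code: the function names are hypothetical, and treating F1 as 0 for a label with no true or predicted positives is an assumed convention that may differ from the repository's.

```python
LABELS = ["racist", "sexist", "homophobe", "religion", "otherhate"]


def to_indicator(labels, label_set=LABELS):
    """Turn a list of label names into a 0/1 vector over the fixed label set."""
    return [1 if name in labels else 0 for name in label_set]


def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred, label_set=LABELS):
    """Micro/macro F1, subset accuracy, and Hamming loss for multi-label data."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    k, n = len(label_set), len(y_true)
    tp, fp, fn = [0] * k, [0] * k, [0] * k
    exact_matches = 0
    wrong_bits = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        t = to_indicator(true_labels, label_set)
        p = to_indicator(pred_labels, label_set)
        exact_matches += int(t == p)
        for j in range(k):
            if t[j] and p[j]:
                tp[j] += 1
            elif p[j]:
                fp[j] += 1
            elif t[j]:
                fn[j] += 1
            wrong_bits += int(t[j] != p[j])
    return {
        # Micro F1 pools counts over all labels; macro F1 averages per-label F1.
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        "macro_f1": sum(f1(tp[j], fp[j], fn[j]) for j in range(k)) / k,
        "subset_accuracy": exact_matches / n,
        "hamming_loss": wrong_bits / (n * k),
    }


# Tiny made-up example, not taken from MMHS150K.
scores = multilabel_scores(
    y_true=[["racist"], ["sexist", "religion"], []],
    y_pred=[["racist"], ["sexist"], ["otherhate"]],
)
```

Macro F1 is lower than micro F1 whenever rare labels are predicted poorly, which is consistent with the gap between the two scores reported above.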

License

  • Training/inference code referenced above is released under MIT in the upstream repository.
  • This repository contains an adapter trained from a base model; please follow the base model’s license/terms (Qwen/Qwen2-VL-2B-Instruct) when using the weights.