Whisper Large V3 Fine-tuned on Nepali (OpenSLR 54)

This is a Nepali ASR model fine-tuned on the OpenSLR 54 dataset (~154 hours of speech). It uses the whisper-large-v3 architecture (1.55 billion parameters) and is aimed at accurate transcription of complex vocabulary, numbers, and dates.

Model Details

  • Model: Whisper Large V3 (1.55B Parameters)
  • Dataset: 157,000 Nepali Audio Utterances (154 Hours)
  • Language: Nepali
  • Fine-tuning Hardware: NVIDIA A100 80GB

Metrics

  • Final WER: 22.31%
  • Validation Loss: 0.0927

Note: While the raw WER is higher than the Medium model, the Large model demonstrates superior handling of numbers, dates, and English loanwords.
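WER (word error rate) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. In practice a library such as jiwer is used; the sketch below is a minimal self-contained version for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + cost,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference gives a WER of 1/3 (about 33%).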

Usage

```python
from transformers import pipeline

# Load the fine-tuned model (a GPU is strongly recommended for a 1.55B-parameter model)
transcriber = pipeline(
    "automatic-speech-recognition",
    model="Dragneel/whisper-large-v3-nepali-openslr",
    device="cuda",
)

# Transcribe a Nepali audio file
transcription = transcriber("path_to_nepali_audio.mp3")
print(transcription["text"])
```
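Whisper's feature extractor expects 16 kHz mono audio. The pipeline handles decoding and resampling when given a file path, but if you pass a raw waveform array you should resample it yourself first. Below is a minimal linear-interpolation sketch (real pipelines typically use librosa or torchaudio, which apply proper anti-aliasing filters):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive linear-interpolation resample of a mono waveform to 16 kHz."""
    target_sr = 16000
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    # Sample times for the original and target grids
    t_orig = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_target = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(t_target, t_orig, audio)
```

The resampled array can then be passed to the pipeline as `{"array": audio_16k, "sampling_rate": 16000}`.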

This research was supported by the High Performance Computing (HPC) facility at Tribhuvan University, Nepal. We acknowledge the Supercomputer Centre for providing the computational resources required for this work.
