---
language:
- en
license: apache-2.0
tags:
- pii
- privacy
- redaction
- text-generation
- granite
pipeline_tag: text-generation
base_model: ibm-granite/granite-4.0-h-micro
datasets:
- ai4privacy/pii-masking-300k
metrics:
- precision
- recall
- f1
library_name: transformers
---

# Sentinel PII Redaction

**State-of-the-art PII detection and redaction model**

Sentinel PII Redaction is a specialized language model fine-tuned to identify and tag Personally Identifiable Information (PII) in text. Built on IBM's Granite 4.0 architecture, it provides high-accuracy PII detection that runs locally on your infrastructure.

## Model Overview

- **Base Model**: IBM Granite 4.0 Micro (3.2B parameters)
- **Task**: PII Detection and Tagging
- **Training Data**: 1,500 examples (1,000 from AI4Privacy PII-masking-300k, 500 synthetic)
- **Performance**: 93.2% to 99.3% recall across the 14 categories reported under Performance Metrics
- **Deployment**: Optimized for local inference (no data leaves your system)
- **License**: Apache 2.0

## Supported PII Categories

The model can identify and tag the following PII categories:

### Identity Information
- `PERSON_NAME` - Full names, first names, last names
- `USERNAME` - User identifiers
- `AGE` - Numerical age
- `GENDER` - Gender identifiers
- `DEMOGRAPHIC_GROUP` - Race, ethnicity

### Contact Information
- `EMAIL_ADDRESS` - Email addresses
- `PHONE_NUMBER` - Phone numbers (various formats)
- `STREET_ADDRESS` - Physical addresses
- `CITY` - City names
- `STATE` - State/province names
- `POSTCODE` - ZIP/postal codes
- `COUNTRY` - Country names

### Dates
- `DATE` - General dates
- `DATE_OF_BIRTH` - Birth dates

### ID Numbers
- `PERSONAL_ID` - SSN, national IDs, subscriber numbers
- `PASSPORT` - Passport numbers
- `DRIVERLICENSE` - Driver's license numbers
- `IDCARD` - ID card numbers
- `SOCIALNUMBER` - Social security numbers

### Financial
- `CREDIT_CARD_INFO` - Credit card numbers
- `BANKING_NUMBER` - Bank account numbers

### Security
- `PASSWORD` - Passwords and credentials
- `SECURE_CREDENTIAL` - API keys, tokens, private keys

### Medical
- `MEDICAL_CONDITION` - Diagnoses, treatments, health information

### Location
- `NATIONALITY` - Country of origin/citizenship
- `GEOCOORD` - GPS coordinates

### Organization
- `ORGANIZATION_NAME` - Company/organization names
- `BUILDING` - Building names/numbers

### Other
- `DOMAIN_NAME` - Internet domains
- `RELIGIOUS_AFFILIATION` - Religious identifiers

## 🚀 Quick Start

### Installation

`accelerate` is included because the `device_map="auto"` argument used in Basic Usage requires it.

```bash
pip install transformers torch accelerate
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "coolAI/sentinel-pii-redaction",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("coolAI/sentinel-pii-redaction")

# Prepare input text
text = "My name is John Smith and my email is john@email.com. I live at 123 Main St, New York, NY 10001."

# Create prompt
messages = [
    {
        "role": "user",
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]

# Tokenize with the model's chat template and move the tensor to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate deterministically (greedy decoding)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt
input_length = inputs.size(1)
generated_ids = outputs[0][input_length:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
```

**Expected Output:**

```
My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE].
```

## 📊 Performance Metrics

Evaluated on the AI4Privacy PII-masking-300k dataset:

### Category-Specific Recall Rates

| Category | Recall | Description |
|----------|--------|-------------|
| **Critical PII** | | |
| PERSONAL_ID | 98.5% | SSN, national IDs |
| DATE_OF_BIRTH | 98.2% | Birth dates |
| CREDIT_CARD_INFO | 97.8% | Credit card numbers |
| PASSWORD | 96.9% | Passwords |
| **Identity** | | |
| PERSON_NAME | 95.4% | Personal names |
| EMAIL_ADDRESS | 97.2% | Email addresses |
| PHONE_NUMBER | 96.5% | Phone numbers |
| USERNAME | 94.8% | User identifiers |
| **Location** | | |
| STREET_ADDRESS | 96.5% | Physical addresses |
| POSTCODE | 99.3% | ZIP/postal codes |
| CITY | 97.6% | City names |
| COUNTRY | 96.1% | Country names |
| **Medical** | | |
| MEDICAL_CONDITION | 93.2% | Health information |
| **Organization** | | |
| ORGANIZATION_NAME | 94.7% | Company names |

*Note: Actual performance may vary based on text format and context.*

## 💡 Use Cases

The snippets in this section assume two helpers that this card does not define: `redact_pii(text)`, which wraps the Basic Usage code and returns the tagged text, and `detect_pii(text)`, which returns a mapping keyed by PII category. Names such as `user_generated_content` and `user_action` stand for data supplied by the caller.

### 1. Data Sanitization for ML Training

Remove PII from datasets before fine-tuning language models:

```python
def sanitize_training_data(texts):
    """Return a redacted copy of every text (redact_pii is an assumed helper)."""
    return [redact_pii(text) for text in texts]

# Use for safe model training
clean_data = sanitize_training_data(user_generated_content)
```

### 2. Compliance & Auditing

Support GDPR, HIPAA, and CCPA compliance workflows by flagging documents that contain PII and producing a redacted copy:

```python
def audit_document(document):
    # detect_pii is an assumed helper returning a dict keyed by PII category
    pii_found = detect_pii(document)
    return {
        "has_pii": len(pii_found) > 0,
        "pii_types": list(pii_found.keys()),
        "redacted_version": redact_pii(document)
    }
```

### 3. Privacy Protection in Logs

Sanitize application logs before storage or analysis:

```python
import logging

logger = logging.getLogger(__name__)

def safe_logging(log_entry):
    return redact_pii(log_entry)

logger.info(safe_logging(user_action))
```

## 🔧 Advanced Usage

### With Custom PII Categories

Guide the model by specifying which PII categories to focus on:

```python
categories = """
PII Categories to identify:
- PERSON_NAME: Names of people
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers
- MEDICAL_CONDITION: Health information
- PERSONAL_ID: ID numbers (SSN, national IDs, etc.)
"""

# `text` is the input string defined in Basic Usage
messages = [
    {
        "role": "user",
        "content": f"{categories}\n\nIdentify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]
```

### Batch Processing

Process multiple texts in fixed-size batches:

```python
def batch_redact(texts, batch_size=8):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # Placeholder: redact each text with the assumed redact_pii helper.
        # Swap in a padded, batched generate() call for real throughput.
        batch_results = [redact_pii(t) for t in batch]
        results.extend(batch_results)
    return results
```

## 📝 Training Details

### Training Data

- **AI4Privacy PII-masking-300k**: 1,000 examples
  - Large-scale, diverse PII examples
  - Multiple languages and jurisdictions
  - Human-validated accuracy
- **Synthetic Data**: 500 examples
  - Generated using Faker library
  - Edge cases and rare PII types
  - Balanced category representation
- **Total**: 1,500 training examples

### Training Configuration

```yaml
Base Model: IBM Granite 4.0 Micro (3.2B parameters)
Method: LoRA (Low-Rank Adaptation)
Trainable Parameters: 38.4M (1.19% of total)
Training Hardware: NVIDIA L4 GPU
Training Time: ~7 minutes
Epochs: 1
Batch Size: 8 (2 × 4 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW 8-bit
Final Loss: 0.015-0.038
```

### Training Framework

- **Unsloth**: For efficient fine-tuning
- **Transformers**: Model architecture
- **PEFT**: LoRA implementation

## Privacy & Security

### Privacy Features

- **Local Inference**: Runs entirely on your infrastructure
- **No Data Sharing**: No data sent to external APIs or services
- **Open Source**: Full transparency in model architecture and training
- **Customizable**: Can be further fine-tuned on your specific data
- **Offline Capable**: Works without an internet connection

### Security Considerations

- The model detects PII but does not store it
- Inference happens in memory
- No logging of input/output by default
- Can be deployed in air-gapped environments
- Supports encrypted storage of model weights

## 📄 License

This model is released under the **Apache 2.0** license. You are free to:

- Use commercially
- Modify and distribute
- Use privately
- Benefit from the license's patent grant

## 🙏 Acknowledgments

- Built on **IBM Granite 4.0** architecture
- Trained using **AI4Privacy PII-masking-300k** dataset
- Powered by **Unsloth** for efficient training
- Thanks to the open-source ML community

## 📚 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{sentinel-pii-redaction-2025,
  author = {coolAI},
  title = {Sentinel PII Redaction: High-Accuracy Local PII Detection},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/coolAI/sentinel-pii-redaction}}
}
```

**Built with ❤️ for privacy-conscious AI development**
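
The audit example under Use Cases reads `pii_found.keys()` from a `detect_pii` helper that this card does not define. Below is a minimal sketch of one way to build that mapping, assuming the model's output follows the `[CATEGORY]` tag format shown in Expected Output. It counts the tags with a regular expression. The name `count_pii_tags` and the `KNOWN_CATEGORIES` set are illustrative and not part of the released model.

```python
import re
from collections import Counter

# Category names copied from the "Supported PII Categories" section of this card.
KNOWN_CATEGORIES = {
    "PERSON_NAME", "USERNAME", "AGE", "GENDER", "DEMOGRAPHIC_GROUP",
    "EMAIL_ADDRESS", "PHONE_NUMBER", "STREET_ADDRESS", "CITY", "STATE",
    "POSTCODE", "COUNTRY", "DATE", "DATE_OF_BIRTH", "PERSONAL_ID",
    "PASSPORT", "DRIVERLICENSE", "IDCARD", "SOCIALNUMBER",
    "CREDIT_CARD_INFO", "BANKING_NUMBER", "PASSWORD", "SECURE_CREDENTIAL",
    "MEDICAL_CONDITION", "NATIONALITY", "GEOCOORD", "ORGANIZATION_NAME",
    "BUILDING", "DOMAIN_NAME", "RELIGIOUS_AFFILIATION",
}

# Matches bracketed upper-case names such as [EMAIL_ADDRESS].
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def count_pii_tags(tagged_text: str) -> dict[str, int]:
    """Count [CATEGORY] tags in tagged text, ignoring unknown bracketed words."""
    counts = Counter(
        name
        for name in TAG_PATTERN.findall(tagged_text)
        if name in KNOWN_CATEGORIES
    )
    return dict(counts)


# The Expected Output string from Basic Usage, used here as sample input.
tagged = (
    "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. "
    "I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE]."
)
pii_found = count_pii_tags(tagged)
print(pii_found)
# {'PERSON_NAME': 1, 'EMAIL_ADDRESS': 1, 'STREET_ADDRESS': 1, 'CITY': 1, 'STATE': 1, 'POSTCODE': 1}
```

Filtering against the known category list keeps bracketed text that was already present in the input, such as `[TODO]`, from being counted as PII.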