Article Extraction Outcome Classifier

A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.

Model Description

This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.

Classes

| Class | Description |
| --- | --- |
| full_article_extracted | Complete article successfully extracted |
| partial_article_extracted | Article partially extracted (incomplete) |
| api_provider_error | External API/service failure |
| other_failure | Low-confidence failure (catch-all) |
| full_page_not_article | Page is not an article (nav, homepage, etc.) |

Performance

~90% accuracy on a 13,852-sample real-world test set, with strong performance on the dominant classes.

| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| full_article_extracted | 0.91 | 0.84 | 0.87 | 1,312 |
| partial_article_extracted | 0.76 | 0.63 | 0.69 | 92 |
| api_provider_error | 0.95 | 0.93 | 0.94 | 627 |
| other_failure | 0.41 | 0.28 | 0.33 | 44 |
| full_page_not_article | 0.92 | 0.97 | 0.94 | 11,821 |
| Accuracy | — | — | 0.90 | 13,852 |
| Macro avg | 0.79 | 0.73 | 0.72 | 13,852 |
| Weighted avg | 0.90 | 0.90 | 0.90 | 13,852 |
Usage

import numpy as np
import torch
from sklearn.preprocessing import StandardScaler

# Load model artifacts (weights_only=False is required on torch >= 2.6,
# since the checkpoint stores pickled sklearn/XGBoost objects, not just tensors)
artifacts = torch.load("artifacts.pt", weights_only=False)
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract features (27 features from the HTML prefix)
def extract_features(html: str, max_chars: int = 64000) -> dict:
    prefix = html[:max_chars].lower()
    
    features = {
        "length_chars": len(html),
        "prefix_len": len(prefix),
        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
        # Keyword counts
        "cookie": prefix.count("cookie") + prefix.count("consent"),
        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
        "article_kw": prefix.count("published") + prefix.count("reading time"),
        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
        # Tag counts
        "n_p": prefix.count("<p"),
        "n_a": prefix.count("<a"),
        "n_h1": prefix.count("<h1"),
        "n_h2": prefix.count("<h2"),
        "n_h3": prefix.count("<h3"),
        "n_article": prefix.count("<article"),
        "n_main": prefix.count("<main"),
        "n_time": prefix.count("<time"),
        "n_script": prefix.count("<script"),
        "n_style": prefix.count("<style"),
        "n_nav": prefix.count("<nav"),
    }
    
    # Density features
    kb = len(prefix) / 1000.0
    features["link_density"] = features["n_a"] / kb if kb > 0 else 0
    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]
    
    return features

# Predict (html_string holds the raw HTML of the page to classify)
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
            "n_script", "n_style", "n_nav", "link_density", "para_density", 
            "script_density", "heading_score"]

X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]

print(f"Outcome: {id_to_label[prediction]}")
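If the saved model follows the scikit-learn classifier API, `model.predict_proba(X_scaled)` yields a probability row that can be turned into a (label, confidence) pair instead of a bare class id. A minimal sketch: `top_prediction` is a hypothetical helper, and the id-to-label ordering shown is illustrative, not read from the artifacts.

```python
import numpy as np

def top_prediction(probs: np.ndarray, id_to_label: dict) -> tuple[str, float]:
    """Return the most likely label and its probability from one probability row."""
    idx = int(np.argmax(probs))
    return id_to_label[idx], float(probs[idx])

# Hypothetical id ordering and made-up probabilities for the five classes:
id_to_label = {
    0: "full_article_extracted",
    1: "partial_article_extracted",
    2: "api_provider_error",
    3: "other_failure",
    4: "full_page_not_article",
}
probs = np.array([0.05, 0.02, 0.01, 0.02, 0.90])
label, confidence = top_prediction(probs, id_to_label)
```

In a real pipeline the probability row would come from `model.predict_proba(X_scaled)[0]` rather than a hand-written array.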

Optional: Rule-Based Fast Path

For 80%+ of cases, you can skip the model entirely:

def apply_rules(features: dict) -> str | None:
    """Returns class label or None if ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"
    
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    
    return None  # Use ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result:
    print(f"Outcome (rule-based): {rule_result}")
else:
    # Fall back to model
    prediction = model.predict(X_scaled)[0]
    print(f"Outcome (model): {id_to_label[prediction]}")
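The rule/model hand-off above can be wrapped in a single dispatcher. This is a sketch of the pattern only: `classify_features` is a hypothetical helper, demonstrated with stand-in callables rather than the real `apply_rules` and XGBoost model.

```python
from typing import Callable, Optional

def classify_features(
    features: dict,
    rules: Callable[[dict], Optional[str]],
    model_fallback: Callable[[dict], str],
) -> str:
    """Cheap rules first; fall back to the ML model only when rules abstain (None)."""
    label = rules(features)
    return label if label is not None else model_fallback(features)

# Demo with stand-ins: the first page trips the error rule, the second falls through.
hit = classify_features(
    {"error": 5},
    lambda f: "api_provider_error" if f.get("error", 0) >= 3 else None,
    lambda f: "other_failure",
)
miss = classify_features(
    {"error": 0},
    lambda f: None,
    lambda f: "full_page_not_article",
)
```

In practice the two callables would be `apply_rules` and a closure over `scaler`, `model`, and `id_to_label`.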

Training Data

  • Dataset: Allanatrix/articles (194,183 HTML pages)
  • Labeled samples: 138,523 (LLM-labeled)
  • Labeling method: Distillation from large language models
    • Primary teacher: GPT-5
    • Secondary / adjudicator: Qwen
  • Train/Val/Test split: 110,819 / 13,852 / 13,852
  • Class distribution: ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
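The split sizes above are consistent with a plain 80/10/10 partition of the 138,523 labeled samples. A sketch of the arithmetic only (`split_sizes` is a hypothetical helper; the card does not state how the split was actually drawn):

```python
def split_sizes(n_total: int, val_frac: float = 0.10, test_frac: float = 0.10) -> tuple[int, int, int]:
    """Train/val/test sizes for an 80/10/10 split; train absorbs rounding."""
    n_val = round(n_total * val_frac)
    n_test = round(n_total * test_frac)
    return n_total - n_val - n_test, n_val, n_test

train, val, test = split_sizes(138_523)  # (110819, 13852, 13852), matching the card
```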

Model Details

  • Algorithm: XGBoost (GPU-trained)
  • Features: 27 hand-crafted features (HTML structure, keyword counts, density metrics)
  • Training: 500 boosting rounds with early stopping
  • Hardware: Single GPU (CUDA)
  • Training time: ~6 minutes
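A hypothetical XGBoost configuration consistent with the details above (500 rounds, early stopping, GPU training); every value not stated in the card (objective, early-stopping patience, tree method) is an illustrative guess, not the trained setting.

```python
# Illustrative config only: n_estimators, GPU use, and early stopping come
# from the card; the remaining values are guesses, not the trained settings.
xgb_params = {
    "objective": "multi:softprob",   # 5-way classification
    "n_estimators": 500,             # "500 boosting rounds"
    "early_stopping_rounds": 20,     # card says early stopping; patience is a guess
    "device": "cuda",                # "GPU-trained"
    "tree_method": "hist",
}
# Usage (requires xgboost):
#   xgboost.XGBClassifier(**xgb_params).fit(X_train, y_train, eval_set=[(X_val, y_val)])
```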

Features Used

  • Content statistics: length, whitespace ratio, digit and punctuation ratios
  • Keyword signals: error messages, article indicators, navigation text
  • HTML structure: paragraph, link, heading, script, style, and nav tag counts
  • Density metrics: links/KB, paragraphs/KB, scripts/KB, heading score

Limitations

  • Only analyzes the first 64KB of HTML (important metadata must appear early)
  • Labels are generated by LLMs rather than direct human annotation
  • Some classes (e.g. other_failure) have limited representation
  • Optimized primarily for English-language web pages
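Because only the first 64 KB of HTML is analyzed, it can be useful to log how much of each document the feature extractor actually saw. A small sketch (`prefix_coverage` is a hypothetical helper, not part of the artifacts):

```python
MAX_PREFIX_CHARS = 64_000

def prefix_coverage(html: str, max_chars: int = MAX_PREFIX_CHARS) -> float:
    """Fraction of the document visible to the 64 KB feature extractor."""
    if not html:
        return 1.0
    return min(len(html), max_chars) / len(html)

coverage = prefix_coverage("x" * 128_000)  # 0.5: half the page is never seen
```

Pages with low coverage and late-arriving metadata are the ones most likely to be misclassified.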

Intended Use

Primary use cases:

  • Quality control for article extraction pipelines
  • Monitoring extraction API health and failure modes
  • Fast filtering of non-article pages before downstream processing
  • Analytics on extraction success and failure rates

Not suitable for:

  • Language detection
  • Content quality assessment
  • Paywall detection
  • Full content extraction

Model Card Authors

Allanatrix
