Improve model card: Add project page, abstract, key results, and comprehensive tags

#2
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +61 -3
README.md CHANGED
@@ -1,9 +1,67 @@
  ---
- license: apache-2.0
  library_name: transformers
  pipeline_tag: text-classification
  ---

- This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992).

- Code: https://github.com/Trustworthy-ML-Lab/CB-LLMs

  ---
  library_name: transformers
+ license: apache-2.0
  pipeline_tag: text-classification
+ tags:
+ - text-generation
+ - interpretable-ai
+ - concept-bottleneck
+ - llm
  ---

+ # Concept Bottleneck Large Language Models
+
+ This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.
+
+ - **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
+ - **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
+ - **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)
+
+ ## Abstract
+
+ We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs is competitive with, and at times outperforms, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.
+
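+ To make the pattern in the abstract concrete, below is a toy sketch of a concept bottleneck head: a backbone embedding is projected onto a layer of interpretable concept neurons, and a final linear layer maps those concept scores to labels. The class, dimensions, and concept count are illustrative assumptions, not the paper's implementation; see the GitHub repository for the real architecture.
+
+ ```python
+ # Toy sketch of a concept bottleneck head (illustrative, not the paper's code).
+ import torch
+ import torch.nn as nn
+
+ class ConceptBottleneckHead(nn.Module):
+     def __init__(self, hidden_dim: int, n_concepts: int, n_classes: int):
+         super().__init__()
+         # Bottleneck layer: each unit corresponds to one human-readable concept.
+         self.to_concepts = nn.Linear(hidden_dim, n_concepts)
+         # The final predictor reads only the interpretable concept scores.
+         self.to_labels = nn.Linear(n_concepts, n_classes)
+
+     def forward(self, embedding: torch.Tensor) -> torch.Tensor:
+         concepts = self.to_concepts(embedding)  # inspect these for explanations
+         return self.to_labels(concepts)
+
+ # hidden_dim=768 matches a RoBERTa-base backbone; the concept count is arbitrary here.
+ head = ConceptBottleneckHead(hidden_dim=768, n_concepts=100, n_classes=2)
+ logits = head(torch.randn(1, 768))
+ ```
+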
+ ## Usage
+
+ For detailed installation instructions, training procedures, and usage examples (including how to test concept detection and steerability, and how to generate sentences), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).
+
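+ As a quick starting point, the sketch below shows one way to load a classification checkpoint with `transformers`. The repository id is a placeholder, and the `AutoModelForSequenceClassification` entry point is an assumption -- the CB-LLM architecture may instead require the custom model classes from the GitHub repository.
+
+ ```python
+ # Hedged sketch: loading a text-classification checkpoint with transformers.
+ # The repo id is a placeholder; the Auto* classes are an assumption and may
+ # need to be replaced by the custom CB-LLM classes from the GitHub repo.
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ repo_id = "your-org/cb-llm-checkpoint"  # placeholder, replace with this model's id
+ tokenizer = AutoTokenizer.from_pretrained(repo_id)
+ model = AutoModelForSequenceClassification.from_pretrained(repo_id)
+
+ inputs = tokenizer("A gripping, beautifully shot film.", return_tensors="pt")
+ with torch.no_grad():
+     pred_class = model(**inputs).logits.argmax(dim=-1).item()
+ print(pred_class)
+ ```
+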
+ ## Key Results
+
+ ### Part I: CB-LLM (classification)
+
+ CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).
+
+ | Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
+ |-----------------------|--------|---------|---------|----------|
+ | **Ours:** | | | | |
+ | CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
+ | CB-LLM w/ ACC | **0.9407** | **<span style="color:blue">0.9806</span>** | **0.9453** | **<span style="color:blue">0.9928</span>** |
+ | **Baselines:** | | | | |
+ | TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
+ | RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |
+
+ ### Part II: CB-LLM (generation)
+
+ The table below reports the accuracy, steerability, and perplexity of CB-LLMs (generation). CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing much higher steerability (↑).
+
+ | Method | Metric | SST2 | YelpP | AGnews | DBpedia |
+ |---------------------------------|------------------|---------|--------|---------|---------|
+ | **CB-LLM (Ours)** | Accuracy ↑ | 0.9638 | **0.9855** | 0.9439 | 0.9924 |
+ | | Steerability ↑ | **0.82** | **0.95** | **0.85** | **0.76** |
+ | | Perplexity ↓ | 116.22 | 13.03 | 18.25 | 37.59 |
+ | **CB-LLM w/o ADV training** | Accuracy ↑ | 0.9676 | 0.9830 | 0.9418 | **0.9934** |
+ | | Steerability ↑ | 0.57 | 0.69 | 0.52 | 0.21 |
+ | | Perplexity ↓ | **59.19** | 12.39 | 17.93 | **35.13** |
+ | **Llama3 finetuned (black-box)** | Accuracy ↑ | **0.9692** | 0.9851 | **0.9493** | 0.9919 |
+ | | Steerability ↑ | No | No | No | No |
+ | | Perplexity ↓ | 84.70 | **6.62** | **12.52** | 41.50 |
+
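+ Steerability refers to intervening on the interpretable concept neurons to control what the model generates. The toy sketch below illustrates the idea only in spirit: the dimensions, module, and clamping value are invented for illustration and do not reflect the paper's implementation or evaluation protocol.
+
+ ```python
+ # Toy illustration of steering via a concept bottleneck (not the paper's code).
+ import torch
+ import torch.nn as nn
+
+ torch.manual_seed(0)
+ hidden = torch.randn(1, 64)          # stand-in for a backbone hidden state
+ to_concepts = nn.Linear(64, 4)       # 4 interpretable concept neurons (arbitrary)
+ lm_head = nn.Linear(4, 100)          # stand-in output head over a tiny vocabulary
+
+ concepts = to_concepts(hidden)
+ steered = concepts.clone()
+ steered[0, 2] = 10.0                 # clamp concept 2 "on" to steer the output
+
+ # Compare the next-token choice before and after the intervention.
+ print(lm_head(concepts).argmax(-1).item(), lm_head(steered).argmax(-1).item())
+ ```
+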
+ ## Citation
+
+ If you find this work useful, please cite the paper:
+
+ ```bibtex
+ @inproceedings{cbllm,
+   title={Concept Bottleneck Large Language Models},
+   author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
+   booktitle={International Conference on Learning Representations (ICLR)},
+   year={2025}
+ }
+ ```