Improve model card: Add project page, abstract, key results, and comprehensive tags

#2
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +61 -3
README.md CHANGED
@@ -1,9 +1,67 @@
  ---
- license: apache-2.0
  library_name: transformers
  pipeline_tag: text-classification
  ---

- This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992).

- Code: https://github.com/Trustworthy-ML-Lab/CB-LLMs

  ---
  library_name: transformers
+ license: apache-2.0
  pipeline_tag: text-classification
+ tags:
+ - text-generation
+ - interpretable-ai
+ - concept-bottleneck
+ - llm
  ---

+ # Concept Bottleneck Large Language Models
+
+ This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.
+
+ - **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
+ - **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
+ - **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)
+
+ ## Abstract
+
+ We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs is competitive with, and at times outperforms, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.
+
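+ To make the pattern in the abstract concrete, below is a toy sketch of a concept bottleneck head: a backbone embedding is projected onto a layer of interpretable concept neurons, and a final linear layer maps those concept scores to labels. The class, dimensions, and concept count are illustrative assumptions, not the paper's implementation; see the GitHub repository for the real architecture.
+
+ ```python
+ # Toy sketch of a concept bottleneck head (illustrative, not the paper's code).
+ import torch
+ import torch.nn as nn
+
+ class ConceptBottleneckHead(nn.Module):
+     def __init__(self, hidden_dim: int, n_concepts: int, n_classes: int):
+         super().__init__()
+         # Bottleneck layer: each unit corresponds to one human-readable concept.
+         self.to_concepts = nn.Linear(hidden_dim, n_concepts)
+         # The final predictor reads only the interpretable concept scores.
+         self.to_labels = nn.Linear(n_concepts, n_classes)
+
+     def forward(self, embedding: torch.Tensor) -> torch.Tensor:
+         concepts = self.to_concepts(embedding)  # inspect these for explanations
+         return self.to_labels(concepts)
+
+ # hidden_dim=768 matches a RoBERTa-base backbone; the concept count is arbitrary here.
+ head = ConceptBottleneckHead(hidden_dim=768, n_concepts=100, n_classes=2)
+ logits = head(torch.randn(1, 768))
+ ```
+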
+ ## Usage
+
+ For detailed installation instructions, training procedures, and usage examples (including how to test concept detection and steerability, and how to generate sentences), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).
+
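+ As a quick starting point, the sketch below shows one way to load a classification checkpoint with `transformers`. The repository id is a placeholder, and the `AutoModelForSequenceClassification` entry point is an assumption -- the CB-LLM architecture may instead require the custom model classes from the GitHub repository.
+
+ ```python
+ # Hedged sketch: loading a text-classification checkpoint with transformers.
+ # The repo id is a placeholder; the Auto* classes are an assumption and may
+ # need to be replaced by the custom CB-LLM classes from the GitHub repo.
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ repo_id = "your-org/cb-llm-checkpoint"  # placeholder, replace with this model's id
+ tokenizer = AutoTokenizer.from_pretrained(repo_id)
+ model = AutoModelForSequenceClassification.from_pretrained(repo_id)
+
+ inputs = tokenizer("A gripping, beautifully shot film.", return_tensors="pt")
+ with torch.no_grad():
+     pred_class = model(**inputs).logits.argmax(dim=-1).item()
+ print(pred_class)
+ ```
+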
+ ## Key Results
+
+ ### Part I: CB-LLM (classification)
+
+ CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).
+
+ | Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
+ |-----------------------|--------|---------|---------|----------|
+ | **Ours:** | | | | |
+ | CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
+ | CB-LLM w/ ACC | **0.9407** | **<span style="color:blue">0.9806</span>** | **0.9453** | **<span style="color:blue">0.9928</span>** |
+ | **Baselines:** | | | | |
+ | TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
+ | RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |
+
+ ### Part II: CB-LLM (generation)
+
+ The table below reports the accuracy, steerability, and perplexity of CB-LLMs (generation). CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing much higher steerability (↑).
+
+ | Method | Metric | SST2 | YelpP | AGnews | DBpedia |
+ |---------------------------------|------------------|---------|--------|---------|---------|
+ | **CB-LLM (Ours)** | Accuracy ↑ | 0.9638 | **0.9855** | 0.9439 | 0.9924 |
+ | | Steerability ↑ | **0.82** | **0.95** | **0.85** | **0.76** |
+ | | Perplexity ↓ | 116.22 | 13.03 | 18.25 | 37.59 |
+ | **CB-LLM w/o ADV training** | Accuracy ↑ | 0.9676 | 0.9830 | 0.9418 | **0.9934** |
+ | | Steerability ↑ | 0.57 | 0.69 | 0.52 | 0.21 |
+ | | Perplexity ↓ | **59.19** | 12.39 | 17.93 | **35.13** |
+ | **Llama3 finetuned (black-box)** | Accuracy ↑ | **0.9692** | 0.9851 | **0.9493** | 0.9919 |
+ | | Steerability ↑ | No | No | No | No |
+ | | Perplexity ↓ | 84.70 | **6.62** | **12.52** | 41.50 |
+
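+ Steerability refers to intervening on the interpretable concept neurons to control what the model generates. The toy sketch below illustrates the idea only in spirit: the dimensions, module, and clamping value are invented for illustration and do not reflect the paper's implementation or evaluation protocol.
+
+ ```python
+ # Toy illustration of steering via a concept bottleneck (not the paper's code).
+ import torch
+ import torch.nn as nn
+
+ torch.manual_seed(0)
+ hidden = torch.randn(1, 64)          # stand-in for a backbone hidden state
+ to_concepts = nn.Linear(64, 4)       # 4 interpretable concept neurons (arbitrary)
+ lm_head = nn.Linear(4, 100)          # stand-in output head over a tiny vocabulary
+
+ concepts = to_concepts(hidden)
+ steered = concepts.clone()
+ steered[0, 2] = 10.0                 # clamp concept 2 "on" to steer the output
+
+ # Compare the next-token choice before and after the intervention.
+ print(lm_head(concepts).argmax(-1).item(), lm_head(steered).argmax(-1).item())
+ ```
+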
+ ## Citation
+
+ If you find this work useful, please cite the paper:
+
+ ```bibtex
+ @inproceedings{cbllm,
+   title={Concept Bottleneck Large Language Models},
+   author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
+   booktitle={International Conference on Learning Representations (ICLR)},
+   year={2025}
+ }
+ ```