TRUEBench: Explore and compare language model performance across categories and languages