Description
We talked about Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, a paper on using Chatbot Arena data to build domain-specific evaluation datasets.
Transcripts of this podcast are available on the podcast transcription service "LISTEN".
Show notes:
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Chat with Open Large Language Models
From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline | LMSYS Org
Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
https://x.com/karpathy/status/1737544497016578453
https://github.com/lm-sys/arena-hard-auto/tree/main/BenchBuilder
Speakers:
seya(@sekikazu01)
kagaya(@ry0_kaga)