#1: Building a Domain-Specific Evaluation Dataset from Chatbot Arena Data

Description

We discussed Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, a paper on building domain-specific evaluation datasets from Chatbot Arena data. A transcript is available on LISTEN, a podcast transcription service.

Shownotes:
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Chat with Open Large Language Models
From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline | LMSYS Org
Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
https://x.com/karpathy/status/1737544497016578453
https://github.com/lm-sys/arena-hard-auto/tree/main/BenchBuilder

Hosts: seya (@sekikazu01), kagaya (@ry0_kaga)

More Episodes

#10: Agent-as-a-judge: Agents That Evaluate Agents

Inspired by LLM-as-a-Judge, the paper proposes using agentic systems to evaluate agentic systems: Agent-as-a-Judge: Evaluate Agents with...

Published 11/18/24

AI Engineering Now

Published 11/18/24

#9: Impressions from Trying the Currently Trendy(!?) In-House v0 Development

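Inspired by Ubie's example, the two of us started building an in-house v0. We talked about what we have learned and struggled with during development, and about design AI tools such as Figma AI. A transcript is available on LISTEN, a podcast transcription service.

Shownotes:
https://v0.dev/
https://www.figma.com/ja-jp/ai/
https://x.com/sys1yagi/status/1850763720630387170

Hosts: seya (@sekikazu01), kagaya (@ry0_kaga)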

Published 11/14/24