ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt
Description
Our second wave of speakers for AI Engineer World’s Fair was announced! The conference sold out of Platinum/Gold/Silver sponsors and Early Bird tickets! See our Microsoft episode for more info and buy now with code LATENTSPACE.

This episode is straightforwardly a part 2 to our ICLR 2024 Part 1 episode, so without further ado, we’ll just get right on with it!

Timestamps

[00:03:43] Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger
* [00:07:44] WebArena
* [00:18:45] Sotopia
* [00:24:00] Performance Improving Code Edits
* [00:29:39] OpenDevin
* [00:47:40] Industry and Academia

[01:05:29] Section B: Benchmarks
* [01:05:52] SWE-bench
* [01:17:05] SWE-bench/SWE-agent Interview
* [01:27:40] Dataset Contamination Detection
* [01:39:20] GAIA Benchmark
* [01:49:18] Moritz Hardt - Science of Benchmarks

[02:36:32] Section C: Reasoning and Post-Training
* [02:37:41] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
* [02:51:00] Let’s Verify Step By Step
* [02:57:04] Noam Brown
* [03:07:43] Lilian Weng - Towards Safe AGI
* [03:36:56] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
* [03:48:43] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

[04:00:51] Bonus: Notable Related Papers on LLM Capabilities

Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger

* Guests
  * Graham Neubig
  * Aman Sanger - Previous guest and NeurIPS friend of the pod!
* WebArena
* Sotopia (spotlight paper, website)
* Learning Performance-Improving Code Edits
* OpenDevin
* Junyang OpenDevin
* Morph Labs, Jesse Han
* SWE-bench
* SWE-agent
* Aman tweet on SWE-bench
* LiteLLM
* LiveCodeBench
* the role of code in reasoning
* Language Models of Code are Few-Shot Commonsense Learners
* Industry vs academia
* the matryoshka embeddings incident
* other directions
* Unlimiformer

Section A timestamps

* [00:00:00] Introduction to Guests and the Impromptu Nature of the Podcast
* [00:00:45] Graham's Experience in Japan and Transition into Teaching NLP
* [00:01:25] Discussion on What Constitutes a Good Experience for Students in NLP Courses
* [00:02:22] The Relevance and Teaching of Older NLP Techniques Like N-gram Language Models
* [00:03:38] Speculative Decoding and the Comeback of N-gram Models
* [00:04:16] Introduction to WebArena and Sotopia Projects
* [00:05:19] Deep Dive into the WebArena Project and Benchmarking
* [00:08:17] Performance Improvements in WebArena Using GPT-4
* [00:09:39] Human Performance on WebArena Tasks and Challenges in Evaluation
* [00:11:04] Follow-up Work from WebArena and Focus on Web Browsing as a Benchmark
* [00:12:11] Direct Interaction vs. Using APIs in Web-Based Tasks
* [00:13:29] Challenges in Base Models for WebArena and the Potential of Visual Models
* [00:15:33] Introduction to Sotopia and Exploring Social Interactions with Language Models
* [00:16:29] Different Types of Social Situations Modeled in Sotopia
* [00:17:34] Evaluation of Language Models in Social Simulations
* [00:20:41] Introduction to Performance-Improving Code Edits Project
* [00:26:28] Discussion on Devin and the Future of Coding Agents
* [00:32:01] Planning in Coding Agents and the Development of OpenDevin
* [00:38:34] The Changing Role of Academia in the Context of Large Language Models
* [00:44:44] The Changing Nature of Industry and Academia Collaboration
* [00:54:07] Update on NLP Course Syllabus and Teaching about Large Language Models
* [01:00:40] Call to Action: Contributions to OpenDevin and Open Source AI Projects
* [01:01:56] Hiring at Cursor for Roles in Code Generation and Assistive Coding
* [01:02:12] Promotion of the AI Engineer Conference

Section B: Benchmarks

* Carlos Jimenez & John Yang (Princeton) et al: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR Oral, Paper, website)
  * “We introduce SWE-bench, an evaluation fram