Alex Graveley on Building Minion
Alex Graveley, creator of GitHub Copilot, and founder of AI agent company Minion, joins Anton to discuss what it takes to make an AI application actually work for real users.
We discuss design patterns (and anti-patterns) that make for a successful AI application, how to build applications that have LLMs interact with web browsers, and how to build iterative user interaction models. We also talk in depth about the role of model fine-tuning in AI application development: what to expect, how to gather data from both human and model-based labeling, and how to evaluate the resulting improvements.
Released July 24, 2024, Recorded July 3, 2024
Transcript

Alex on X (Twitter): @alexgraveley

Anton on X (Twitter): @atroyn


Anton

That's really the big explosion here is, you no longer have to be a part of a large AI research lab to pick up and work with this technology. It's as general, as ubiquitous as the web itself. I can hit an API or I can run even a local model these days and build something.

Alex

Many people thought it was a waste of time to work on Copilot. We had a tiny, tiny little team. I mean, I had been tracking AI stuff for years at this point, and so I jumped at it. And no one else wanted to do it, no one else believed in AI at that time. And most people just thought AI apps didn't work.

Anton

This is our series, Building With AI. It's intended to help application developers actually figure out what it takes to build something useful with LLMs or other types of AI models. There's not much content out there which is actually practical, useful advice for how to build AI applications. There's a lot of speculation, there's a lot of talk about the models and what they could do, but we're much more interested in application development. I think a great place to start would be for you to introduce yourself, talk a little bit about Minion.

Alex

Sure. Yeah. I'm Alex Graveley. Let's see, I've been doing AI apps since, I guess the first was in 2016, maybe a year before Transformers came out. Back then we were using sequence-to-sequence models and much, much smaller context windows, but the premise was there. You could translate stuff, and not just English to French, but one piece of text to another piece of text. And so I played around with that for a long time and made my first system, which was for these personal assistants that were chatting with people and going on the web and then doing tasks for them, so things like managing email and calendar, buying stuff, paying bills, that kind of thing. And so at the time, all we could really do was take chat context, put it into this last-generation model, get a suggestion out, and then show it to the human who would then edit it or decide what to do. It would help train people, because these assistants had varying levels of experience being a personal assistant. And so one really interesting aspect is that it would give people canned responses, so you didn't have to teach them as much about how to respond, or how to be nice when someone's being mean, or how to formulate an answer that's in the future like, "Okay, I'll get back to you in an hour, blah, blah, blah, blah." What I couldn't figure out was how to interact with the real world at that time.

Anton

This was in 2016?

Alex

2016, yeah. So then I left and worked on some other small startups. One that was AI-based was hCaptcha; I shipped the front end for hCaptcha, which was the first reCAPTCHA competitor, and it now has some insane amount of traffic. And in fact, I think they're either competitive with or have more traffic than reCAPTCHA, which is insane.

Anton

They're being used to refine image diffusion models from what I've seen, right?

Alex

They do a bunch of stuff, yeah. I mean, it was really set up as a two-sided marketplace, where you could get your labeling done and then humans would prove themselves, and you'd use-

Anton

Yeah, I mean, that was the reCAPTCHA approach originally too.

Alex

Yeah, except that the labeling part was only for Google employees.

Anton

Yeah, right, sure.

Alex

And so that was the idea... And I think now, it turns out with the image models, you can just generate images and just have people label them, and then you don't even need the other side of the marketplace, right?

Anton

Yes.

Alex

So it becomes an even more awesome business.

Anton

So hCaptcha and then-

Alex

Yeah, then I worked on a cryptocurrency called MobileCoin, it's still built into Signal, and still really awesome tech. I don't think it got the attention it deserves, but so be it. All right, so 2020, I ended up at GitHub, and the opportunity came up to work with OpenAI, who hadn't really exploded into notoriety yet. I mean, I had been tracking AI stuff for years at this point, and so I jumped at it. I was like, "Please let me go work on..." And no one else wanted to do it. No one else believed in AI at that time. Many people thought it was a waste of time to work on Copilot. We had a tiny, tiny little team, and most people just thought AI apps didn't work. That was the prevailing understanding. They just never work, they either don't work well or the scope is too narrow, or their features are not something that you can make money on.

Anton

Yep. Microsoft's come a long way since then.

Alex

Yeah.

Anton

I got a whole button for Copilot now.

Alex

I feel like that's important. I don't know. I feel like it's kind of cool. That's my favorite part of the outcome of Copilot, is there's a button now.

Anton

There's a button now, yeah. I mean, again, it reminds me, my favorite metaphor for where we're at right now in AI is the early web, and it does remind me of when keyboards had the web button, which had-

Alex

That's a really good point.

Anton

Yeah, it's just like it, isn't it?

Alex

Yeah. No, you're right.

Anton

Anyway, so obviously Copilot is huge, it's probably my number one most used AI application currently, next to the various chats, I have them compete against each other to give me the best answer, but let's move one step beyond that. Let's talk about Minion.

Alex

Sure. Yeah, so after Copilot GA'd, I left to go do this old idea, which is, can I have a computer do assistant tasks for people? I had tried it with humans acting as assistants and operating web browsers, and so I knew the sort of tasks that people want to do and the complexities involved, and it's been something I've wanted to tackle for years now. It was a good time to do it because the models were finally good enough, basically. The context windows were just starting to be large enough.

Anton

That's a great starting point actually, to get into some of the weeds here. So why does a context window length matter for enabling something like Minion?

Alex

Well, I guess it depends. I mean, right now, it doesn't. Now we've moved that-

Anton

Now it's blown wide open, right?

Alex

I like the metaphor of the arch, because you build up the arch and then you put a keystone in, and then the arch is freestanding. But you need the scaffolding around the arch in order to build up the arch so that it's stable enough to put the keystone in. And so a lot of these kinds of technologies are enabling scaffolding. And so what Minion is, it's a chat interface, you go and you tell it, "I need to book a hotel." It'll go search Google, ask you some questions, go to Expedia or Booking.com or whatever you tell it, fill out the form, present some options, check everything through, and then ask you to pay, or if you have other options, you can tell it to do something else. And so what that is, basically, is you're taking a snapshot of the webpage, you've taken a snapshot of the chat, you're feeding them into an LLM, and the LLM is telling you what to do next. So that's either clicking on something or filling in a form or replying to the user. There's a few others, but that's the crux of it. And so it turns out webpages, especially on large-scale websites like Amazon, are gigantic. They're half a megabyte, which is a lot of tokens, way beyond what the context size was at the time. And so our first attempts were, okay, well, let's just take all of the interactive fields out of the HTML, and then we will essentially re-rank... We'll just rank them, essentially.

Anton

So you traverse the HTML to extract those fields?

Alex

Yeah. So we pull buttons and entry fields and this kind of stuff, and then essentially rank them and then take the action that was on the top. And that essentially fit into context.

Anton

So that was basically using the model as more or less a classifier or some sort of ranking model.

Alex

Yeah, more like a ranker. You're basically like, "Here's a bunch of options, pick one." And in fact, the context sizes were so small at that time, what you had to do was make several calls, say, "Here's 10 options or 20 options."

Anton

Binary search your [inaudible 00:07:49]-

Alex

Yeah, binary search your way to an answer, which sucks.
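
A minimal sketch of the batched-ranking approach just described, assuming a hypothetical `llm_pick_best` helper that prompts the model with a small batch of candidates and returns the index of the most relevant one; none of this is Minion's actual code.

```python
# Hypothetical sketch: rank interactive elements in small batches when the
# context window can't hold them all at once.

def pick_best_action(candidates, task, llm_pick_best, batch_size=10):
    """Tournament-style selection: compare candidates a few at a time,
    keep each round's winner, and repeat until one remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            # llm_pick_best(task, batch) is assumed to return the index of
            # the element most relevant to the task.
            next_round.append(batch[llm_pick_best(task, batch)])
        pool = next_round
    return pool[0]
```

Each elimination round strips away the surrounding page context, which is a big part of why this approach degrades, as comes up next.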

Anton

Well, of course. And the LLM's not guaranteed to have transitive preferences either.

Alex

Yeah, yeah. And when you're taking these things out of context, you just lose everything.

Anton

Yeah. Yep.

Alex

So it's like if you evaluate a bunch of sentences in isolation, they're very different from what you might see when reading an article, say.

Anton

Yes.

Alex

And so that was the first step, and then the second step was, okay, right around that time, longer context models started to come out. And so we were able to fit things in, and we did a whole bunch of work to compress the DOMs down as much as possible, and then we were able to just get them into the long context models at the time. And then we could say, "Okay, we'll pick the action to take."

Anton

Were you running your own model or was this say over OpenAI or-

Alex

This was OpenAI. Yeah, this was OpenAI at that time. And so that worked a lot better. But the important thing is they gave us good enough performance where we could start to build intuition about, given a large page, what should we focus on? Which parts of the page should we focus on? And so back to the scaffolding metaphor. Over time, we built enough of a data set where you can then just look at a page in its entirety, pick the parts that are important, and then send those to an LLM to decide what to do. So essentially, you're chunking on the fly, and then picking which chunks to use. Still with the metaphor of here's the page, here's some chat, pick the next action. And so that long context was really important, because we were able to reason over long contexts, so that was a big unlock for sure.

Anton

So let's take maybe a step back here. Could you describe some of the core loop of how Minion works in practice? So my understanding is Minion basically drives a web browser for you in order to interactively achieve the task that you're trying to perform without you as the human having to actually navigate all these websites, right?

Alex

Sure.

Anton

So what's the core loop here? You've already mentioned this point where Minion sees a website, it basically processes the DOM, or you've extracted the most relevant components or compressed it down so that it actually fits inside the LLM. So what actually happens? From the point where a user requests a task or makes a query, what goes on?

Alex

What goes on? So ours is a very simple system. I have very strong opinions about, you have to keep these things simple. You're dealing with language. Language is infinitely complex. Every single word that you add changes the meaning drastically.

Anton

Yes. Also for the LLMs.

Alex

Yeah, also for the LLMs, because you're navigating this complex space, and so a word changes the part of the manifold that you're in.

Anton

Right. I've always felt that, we talk about controlling these things in terms of prompting, and obviously even when you're asking, say, GPT or your own model, which we'll get to in a minute, to digest the contents of this website, you're prompting it to do so. We're attempting to steer these computational machines in natural language, which, since the start, has felt to me like trying to pick a lock with a wet noodle, more or less. We're not really steering them, we're just trying to condition them in the way that we want. Keeping it simple is probably the single biggest lesson from everyone we've spoken to, and in my own experience as well, it's just like, don't try to build too much at once.

Alex

The smaller-

Anton

The smaller, the better.

Alex

Yeah. Just think in terms of spaces. Shrink the space down. Whatever you can do to shrink the space. And every time you're like... There's uncertainty at every step, so you get this compounding uncertainty where it's like if you're traversing through a pipeline of LLM calls, each one of them is right maybe 90% of the time, but by the end, you've got slop.
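
For concreteness, and assuming roughly independent errors: a five-call pipeline where each call is right 90% of the time succeeds end to end only about 0.9^5 ≈ 59% of the time, and at ten calls that drops to roughly 35%.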

Anton

Right. So what are the steps? I ask Minion to do something for me, it's got my request, then what?

Alex

So you're on a start page, it's got a search box. You say, "Hey, I want to book a hotel." So it says, "Okay, the current page has a link for booking hotels. I'll click that link." Now you're on Expedia. So now we pull the content from Expedia, we decide which parts are useful. The LLM gets that as input and says, "Okay, in the past, the user asked me to book a hotel, I've clicked on a link that brings me to Expedia, I'm now on Expedia, here's the content, what should I do next? Oh, I see a date field and a location field and whatever. I don't have the information to fill those out. So what do I do? Okay, I'm going to respond to the user and say, 'Okay, sure. When do you want to go? Where do you want to go?'" And so you can imagine that through-

Anton

Multiple steps.

Alex

Multiple steps.

Anton

Until we complete the task, right?

Alex

Exactly.
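
A rough sketch of the per-step decision in the loop Alex just walked through: a snapshot of the page plus the chat go in, and one structured action comes out. The action names, prompt shape, and `llm` callable are illustrative assumptions, not Minion's actual interface.

```python
import json

# Illustrative action space: click an element, fill a field, or reply to the user.
ACTIONS = {"click", "fill", "reply"}

def decide_next_action(page_chunks, chat_history, llm):
    """One step of the loop: show the model the relevant page content and the
    conversation so far, and parse a single structured action back out."""
    prompt = (
        "You are operating a web browser on behalf of the user.\n\n"
        f"Relevant page content:\n{page_chunks}\n\n"
        f"Conversation so far:\n{chat_history}\n\n"
        'Respond with JSON: {"action": "click|fill|reply", "target": ..., "value": ...}'
    )
    action = json.loads(llm(prompt))  # llm is assumed to return the raw completion text
    assert action["action"] in ACTIONS
    return action
```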

Anton

So there's already three interesting things in the way that you've described it that I want to drill down more on.

Alex

Okay.

Anton

So the first thing I was going to ask you is, you mentioned early on it's like even... Let's use this booking a hotel or a flight example, right?

Alex

Sure.

Anton

You're saying that somehow Minion remembers the user's preferences. How is that recalled and injected into context?

Alex

Yeah, so right now, our preferences are... we've so far just been focused entirely on "do the right thing on an arbitrary website." So we're not worried about preferences yet, although we do have some preferences like your name, your email, this kind of stuff. Doing preferences, I think, is a much more open question. So how we do that, I don't know yet.

Anton

Yeah, I've heard different versions of this. We spoke to Flo from Lindy recently, and he gave us a metaphor of having a human almost train the machine, and then the machine is able to recall, "What did the human ask me to do last time?" There's plenty of ways to approach this.

Alex

There's a lot of ways to approach it. Okay, there's two realistic ways you can do it. You can either go through the conversation and say, "Okay, what kind of stuff do I need to extract? Extract a JSON of key-values from this conversation that I should update." And then maybe you've got other key-values that you've had before, and now you layer the new ones on top. But that's not how humans work. Humans are complicated, and the world is complicated.
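
A sketch of the key-value approach described here, assuming a hypothetical `llm_json` helper that returns parsed JSON; new values are simply layered over stored ones.

```python
def update_preferences(stored_prefs, conversation, llm_json):
    """Extract preferences from the latest conversation as flat key-values and
    layer them on top of what was stored before (new values win)."""
    extracted = llm_json(
        "Extract the user's preferences from this conversation as a flat JSON "
        "object of key-value pairs (e.g. preferred_airline, home_airport). "
        "Only include values the user actually stated.\n\n" + conversation
    )
    merged = dict(stored_prefs)
    merged.update(extracted)  # naive layering; real preferences are messier
    return merged
```

As the next exchange points out, flat key-values break down quickly: the preferred airline may depend on whether the trip is domestic or international.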

Anton

That's right. The world is fractally, fractally complex.

Alex

Yeah. And so maybe I want to use a certain airline like Southwest where I'm flying to a certain part of the country, and maybe when I'm flying overseas, I want to preference another thing. And there's some interesting work actually on creating these taxonomies, but yeah.

Anton

The second thing that I was curious about is, you said that when the agent or the model realizes it doesn't have certain information, it'll ask the user. How does it detect that?

Alex

For whatever reason, the magic of LLMs is kept out of popular discussion, I think for various reasons that I don't fully understand, but I think it's useful to have a simple metaphor for understanding what's going on. And I lean on this metaphor now. I had a rough theory of this metaphor in the Copilot days, and I more strongly believe in the metaphor now, which is LLMs are function approximators. They're approximating a function based on the examples that you show it. If you give it an example that is not what it's seen before, it's going to find roughly the closest two examples and interpolate between them, something like that.

Anton

Something like that. Francois Chollet also has that perspective, right? Vector programs essentially.

Alex

Yeah. And I think that's fine. It's unclear to me that the big question is interpolation versus extrapolation. I don't think it's really that interesting.

Anton

I actually agree with you here completely. I think that actually which one it is isn't that informative for how we build with them, but it might be informative to how we think about them, and that's very helpful. So as you're saying, they're interpolating function approximators for the purpose of you reasoning about them in this particular task.

Alex

Yeah. Literally in this example.

Anton

Yes.

Alex

So the way that you get the output that you want is to either essentially prompt your way into it, so you're forcing the behaviors that you want, or the more classic approach is you're fine-tuning and essentially giving it more examples. So you're saying, "Here's the examples of what I want. Here's an input, here's an output. This is an example. Add this to your knowledge of how to interpolate when you see it at inference time." So what happened early on was, it would either just make up names, make up an email address, not ask anything, all this kind of stuff. And so you have to work through each one of those. Each one of those is a data set. So it's like, "Okay, well," it makes up... It always fills in John Doe, it's very annoying. How do we make it not fill in John Doe? Okay, well, there's some prompting hacks we tried that worked for a while, but-

Anton

What were some of those, if you don't mind?

Alex

Oh, yeah. A good prompting hack was like user information, name, bracket, undefined. So the LLM says, "Oh, well, okay, clearly I don't have the user's name, here it is telling me that it's undefined. I should ask." And so that was a prompting hack that then enables you to generate more training data.
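
A sketch of what that hack might look like in the prompt: every field the model might need is listed, and the ones the system does not know are explicitly marked undefined, so the model asks instead of inventing a value. The exact wording is an assumption.

```python
def render_user_info(known_fields, wanted_fields):
    """Render a user-information block, marking missing fields as [undefined]
    so the model can see explicitly what it does not have."""
    lines = ["User information:"]
    for field in wanted_fields:
        lines.append(f"  {field}: {known_fields.get(field, '[undefined]')}")
    return "\n".join(lines)

# Example: the name is unknown, so the rendered block nudges the model to ask for it.
print(render_user_info({"email": "user@example.com"}, ["name", "email", "phone"]))
```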

Anton

So you have an in-context example of-

Alex

Yeah, we did at some point. I don't think that we do anymore, but-

Anton

Okay. And then you moved off of that and what do you do now?

Alex

For some information, it's still in context. But I guess my point is, what you're trying to do is use all these tools to get the output that you want. Sometimes it's prompting, sometimes it's prompting to generate a data set to fine-tune a model so that you get the output you want without needing the prompting, sometimes it's both. Sometimes, and in our case this is rarer now that we have much better support for diverse HTML content, you have to go in and hack something, like add a hint to what this control does, or fix a bug in basically the structure of your input. So it ends up being this very full stack experience where it's like, okay, well, webpages are complicated, so how do you know when it's done? It's actually quite difficult to know-

Anton

That was actually my next question. But before we get onto that part, still the question is, you've got these prompting techniques for helping the model understand, "Hey, I don't actually have this information," rather than hallucinate it. By providing it examples where it doesn't hallucinate, where it doesn't have that information in context, the model can basically pattern match on that, as you said.

Alex

Yeah. You know what, you're right. You're right. It's good to dig into that example. So let's say through some prompting tricks, or just rewriting outputs, manually correcting outputs, in the case where the name is provided in the chat, you fill it in, and in the case where it's not provided, you ask a question, and now you have some number of examples of each of these, ideally in diverse scenarios. And then you fine-tune on that, and now you're interpolating between these two, essentially two directions: "Do I have this information?", in which case I fill it in, and "Do I not have the information?", in which case I don't fill it in. And also in the mix there is constraint checking. So, "Do I fill a field with a value that is not derivable from the context?"
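
In dataset terms, the two "directions" might look like paired samples of the same form field with and without the needed information. This is a hypothetical format, not Minion's actual schema.

```python
# Hypothetical fine-tuning samples: same page, same field, two target behaviors
# depending on whether the value is derivable from the chat context.
samples = [
    {
        "chat": "User: I'm Jane Park, book me a room in Austin.",
        "page": "<input name='guest_name'>",
        "output": {"action": "fill", "target": "guest_name", "value": "Jane Park"},
    },
    {
        "chat": "User: Book me a room in Austin.",
        "page": "<input name='guest_name'>",
        "output": {"action": "reply", "value": "Sure, what name should I put on the reservation?"},
    },
]
```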

Anton

Yeah. See, this is one of the things that I think about a lot, because in my opinion, and I promise not to get into AGI discourse, but I have to briefly for this part, I think it's actually really important for us to figure out how we can get the models to understand what they know and what they don't know without having to positively train in these negative examples like you're describing. So what I mean by that is, you're forced to generate a data set to tell the model it doesn't know something, which is not really the way that humans work. We don't really need examples of things we don't know to understand that we don't know them.

Alex

I'm not sure. I'm not sure if that's true.

Anton

Yeah, I mean it feels that way to me anyway. I just think that this is a really fundamental capability, and the different approaches to thinking about it I think are actually really important to get the next gen applications as well.

Alex

Yeah, I guess one thing to consider there is, when you're in the land of fine-tuning, you get some amount of generalization. And so essentially what you're doing is you're saying, "How many examples do I need in order to generalize about this concept?" Name is one example that keeps popping up where we need to... But maybe the general case of, "I need to fill in some information, I don't have the information."

Anton

You don't generalize enough about that.

Alex

Yeah. Maybe it is more than just the name, you need that same pattern of what I know and what I don't know, you need it for addresses and phone numbers and et cetera, but at some point you start generalizing, when the model's able to say, "Oh, okay, instead of just making something up, I'll fill in what I have, or I'll ask a question." And so at some point it pops and then you have some confidence, and this is all tested with evals, so you know when you're succeeding, when you're failing, what percentage of the time you're succeeding and failing. But at some point, you're generalizing. And unfortunately, it's like, again, life is very complicated, and so-

Anton

The world is fractally complex.

Alex

Yeah. So it's like you can't say that, "We're generalizing about form-filling accurately, and so now we're done." You're constantly looking and saying like, "Okay, well, it broke here. Let's add an eval. Is this a recurring thing? How many times does it occur? Okay, do we need a custom data set? Can we do..." You go through the whole routine, "Can we prompt it? Do we need a data set? Is there an existing data set we can upsample, downsample," all this kind of stuff.

Anton

So we have two approaches here, we've got the in-context learning via prompting, and we've got fine-tuning to help the model understand what it should call out and when it should ask questions. The very next thing that you said here was, how do we know we're done? So how do we know we're done? How does the model know it's finished and should stop executing tasks?

Alex

Oh, that's a good question. So doneness for us is fairly open-ended. We're making an interactive, online app, so essentially our control loop is something like, we do stuff until we ask the user a question and then we stop. Or we do stuff until we send the user a message and then we stop. And the user then chooses to do something else or not. And so that keeps the control loop very simple, and it also means we don't ever have to really identify success/failure. That's more of an offline thing.
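
That control loop is simple enough to sketch: keep taking browser actions until the model does something user-facing, then stop and hand control back. The `browser` object and the `decide_next_action` helper (as in the earlier sketch) are assumptions.

```python
def run_until_user_turn(browser, chat, decide_next_action, llm, max_steps=25):
    """Take actions until the model replies to or asks the user something,
    then stop. There is no explicit 'done' state, just a hard step cap."""
    for _ in range(max_steps):
        page = browser.snapshot()            # assumed: returns compressed page content
        action = decide_next_action(page, chat, llm)
        if action["action"] == "reply":      # user-facing action: stop and wait
            chat.append(("assistant", action["value"]))
            return action["value"]
        browser.execute(action)              # assumed: performs the click / fill
    return "I'm stuck. Could you clarify what you'd like me to do?"
```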

Anton

It's on the user.

Alex

Yeah. I mean, you still want to say like, "Hey, you've booked your hotel. Congrats. Anything else you need?" And if the user trails off, then that's up to them. If there is something else that you need, then you're in that same control loop, so there's not really any done state.

Anton

So again, just a diversion here, but do you find, let's suppose we've gone through the flow of booking a hotel and now the user goes, "Oh, I also want to book a flight," you're still maintaining that entire hotel conversation and context, right? Because you don't know which parts of it to dump out or not.

Alex

Yep. Yep.

Anton

So how does that work? Because now the model, because you've got so much hotel information in it, it's thinking about hotels, and now the user wants it to book a flight. What happens?

Alex

Yeah, it's difficult. So this is where you get into... The important thing to realize is, when you're fine-tuning, you're actually learning how to pay attention. Not just-

Anton

That's interesting. That's a really good way of thinking about it.

Alex

Not just what to do. Well, I guess attention is part of what to do. So what you're paying attention to in the context is something that you're learning. And we see this now with... the crazy thing, and why I said earlier that the metaphor of a function approximator really is the best description that I have so far, is because, I don't know, for the last few years we've heard like, "Oh, LLMs can't do this. They can't do long context, they can't do multiple needles in the haystack, they can't do this, that, and the other thing." And then someone comes up with a data set, turns it into a model, and then the model does it. And so I'm very, very skeptical whenever anyone says the models can't do that. And so for a lot of these things, it's really just thinking creatively about the problem. So it's like okay, well, it's confusing. Here's two tasks. One thing you can try is randomly choosing two tasks and sticking them together so the execution is consistent for each one.

Anton

So this is for generating a data set, right?

Alex

Generating, yeah. And then there are cases, the hotel plus flight booking for instance, that are interesting because often the data bleeds between the two. And so what you really want is labelers to go through and accomplish these complex tasks. And then as you create more samples, you're learning how to pay attention to what's important.

Anton

Right. So let's zoom out a little bit, because all along the way there, we've talked about fine-tuning, we've talked about labeling data sets. Obviously, that means at some point, you've started using your own model. Are you running your own inference, or are you offloading to a fine-tuning API? When you say fine-tuning, what's actually going on?

Alex

We would if we could. It's really crazy. From my point of view, a lot of the choices that the larger labs make total sense to me now, whereas before they were more opaque.

Anton

Such as?

Alex

Well, it makes sense to have a very crisp idea of what is pre-training and what is post-training. Because your post-training is a lot more malleable, it's a lot faster iteration; with your pre-training, you want to have it as general as possible so that you can fine-tune for different scenarios as well. We did a bunch of reasoning work to improve the reasoning, so part of this is that you're thinking about what to do next. And so a bunch of techniques for how to reason about the state of the page, how to reason about the state of the chat, how to reason-

Anton

What do you mean by that? When you say how to reason, are you showing the model examples of how a human would reason about the content of a page? What does that mean?

Alex

More or less, yeah. It's like-

Anton

What does that data look like in practice?

Alex

So a common technique is that you're using an LLM to ask a bunch of questions, get the answers, and then as an offline step you're basically doing a whole bunch of reasoning: is this the right page? Do I have enough data? Is everything in an expected state? What are my next steps? What's the next logical thing to do? And each of those are variables, and there's many more. And so what you can do is ask all those questions offline and then decide what's relevant, and then train the model on that output. So now you're mirroring an elaborate offline reasoning process online, where you're able to say, "Okay, well, the page is not right," or, "I have the information I need, I know what to do next." Reasoning is an overloaded term. I would say it's more thinking about what to do with regards to the information that's available to you.
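
A sketch of that offline step: ask a battery of questions about each state, keep the answers, and train the model to emit a condensed reasoning trace along with the labeled action. The question wording and helpers here are assumptions.

```python
REASONING_QUESTIONS = [
    "Is this the page we expect to be on for the current task?",
    "Do we have all the information needed for the next step?",
    "Is the page in the expected state (no errors, nothing missing)?",
    "What is the next logical thing to do?",
]

def build_reasoning_sample(page, chat, labeled_action, llm):
    """Offline: answer each question separately, then package the answers as a
    reasoning trace the model learns to reproduce in a single pass online."""
    answers = [llm(f"{q}\n\nPage:\n{page}\n\nChat:\n{chat}") for q in REASONING_QUESTIONS]
    return {
        "input": {"page": page, "chat": chat},
        "output": {"reasoning": " ".join(answers), "action": labeled_action},
    }
```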

Anton

That sounds a lot like reasoning.

Alex

Yeah, I guess it's a messy distinction.

Anton

Let's continue down the fine-tuning road for a minute. So you mentioned LoRA, which is something that you unlock if you have an open source model, you have access to the weights. LoRA requires you to have access to the weights, unless someone's providing you with a LoRA API. You're basically only tuning a subset of the parameters, and the subset is chosen in this particular special way. It's like an approximation to the full space of the model. And we fine-tune that approximation, and then we somehow modify the full weights of the model accordingly. I think that's a pretty reasonable [inaudible 00:27:08].

Alex

Yeah. Okay. Maybe the easiest way to say it is, instead of changing all the weights of the model, you're making a diff. And then that diff is much more succinct, and so now you can move it around as a file, as opposed to multiple gigabytes that you need to worry about.
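
The "diff" framing maps directly onto the mechanics of LoRA: instead of updating a weight matrix W, you train a small pair of low-rank matrices whose product acts as an additive correction to W. A bare-bones numpy illustration, not tied to any particular framework (real LoRA also applies a scaling factor and targets specific layers):

```python
import numpy as np

d, r = 1024, 8                      # hidden size vs. a much smaller rank
W = np.random.randn(d, d) * 0.02    # frozen base weight
A = np.random.randn(d, r) * 0.02    # low-rank "diff", part 1 (trainable)
B = np.zeros((r, d))                # low-rank "diff", part 2 (trainable, starts at zero)

def forward(x):
    # Base behavior plus the low-rank correction; only A and B would be updated.
    return x @ W + x @ A @ B

# The adapter is tiny compared to the base weights, so it can be shipped and
# swapped around like a patch file.
print(W.size, "base params vs", A.size + B.size, "adapter params")
```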

Anton

Yeah, that's good.

Alex

And so there's been lots of cool work with swapping them out dynamically or having a bunch of memory with lots of reused memory. All very similar to how OSs work, by the way.

Anton

Yes. Yeah, things are converging in that direction. I think about that a lot right now, actually. And of course I think about especially the role of memory and storage subsystems when we start talking about these things as general purpose computers, which is my business.

Alex

Sure. Just to be clear, LoRAs are not as good. But the other aspect of it is, we're a small company. It's much more time and cost-effective to be doing LoRAs. And so in practice, we ended up using it for everything because it saves time and money.

Anton

Right. Gotcha.

Alex

It was especially useful when GPUs were harder to come by.

Anton

So let's talk about some of the data. So fine-tuning, LoRa, whichever way you go about this, you need data sets, and you need labeled data sets, whether you're using supervised training where you have positive labels of like, "Hey, this is more of the stuff that I want," or contrastive where you're able to do positive and negative labels. And we talked about two data sets, you mentioned human labeling and you mentioned labels coming from GPT-4 for GPT-3.5. Let's start with human labeling. How do you create this data set for something like Minion, which is taking consecutive actions, it's a very causal sort of thing that you want the model to do?

Alex

It might be useful for builders. Some of the intuition that I learned on Copilot, and I guess I was one of the earlier people working with these things by chance, and the intuition is like, it gets better. You can always make it better. And so the progression from Copilot was, we got this artifact from OpenAI, it was-

Anton

That was Codex, right?

Alex

No, it was a midsize model. It was trained on code that they had that they thought like, "Hey, maybe we can do something with this." And we tried a bunch of stuff with it, and sometimes it was right. About 6, 8% of the time you would ask it to generate a function with a test, and it would generate the function correctly. We would have to generate like 10 times and then run the test, then see which one was right and blah, blah, blah. And it's like oh, this is pointless. Try to make a UI around it where you'd pick the right one, or it would show you all the options, all this kind of stuff. And then the next version of the model was 15% of the time it would do the right answer. And it's just these little jumps all along the way, 15 to 30, 35, 40, 50, 60. And by the time we shipped Copilot about 12 months later, we had probably the craziest eval you can imagine, which is, I can get into it if you want, but it started with some sub 10%, and then up around 60%, 70% would be-

Anton

Well, I mean, what was that eval? It sounds crazy, I'm very curious about it.

Alex

So we would download a random GitHub repository that was in Python that had pytests. We'd take a random pytest that we could execute successfully. We would blank out the body, have the LLM write the body, rerun the test, and now you've got a true/false. And so in the beginning, it basically always failed, and then incrementally over time, better prompting, better models. At that time, we weren't fine-tuning, they were producing better models, we were improving prompting a lot, and towards the end, 60, 70% of the time, we were generating a working function body. And that's insane.
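
The shape of that eval is easy to sketch: find a passing pytest, blank out the body of the function it covers, have the model rewrite it, and rerun the test for a clean pass/fail. The `find_target`, `blank_function`, and `llm_complete` helpers are assumptions standing in for the real repo-handling code.

```python
import subprocess

def eval_one_repo(repo_dir, find_target, blank_function, llm_complete):
    """Hypothetical version of the eval loop described above."""
    test_file, source_file, function_name = find_target(repo_dir)

    # 1. Confirm the chosen test passes against the original code.
    if subprocess.run(["pytest", str(test_file)], cwd=repo_dir).returncode != 0:
        return None  # skip repos whose tests don't pass to begin with

    # 2. Blank out the function body and ask the model to rewrite the file.
    prompt = blank_function(source_file, function_name)  # assumed: returns code with an empty body
    source_file.write_text(llm_complete(prompt))         # assumed: returns the completed file

    # 3. Rerun the test: a pass counts as a success for the model.
    return subprocess.run(["pytest", str(test_file)], cwd=repo_dir).returncode == 0
```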

Anton

To go from nothing to 70% is pretty remarkable.

Alex

Yeah. That's the function approximator idea. You start out with a very coarse understanding of what the real function is, and over time you're refining, you're showing more relevant examples, so that it knows how to interpolate at every point, to the point where you can get very, very accurate. And so there's this whole, I'm sure in a year or so people will be talking about it, but there's an emotional curve to developing these kinds of things, where it's like you're stuck at 15% and it's like this is never going to work, there's no way out, and you just have to... That's when you try random stuff, right?

Anton

Yep. And again, everyone we've spoken to says the same thing.

Alex

Really?

Anton

Yeah, absolutely. When we spoke to Flo, and with Rahul, it's the same thing, it's like we couldn't get it to do something, so we just started trying things, and eventually one of them stuck and we doubled down on it, and it worked out.

Alex

Yeah, it's crazy.

Anton

And that really seems like a pattern in developing with these things.

Alex

Yeah, it's crazy because it's also a fundamentally different way of writing software. And it's interesting because there's some people that are able to make that transition and some people that aren't, from a deterministic system that you can reason about to something that is much more trial and error.

Anton

Yeah, and I also think that, look, some of my favorite work in this whole space right now, just working with LLMs in general, even though it's not a practical application, is some of the folks who are really pushing the edges of what the models do and are having Claude talk to itself and seeing when it finally wigs out and does something crazy, or simulate a fake web, stuff like that. Because I agree with you, this is a very different way of developing software, but I also think that it has this kind of pernicious thing. So all software has this problem, and it's the reason why you write bugs. All software has this problem where you have a mental model of what you expect the computer to do, and then when your mental model doesn't match what the computer's actually going to do, that's when you write a bug. With ordinary computers, at least things are deterministic; if you actually sit down and reason about it, because you have a spec of how the computer works, you're going to get to the right answer. In LLM land, no guarantees. All bets are off.

Alex

Yeah, sure.

Anton

Although it's good to develop an intuition like yours where you're saying, "Okay, well, clearly I need such and such an example so that it can interpolate between them and actually work it out." Let's go back to fine-tuning for a minute, because I'm very curious about that. So we've got human labelers, and don't feel the need to reveal any secret sauce, but you've got human labelers, presumably they're performing these tasks, you have some version of recording what they're doing on these websites, and then you're asking the model to basically make predictions in the same way that we make predictions-

Alex

That's gone through a bunch of iterations. Okay, so I guess the other crazy thing to think about, and it's like magic, basically, which is that the models help you make the models better.

Anton

Talk about that.

Alex

So for instance, for the first several iterations of labeling, what we would do is we would use the existing model, or the best model that we had, maybe not the production inference model, but whatever the best model we had was, and we'd ask it for a generation. The labeler would then say yes or no, and keep generating until the right thing happens. And then we get into a world where, okay, well, the labeler knows what it's supposed to do, but maybe the labeler can hint it a little bit in the right direction. So we did that for a while, where we'd give the labelers the suggestion, they'd say that's wrong, then they'd tell us what to do instead, then we'd regenerate and then use it. And eventually the model knows how to reason about these cases enough where you can literally just watch the labeler act, and then the reasoning makes sense.
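
The progression from model proposal to human accept, reject, or hint could be sketched roughly like this; the `labeler` interface and `llm` call are illustrative assumptions.

```python
def collect_labeled_step(page, chat, llm, labeler, max_attempts=3):
    """Generate with the current best model, let a human accept, reject, or
    hint, and keep the first accepted action as a training sample."""
    hint = ""
    for _ in range(max_attempts):
        suggestion = llm(f"Page:\n{page}\n\nChat:\n{chat}\n\nHint:\n{hint}")
        verdict, hint = labeler.review(suggestion)  # assumed: ('accept'|'reject', optional hint)
        if verdict == "accept":
            return {"page": page, "chat": chat, "action": suggestion}
    # Fall back to the labeler demonstrating the step directly.
    return {"page": page, "chat": chat, "action": labeler.demonstrate()}
```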

Anton

Gotcha.

Alex

You're able to backfill the reasoning because you've got a model that's seen a bunch of stuff.

Anton

So for the label datasets, for the fine-tuning and the LoRas, what's the scale of these things, roughly? How much label data do you need?

Alex

Oh, not much. Way less than you think.

Anton

Order millions, order hundreds?

Alex

No, no, no, no. I'd say probably, for us right now, it's always in flux because you're changing data, you're shrinking datasets, you're rewriting datasets, all this kind of stuff. So I think probably we have, I'm going to guess maybe 140,000 samples, of which we use 30.

Anton

Thousand or 30?

Alex

30,000.

Anton

Gotcha.

Alex

30,000. I mean, that's like a refining process.

Anton

So you generate 140,000 samples, you refine them to 30,000.

Alex

Yeah.

Anton

How do you do that?

Alex

And that's always in flux.

Anton

Yeah. What's the refining step?

Alex

All sorts of stuff. Sometimes it's humans going through and being like, "This is wrong. Do this better." And then we'll toss it back to labelers with that human advice on what to do better. Then you train a model to do that, and now you've got a human doing it, and now you've got an LLM doing it, and you calibrate the two, and you can do that at the... We care a lot about trajectories, so most LLM work these days is just single steps, but for us, we need both, so there's correctness at the step level and correctness at the trajectory level. And so we have graders for all of that, there's hallucination checkers, there's reasoning checkers, there's what else? Yeah, other stuff. There's lots of checkers.

Anton

Gotcha. So after generating the data set, you go through this refinement process, multi-step, got a bunch of things, you've got roughly 30K samples, and then, is that for a fine-tune or a LoRa, or-

Alex

That's both. We tend to use the same data sets for both.

Anton

Which is more expensive? I won't ask you directly what the cost is, but which is more expensive, the compute or the labeling? GPU hours cost money, labeling hours cost money. Which one ends up being more expensive, to generate the data set and then fine-tune on it?

Alex

I mean, at the scale that we're at now, the quality is still so high from humans that it's worth spending the money. As the model gets better, for instance, we're able to use self... We haven't talked about self-play at all, but that's a big part of it. So that's when you have the model steering itself, and you have a user model that's acting as a user would, and you're synthesizing tasks, because another part of this is synthesizing tasks. It's like, okay, let's say, not so much now because we're starting to see generalization effects, but let's say, okay, we want to handle Amazon. Let's use an LLM to generate some tasks. So an easy thing to do is generate a very full description of a task early on, and then have a labeler go and execute that task. But that actually introduces a bias, because most people don't know what they want all ahead of time. You have to extract it from them. And so then another refinement step is to introduce a user model, where you give it a blob of information, which is a task description, and then you say, "Hey, act like a human would and answer questions as they're asked." We tried all sorts of stuff. We tried different bios to introduce randomness. I'd still like to introduce a schizo bio where it just completely changes its mind all the time.
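
A rough sketch of that self-play setup: a user model holds the full task description (plus a persona for variety) but only reveals details when the agent asks, so the generated data doesn't assume users state everything up front. The prompts and the "[DONE]" convention are assumptions.

```python
def self_play_episode(task_blob, persona, agent_step, user_model, max_turns=20):
    """Agent and simulated user take turns; the user model answers from the
    hidden task description instead of volunteering everything at once."""
    chat = []
    for _ in range(max_turns):
        agent_msg = agent_step(chat)           # assumed: agent acts or asks a question
        chat.append(("assistant", agent_msg))
        if agent_msg.startswith("[DONE]"):     # assumed convention for task completion
            break
        user_msg = user_model(
            f"You are this person: {persona}\n"
            f"Your underlying goal (don't reveal it all at once): {task_blob}\n"
            f"Conversation so far: {chat}\n"
            "Reply to the assistant's last message the way a real user would."
        )
        chat.append(("user", user_msg))
    return chat
```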

Anton

I mean, there's a recent data set of the web-generated personas, there's like a million personas.

Alex

I saw that, yeah.

Anton

There's probably something like that in there, if I had to guess.

Alex

Yeah, it's interesting. In here, you might run into a limitation of what a black box model might be able to do. Often, they're not creative enough to approximate what a human would do.

Anton

Yep. Well, if they could, they would, right? But-

Alex

Yeah, exactly. They just may not be-

Anton

The limitations are there. They have to be.

Alex

Yeah. There's an implicit bias in training sets, which is, people publish stuff that tends to make sense.

Anton

Yes.

Alex

Right? The stuff that's on the internet kind of makes sense.

Anton

Yes.

Alex

Humans don't always make sense.

Anton

Not even to themselves. People don't necessarily even hold stable preferences from day to day.

Alex

This is a fundamental belief of mine, which is, in the Copilot days, we tried a lot of things along the lines of, tell us what you want and then we'll generate it, and we found that, I think, generally people don't really think ahead about what they're doing. I think people are much more like a React model where it's like you're-

Anton

I actually think that's one of the more powerful user interface paradigms that come out of AI. And I don't think you could do that with traditional software very well before. With traditional software, you have to guide people through this very specific flow, first, do X, then do Y, then do Z. And the human using that software has to know what they want, not just overall, but at the time where you're presenting them with your option. With AI, you can drill down in whatever direction you want to go.

Alex

Yeah, it's true there's a lot more variability. I mean, there's a lot more [inaudible 00:39:45].

Anton

But even think about using something like Midjourney, where you're generating an image and the first image it spits out may not be anything like what you actually imagined, but it's different in ways that you can now get a handle on, and you can move around in your own conceptual space, because it's this iterative interface. I think that's really, really interesting and underexplored.

Alex

Yeah, for sure. For sure. I mean, it's the same thing with code generation, right?

Anton

Yes.

Alex

It doesn't always do what you want, but sometimes it's right.

Anton

Yeah. And sometimes it does help you reason about, even when it's wrong, it helps you reason about the thing [inaudible 00:40:12].

Alex

Yeah. I mean, there's a very deep question here about, I'm able to evaluate code, or an image, as to whether it's correct or not, much faster than I can search the space of all possible things, and so-

Anton

It's the generator-discriminator gap, it's classic, right? In ML, and it turns out probably cognitively, it's classic.

Alex

Yeah, it's interesting.

Anton

Stepping back and just thinking about this space in general, one of the things that I think is really important to think about as an AI application developer is, what's your plan for as the models get smarter? So I'll put that question to you. The models are going to get generically smarter over time. What does that mean for Minion?

Alex

Oh, yeah, that's a good question. What does that mean for Minion? I think if we do it right, we can ride the generalization, which I'm most excited about. For instance, one of the early findings is that we have some Amazon data, we don't have any Etsy, eBay, Shopify, Flipkart, any data on that. But the Amazon data improves the performance on all those other sites. And so that's really exciting. That means that you're generalizing. And so my hunch is that the web generalizes, which is extremely exciting, because that means that you can imagine an agent that is reliable for scenarios that are in the long tail.

Anton

Yep. But one can also imagine a future here where the web just goes away.

Alex

Yep. I think they're related. In some sense, the function-calling approach that Apple is trying to do is something that's more... It's a constrained space. I know these apps that I have installed, I know these APIs that they have, pick the right one. The web is much more broad. Here I find myself on a random lawyer's website and I need to fill out a form, or I need to find the form to fill out, or I need to decide I've looked in all the obvious places for the hours that this restaurant is open and I can't find it, so what do I do? One of the early positions we took with Minion is that... because most of the things that people call agents are API calling, so there's a bunch of APIs, you have the LLM call them, you insert the output into the prompt, and then you regenerate. And so my hunch early on was that these APIs are so disparate. You've got one API that does a chunk of behavior, and then you've got another API that does another chunk of behavior. And so what's to say that there's not a gap between those two APIs where I can't do certain things? Whereas the web solves that, because the web is what's being... Maybe another way to put it is that the APIs that are exposed in the world are less functional than what you can do on the open web.

Anton

Yep. That's true.

Alex

If you can do it, it's on the web, more or less, whereas it's unclear with APIs. And so even if there's a little bit of a discord there, that means you've got to fill in all the gaps by manually writing APIs, and figuring out where those gaps are. So I guess for me, with the web in general, it's easier to interpolate between what's possible.

Anton

Right. Because there are more possibilities.

Alex

Yeah, yeah. It's like I'm not trying to bucket into these two APIs that I have, that both make sense. Instead, there's always alternate paths I can go down that maybe accomplish the same thing.

Anton

All right, two final questions. First one is, you've been working on this for pretty much as long as anyone, and when I say this, I mean AI applications.

Alex

Sure.

Anton

Probably as long as anyone. 2016 is a very early start to be trying to build something like this.

Alex

There's people who have been doing it much longer.

Anton

Yeah, of course, but it's like the ecosystem's tiny and baby, but it's growing now. So as a relative proportion, there are fewer people who've been at it for this long.

Alex

Oh yeah. And I have a total imposter syndrome because when I got into it in 2016, everyone had a PhD.

Anton

You don't need one anymore, thank God.

Alex

Yeah, I don't think you need one anymore. Maybe you never did. In fact, the reason I stopped working on it right before Transformers came out, and only picked it up again in 2020, was because I was like, "Well, I can't make progress on this without a PhD. This is stupid. I need a math PhD to make progress here." And that was wrong. I should have just stuck with it.

Anton

The world has changed very much since then.

Alex

Yeah.

Anton

Look, that's really the big explosion here is, you no longer have to be a part of a large AI research lab to pick up and work with this technology. It's as general, as ubiquitous as the web itself. I can hit an API or I can run even a local model these days and build something. So my question is something like, in hindsight, from where you are now, what decision would you have made differently or earlier, given your current knowledge? Specifically with respect to how you're building Minion.

Alex

The number of people that will look at 5% performance and say, "It doesn't matter, we'll get to 70 or 80 or 90," is very few. So stick through it. Stick with it. Unlike other software, where you make it and then it either works or it doesn't, there's this element of goodness. And in some sense, that's why you've seen AI be mostly true believers. Because you have to pound at it, think creatively, lots of setbacks. I try a thing that I'm sure is going to work, and it doesn't work.

Anton

But I think part of that is also the fact that you do see it get better. And that's how you get that kind of true belief in the first place, is like, "Wow, this actually does improve if I keep hacking at it."

Alex

No, it's crazy that when you hear old, crazy theories from OpenAI people, you're like, "Oh wait, I get it now." It's like the idea that-

Anton

Well, especially when you're doing a pre-training run for very large models and you can literally see its capabilities improving, it gets much easier to believe that their capabilities will continue to improve.

Alex

Yeah, yeah. Or even the crazier stuff, like it's the future sending some pathway through the past to align these things. Because that's what it feels like, is like you're watching these things generalize, where generalization means I haven't shown it what to do, but it does the right thing, and that starts small, but I think it snowballs.

Anton

So last one, let's assume that I've decided I want to build something. What's the giant sticky-outy piece of advice that you would give that person? What should I know, from an industry veteran?

Alex

Ah, okay. No one knows what the answer is.

Anton

Yeah, I think that that's right.

Alex

No one knows, so-

Anton

It's too early.

Alex

It's too early, but it's also, again, you're dealing with language, which is infinitely complex, and now more and more we're dealing with the real world, which is even more infinitely complex. And so it's unclear to me that the answer is knowable. There's no one way to be, there's no one way to live, there's no one way to believe. And so I don't think there's a single way to build AI software. It's really very contextual.

Anton

Right. Makes sense. Cool. Thanks a lot.

Alex

Yeah, sure. Cool. No, good discussion.
