Alex on X (Twitter): @alexgraveley
Anton
That's really the big explosion here is, you no longer have to be a part of a large AI research lab to pick up and work with this technology. It's as general, as ubiquitous as the web itself. I can hit an API or I can run even a local model these days and build something.
Alex
Many people thought it was a waste of time to work on Copilot. We had a tiny, tiny little team. I mean, I had been tracking AI stuff for years at this point, and so I jumped at it. And no one else wanted to do it, no one else believed in AI at that time. And most people just thought AI apps didn't work.
Anton
This is our series, Building With AI. It's intended to help application developers actually figure out what it takes to build something useful with LLMs or other types of AI models. There's not much content out there which is actually practical, useful advice for how to build AI applications. There's a lot of speculation, there's a lot of talk about the models and what they could do, but we're much more interested in application development. I think a great place to start would be for you to introduce yourself, talk a little bit about Minion.
Alex
Sure. Yeah. I'm Alex Graveley. Let's see, I've been doing AI apps since, I guess the first was in 2016, maybe a year before Transformers came out. Back then we were using sequence-to-sequence models and much, much smaller context windows, but the premise was there. You could translate stuff, and not just English to French, but one piece of text to another piece of text. And so I played around with that for a long time and made my first system, which was for these personal assistants that were chatting with people, going on the web, and doing tasks for them, so things like managing email and calendar, buying stuff, paying bills, that kind of thing. And so at the time, all we could really do was take chat context, put it into this last-generation model, get a suggestion out, and then show it to the human, who would then edit it or decide what to do. It helped teach people, because these assistants had varying levels of experience being a personal assistant. And so one really interesting aspect is that it would give people canned responses, so you didn't have to teach them as much about how to respond, or how to be nice when someone's being mean, or how to formulate an answer about the future like, "Okay, I'll get back to you in an hour, blah, blah, blah, blah." What I couldn't figure out was how to interact with the real world at that time.
Anton
This was in 2016?
Alex
2016, yeah. So then I left and worked on some other small startups. One that was AI-based was hCaptcha; I shipped the front end for hCaptcha, which was the first reCAPTCHA competitor, and it now has some insane amount of traffic. And in fact, I think they're either competitive with or have more traffic than reCAPTCHA, which is insane.
Anton
They're being used to refine image diffusion models from what I've seen, right?
Alex
They do a bunch of stuff, yeah. I mean, it was really set up as a two-sided marketplace, where you could get your labeling done and then humans would prove themselves, and you'd use-
Anton
Yeah, I mean, that was the reCAPTCHA approach originally too.
Alex
Yeah, except that the labeling part was only for Google employees.
Anton
Yeah, right, sure.
Alex
And so that was the idea... And I think now, it turns out with the image models, you can just generate images and have people label them, and then you don't even need the other side of the marketplace, right?
Anton
Yes.
Alex
So it becomes an even more awesome business.
Anton
So hCaptcha and then-
Alex
Yeah, then I worked on a cryptocurrency called MobileCoin, which is still built into Signal, and it's still really awesome tech. I don't think it got the attention it deserves, but so be it. All right, so in 2020 I ended up at GitHub, and the opportunity came up to work with OpenAI, who hadn't really exploded into notoriety yet. I mean, I had been tracking AI stuff for years at this point, and so I jumped at it. I was like, "Please let me go work on..." And no one else wanted to do it. No one else believed in AI at that time. Many people thought it was a waste of time to work on Copilot. We had a tiny, tiny little team, and most people just thought AI apps didn't work. That was the prevailing understanding. They just never work: they either don't work well, or the scope is too narrow, or their features are not something you can make money on.
Anton
Yep. Microsoft's come a long way since then.
Alex
Yeah.
Anton
I got a whole button for Copilot now.
Alex
I feel like that's important. I don't know. I feel like it's kind of cool. That's my favorite part of the outcome of Copilot, is there's a button now.
Anton
There's a button now, yeah. I mean, again, it reminds me, my favorite metaphor for where we're at right now in AI is the early web, and it does remind me of when keyboards had the web button, which had-
Alex
That's a really good point.
Anton
Yeah, it's just like it, isn't it?
Alex
Yeah. No, you're right.
Anton
Anyway, so obviously Copilot is huge, it's probably my number one most used AI application currently, next to the various chats, I have them compete against each other to give me the best answer, but let's move one step beyond that. Let's talk about Minion.
Alex
Sure. Yeah, so after Copilot GA'd, I left to go do this old idea, which is, can I have a computer do assistant tasks for people? I tried with humans acting as assistants and operating web browsers, and so I knew the sort of tasks that people want to do and the complexities involved, and it's been something I've wanted to tackle for years now. It's a good time to do it because the models were finally good enough, basically. The context windows were just starting to be large enough.
Anton
That's a great starting point actually, to get into some of the weeds here. So why does a context window length matter for enabling something like Minion?
Alex
Well, I guess it depends. I mean, right now, it doesn't. Now we've moved that-
Anton
Now it's blown wide open, right?
Alex
I like the metaphor of the arch, because you build up the arch and then you put a keystone in, and then the arch is freestanding. But you need the scaffolding around the arch in order to build it up so that it's stable enough to put the keystone in. And a lot of these kinds of technologies are enabling scaffolding. So what Minion is, it's a chat interface. You go and you tell it, "I need to book a hotel." It'll go search Google, ask you some questions, go to Expedia or Booking.com or whatever you tell it, fill out the form, present some options, check everything through, and then ask you to pay, or if you have other options, you can tell it to do something else. And what that is, basically, is you're taking a snapshot of the webpage, you're taking a snapshot of the chat, you're feeding them into an LLM, and the LLM is telling you what to do next. So that's either clicking on something or filling in a form or replying to the user. There are a few others, but that's the crux of it. And it turns out webpages, especially on the large-scale websites like Amazon, are gigantic. They're half a megabyte, which is a lot of tokens, way beyond what the context size was at the time. And so our first attempt was, okay, let's just take all of the interactive fields out of the HTML, and then we will essentially re-rank... We'll just rank them, essentially.
Anton
So you traverse the HTML to extract those fields?
Alex
Yeah. So we pull buttons and entry fields and this kind of stuff, and then essentially rank them and then take the action that was on the top. And that essentially fit into context.
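To make that extraction step concrete, here's a minimal sketch using BeautifulSoup; the element types and attributes kept are illustrative assumptions, not Minion's actual pipeline.

```python
# Minimal sketch of pulling interactive elements out of a page so they
# fit in a small context window. Illustrative only.
from bs4 import BeautifulSoup

def extract_interactive_elements(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for tag in soup.find_all(["a", "button", "input", "select", "textarea"]):
        elements.append({
            "tag": tag.name,
            "text": tag.get_text(strip=True)[:80],   # truncate long labels
            "id": tag.get("id"),
            "name": tag.get("name"),
            "type": tag.get("type"),
        })
    return elements
```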
Anton
So that was basically using the model as more or less a classifier or some sort of ranking model.
Alex
Yeah, more like a ranker. You're basically like, "Here's a bunch of options, pick one." And in fact, the context sizes were so small at that time, what you had to do was make several calls, say, "Here's 10 options or 20 options."
Anton
Binary searcher [inaudible 00:07:49]-
Alex
Yeah, binary-search your way to an answer, which sucks.
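A hedged sketch of that batched, tournament-style ranking, where `pick_best` is a hypothetical wrapper around an LLM call that returns the index of the best candidate in a small batch:

```python
# Score small batches of candidates, then compare the winners, because
# everything doesn't fit in one context window at once. Illustrative only.
from typing import Callable, Sequence

def tournament_rank(chat: str,
                    candidates: Sequence[dict],
                    pick_best: Callable[[str, Sequence[dict]], int],
                    batch_size: int = 10) -> dict:
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            winners.append(batch[pick_best(chat, batch)])  # one LLM call per batch
        pool = winners
    return pool[0]
```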
Anton
Well, of course. And the LLM's not guaranteed to have transitive preferences either.
Alex
Yeah, yeah. And when you're taking these things out of context, you just lose everything.
Anton
Yeah. Yep.
Alex
So it's like if you evaluate a bunch of sentences in isolation, they're very different from what you might see when reading an article, say.
Anton
Yes.
Alex
And so that was the first step, and then the second step was, okay, right around that time, longer-context models started to come out. We did a whole bunch of work to compress the DOMs down as much as possible, and then we were able to just get them into the long-context models at the time. And then we could say, "Okay, we'll pick the action to take."
Anton
Were you running your own model or was this say over OpenAI or-
Alex
This was OpenAI. Yeah, this was OpenAI at that time. And so that worked a lot better. But the important thing is it gave us good enough performance that we could start to build intuition about, given a large page, what should we focus on? Which parts of the page should we focus on? And so, back to the scaffolding metaphor: over time we built up enough of a data set that you can then just look at a page in its entirety, pick the parts that are important, and then send those to an LLM to decide what to do. So essentially you're chunking on the fly, and then picking which chunks to use, still with the framing of: here's the page, here's some chat, pick the next action. And so that long context was really important, because being able to reason over long contexts was a big unlock for sure.
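Here's a rough sketch of that on-the-fly chunking and selection idea; `score_chunk` is a hypothetical relevance scorer and the word-count token budget is a crude stand-in, not Minion's actual implementation:

```python
# Split a compressed DOM into chunks, score each for relevance to the
# chat, and keep the highest-scoring chunks that fit the budget.
from typing import Callable

def select_chunks(dom_text: str,
                  chat: str,
                  score_chunk: Callable[[str, str], float],
                  chunk_size: int = 400,
                  token_budget: int = 3000) -> list[str]:
    words = dom_text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    ranked = sorted(chunks, key=lambda c: score_chunk(chat, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())     # crude stand-in for a token count
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```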
Anton
So let's take maybe a step back here. Could you describe some of the core loop of how Minion works in practice? So my understanding is Minion basically drives a web browser for you in order to interactively achieve the task that you're trying to perform without you as the human having to actually navigate all these websites, right?
Alex
Sure.
Anton
So what's the core loop here? You've already mentioned this point where Minion sees a website and it processes, practically, the DOM, or you've extracted the most relevant components or compressed it down so that it actually fits inside the LLM. So what actually happens? From the point where a user requests a task or makes a query, what goes on?
Alex
What goes on? So ours is a very simple system. I have very strong opinions about, you have to keep these things simple. You're dealing with language. Language is infinitely complex. Every single word that you add changes the meaning drastically.
Anton
Yes. Also for the LLMs.
Alex
Yeah, also for the LLMs, because navigating this complex space, and so a word changes the part of the manifold that you're in.
Anton
Right. I've always felt that, we talk about controlling these things in terms of prompting, and obviously even when you're asking, say, GPT or your own model, which we'll get to in a minute, to digest the contents of a website, you're prompting it to do so. We're attempting to steer these computational machines in natural language, which, to me, has felt since the start like trying to pick a lock with a wet noodle, more or less. We're not really steering them, we're just trying to condition them in the way that we want. Keeping it simple is probably the single biggest lesson from everyone we've spoken to, and in my own experience as well, it's just like, don't try to build too much at once.
Alex
The smaller-
Anton
The smaller, the better.
Alex
Yeah. Just think in terms of spaces. Shrink the space down. Whatever you can do to shrink the space. And every time you're like... There's uncertainty at every step, so you get this compounding uncertainty where it's like if you're traversing through a pipeline of LLM calls, each one of them is right maybe 90% of the time, but by the end, you've got slop.
Anton
Right. So what are the steps? I ask Minion to do something for me, it's got my request, then what?
Alex
So you're on a start page, it's got a search box. You say, "Hey, I want to book a hotel." So it says, "Okay, the current page has a link for booking hotels. I'll click that link." Now you're on Expedia. So now we pull the content from Expedia, we decide which parts are useful. The LLM gets that as input and says, "Okay, in the past, the user asked me to book a hotel, I've clicked on a link that brings me to Expedia, I'm now on Expedia, here's the content, what should I do next? Oh, I see a date field and a location field and whatever. I don't have the information to fill those out. So what do I do? Okay, I'm going to respond to the user and say, 'Okay, sure. When do you want to go? Where do you want to go?'" And so you can imagine that through-
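A minimal sketch of one step of that loop, assuming a hypothetical `llm` callable and an illustrative JSON action schema (not Minion's real format):

```python
# One decision step: condensed page snapshot + chat history go in, the
# next action comes out as JSON. Illustrative only.
import json
from typing import Callable

ACTIONS = {"click", "fill", "reply_to_user"}

def decide_next_action(page_snapshot: str, chat_history: list[str],
                       llm: Callable[[str], str]) -> dict:
    chat = "\n".join(chat_history)
    prompt = (
        "You are operating a web browser for a user.\n"
        "Chat so far:\n" + chat + "\n\n"
        "Current page (condensed):\n" + page_snapshot + "\n\n"
        'Reply with JSON: {"action": "click|fill|reply_to_user", '
        '"target": "...", "value": "..."}'
    )
    action = json.loads(llm(prompt))
    assert action["action"] in ACTIONS
    return action
```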
Anton
Multiple steps.
Alex
Multiple steps.
Anton
Until we complete the task, right?
Alex
Exactly.
Anton
So there's already three interesting things in the way that you've described it that I want to drill down more on.
Alex
Okay.
Anton
So the first thing I was going to ask you is, you mentioned early on it's like even... Let's use this booking a hotel or a flight example, right?
Alex
Sure.
Anton
You're saying that somehow Minion remembers the user's preferences. How is that recalled and injected into context?
Alex
Yeah, so right now our preferences are... we've so far just been focused entirely on, "Do the right thing on an arbitrary website." So we're not worried about preferences yet, although we do have some preferences like your name, your email, this kind of stuff. Doing preferences, I think, is a much more open question. So how we do that, I don't know yet.
Anton
Yeah, I've heard different versions of this. We spoke to Flo from Lindy recently, and he gave us a metaphor of having a human almost train the machine, and then the machine is able to recall, "What did the human ask me to do last time?" There's plenty of ways to approach this.
Alex
There's a lot of ways to approach it. Okay, there's two realistic ways you can do it. You can either go through the conversation and say, "Okay, what kind of stuff do I need to extract? Extract a JSON of key-values from this conversation that I should update." And then maybe you've got other key-values that you've had before, and now you layer the new ones on top. But that's not how humans work. Humans are complicated, and the world is complicated.
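A minimal sketch of that first approach, assuming a hypothetical `llm` callable that returns a JSON string; the field names are illustrative:

```python
# Extract key-value preferences from the conversation and layer them
# over whatever is already stored. Illustrative only.
import json
from typing import Callable

def update_preferences(stored: dict, conversation: str,
                       llm: Callable[[str], str]) -> dict:
    prompt = (
        "Extract user preferences from this conversation as a flat JSON "
        "object of key-value pairs (e.g. preferred_airline, home_airport). "
        "Return {} if there are none.\n\n" + conversation
    )
    new_prefs = json.loads(llm(prompt))
    return {**stored, **new_prefs}   # new values layered on top of old ones
```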
Anton
That's right. The world is fractally, fractally complex.
Alex
Yeah. And so maybe I want to use a certain airline like Southwest when I'm flying to a certain part of the country, and maybe when I'm flying overseas, I prefer something else. And there's some interesting work actually on creating these taxonomies, but yeah.
Anton
The second thing that I was curious about is, you said that when the agent or the model realizes it doesn't have certain information, it'll ask the user. How does it detect that?
Alex
For whatever reason, the magic of LLMs is kept out of popular discussion, I think for various reasons that I don't fully understand, but I think it's useful to have a simple metaphor for understanding what's going on. And I lean on this metaphor now. I had a rough theory of this metaphor in the Copilot days, and I more strongly believe in the metaphor now, which is LLMs are function approximators. They're approximating a function based on the examples that you show it. If you give it an example that is not what it's seen before, it's going to find roughly the closest two examples and interpolate between them, something like that.
Anton
Something like that. Francois Chollet also has that perspective, right? Vector programs essentially.
Alex
Yeah. And I think that's fine. It's unclear to me that the big question is interpolation versus extrapolation. I don't think it's really that interesting.
Anton
I actually agree with you here completely. I think that actually which one it is isn't that informative for how we build with them, but it might be informative to how we think about them, and that's very helpful. So as you're saying, they're interpolating function approximators for the purpose of you reasoning about them in this particular task.
Alex
Yeah. Literally in this example.
Anton
Yes.
Alex
So the way that you get the output that you want is to either essentially prompt your way into it, so you're forcing the behaviors that you want, or, the more classic approach, you're fine-tuning and essentially giving it more examples. So you're saying, "Here's an example of what I want. Here's an input, here's an output. Add this to your knowledge of how to interpolate when you see it at inference time." So early on, it would either just make up names, make up an email address, it wouldn't ask anything, all this kind of... And so you have to work through each one of those. Each one of those is a data set. So it's like, "Okay, well," it makes up... It always fills in John Doe, it's very annoying. How do we make it not fill in John Doe? Okay, well, there's some prompting hacks we tried that worked for a while, but-
Anton
What were some of those, if you don't mind?
Alex
Oh, yeah. A good prompting hack was like user information, name, bracket, undefined. So the LLM says, "Oh, well, okay, clearly I don't have the user's name, here it is telling me that it's undefined. I should ask." And so that was a prompting hack that then enables you to generate more training data.
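A sketch of what that hack could look like as prompt construction; the field names are illustrative, not Minion's actual prompt:

```python
# Render known user fields into the prompt and explicitly mark missing
# ones, so the model asks instead of inventing a value. Illustrative only.
def render_user_info(user: dict) -> str:
    fields = ["name", "email", "phone"]
    lines = ["User information:"]
    for field in fields:
        value = user.get(field) or "[undefined]"
        lines.append(f"  {field}: {value}")
    return "\n".join(lines)

# e.g. render_user_info({"email": "a@example.com"}) produces:
#   User information:
#     name: [undefined]
#     email: a@example.com
#     phone: [undefined]
```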
Anton
So you have an in-context example of-
Alex
Yeah, we did at some point. I don't think that we do anymore, but-
Anton
Okay. And then you moved off of that and what do you do now?
Alex
For some information, it's still in context. But I guess my point is, what you're trying to do is use all these tools to get the output that you want. Sometimes it's prompting, sometimes it's prompting to generate a data set to fine-tune a model so that you get the output you want without needing the prompting, sometimes it's both. Sometimes, and it's rarer now that we have much better support for diverse HTML content, you have to go in and hack something, like add a hint about what a control does, or fix a bug in basically the structure of your input. So it ends up being this very full-stack experience where it's like, okay, well, webpages are complicated, so how do you know when it's done? It's actually quite difficult to know-
Anton
That was actually my next question. But before we get onto that part, the question is still: you've got these prompting techniques for helping the model understand, "Hey, I don't actually have this information," rather than hallucinate it. By providing it examples where it wouldn't have that information in context, the model can basically pattern match on that, as you said.
Alex
Yeah. You know what, you're right. You're right. It's good to dig into that example. So let's say through some prompting tricks, or just rewriting output, manually correcting outputs, in the case where the name is provided in the chat, then you fill it in, and in the case where it's not provided, you ask a question, and now you have some number of examples of each of these, ideally in diverse scenarios. And then you fine-tune on that, and now you're interpolating between these two, essentially two directions: "Do I have this information? Then I fill it in. Do I not have the information? Then I don't fill it in." And also in the mix there is constraint checking. So, "Do I fill a field with a value that is not derivable from the context?"
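A sketch of what those two kinds of fine-tuning examples might look like side by side; the schema and values are made up for illustration:

```python
# One example where the name is present in the chat (fill it in) and one
# where it's missing (ask the user). Illustrative schema only.
examples = [
    {
        "input": "Chat: My name is Priya Shah.\nPage: form field 'Full name'",
        "output": {"action": "fill", "target": "Full name", "value": "Priya Shah"},
    },
    {
        "input": "Chat: Please book the earliest slot.\nPage: form field 'Full name'",
        "output": {"action": "reply_to_user",
                   "value": "What name should I put on the booking?"},
    },
]
```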
Anton
Yeah. See, this is one of the things that I think about a lot, because in my opinion, and I promise not to get into AGI discourse, but I have to briefly for this part, I think it's actually really important for us to figure out how we can get the models to understand what they know and what they don't know without having to positively train in these negative examples like you're describing. So what I mean by that is, you're forced to generate a data set to tell the model it doesn't know something, which is not really the way that humans work. We don't really need examples of things we don't know to understand that we don't know them.
Alex
I'm not sure. I'm not sure if that's true.
Anton
Yeah, I mean it feels that way to me anyway. I just think that this is a really fundamental capability, and the different approaches to thinking about it I think are actually really important to get the next gen applications as well.
Alex
Yeah, I guess one thing to consider there is, when you're in the land of fine-tuning, you get some amount of generalization. And so essentially what you're doing is you're saying, "How many examples do I need in order to generalize about this concept?" Name is one example that keeps popping up where we need to... But maybe the general case of, "I need to fill in some information, I don't have the information."
Anton
You don't generalize enough about that.
Alex
Yeah. Maybe it is more than just the name; you need that same pattern of what I know and what I don't know for addresses and phone numbers, et cetera, but at some point you start generalizing, when the model's able to say, "Oh, okay, instead of just making something up, I'll fill in what I have, or I'll ask a question." And so at some point it pops, and then you have some confidence. And this is all tested with evals, so you know when you're succeeding, when you're failing, what percentage of the time you're succeeding and failing. But at some point, you're generalizing. And unfortunately, again, life is very complicated, and so-
Anton
The world is fractally complex.
Alex
Yeah. So it's like you can't say that, "We're generalizing about form-filling accurately, and so now we're done." You're constantly looking and saying like, "Okay, well, it broke here. Let's add an eval. Is this a recurring thing? How many times does it occur? Okay, do we need a custom data set? Can we do..." You go through the whole routine, "Can we prompt it? Do we need a data set? Is there an existing data set we can upsample, downsample," all this kind of stuff.
Anton
So we have two approaches here, we've got the in-context learning via prompting, and we've got fine-tuning to help the model understand what it should call out and ask for questions. The very next thing that you said here was, how do we know we're done? So how do we know we're done? How does the model know it's finished and should stop executing tasks?
Alex
Oh, that's a good question. So doneness for us is fairly open-ended. We're making an interactive online app, so essentially our control loop is something like: we do stuff until we ask the user a question, and then we stop. Or we do stuff until we send the user a message, and then we stop. And the user then chooses to do something else or not. And so that keeps the control loop very simple, and it also means we don't ever really have to identify success/failure. That's more of an offline thing.
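A minimal sketch of that outer control loop, with hypothetical `snapshot`, `execute`, and `decide` callables standing in for the browser and the model:

```python
# Keep taking browser actions until the agent says something to the user,
# then stop and hand control back. Illustrative only.
from typing import Callable

def run_until_user_turn(snapshot: Callable[[], str],
                        execute: Callable[[dict], None],
                        decide: Callable[[str, list[str]], dict],
                        chat_history: list[str],
                        max_steps: int = 20) -> str:
    for _ in range(max_steps):
        action = decide(snapshot(), chat_history)
        if action["action"] == "reply_to_user":
            chat_history.append("assistant: " + action["value"])
            return action["value"]      # stop: it's the user's turn now
        execute(action)                 # click or fill, then loop again
    return "I got stuck; could you tell me more about what you need?"
```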
Anton
It's on the user.
Alex
Yeah. I mean, you still want to say like, "Hey, you've booked your hotel. Congrats. Anything else you need?" And if the user trails off, then that's up to them. If there is something else that you need, then you're in that same control loop, so there's not really any done state.
Anton
So again, just a diversion here, but do you find, let's suppose we've gone through the flow of booking a hotel and now the user goes, "Oh, I also want to book a flight," you're still maintaining that entire hotel conversation and context, right? Because you don't know which parts of it to dump out or not.
Alex
Yep. Yep.
Anton
So how does that work? Because now the model, because you've got so much hotel information in it, it's thinking about hotels, and now the user wants it to book a flight. What happens?
Alex
Yeah, it's difficult. So this is where you get into... The important thing to realize is that when you're fine-tuning, you're actually learning how to pay attention. Not just-
Anton
That's interesting. That's a really good way of thinking about it.
Alex
Not just what to do. Well, I guess attention is part of what to do. So what you're paying attention to in the context is something that you're learning. And we see this now with, the crazy thing, and why I said earlier that the metaphor of a function approximator really is the best description I have so far, is because, I don't know, for the last few years we've heard, "Oh, LLMs can't do this. They can't do long context, they can't do multiple needles in the haystack, they can't do this, that, and the other thing." And then someone comes up with a data set, turns it into a model, and then the model does it. And so I'm very, very skeptical whenever anyone says the models can't do something. And so for a lot of these things, it's really just thinking creatively about the problem. So it's like, okay, well, it's confusing, here's two tasks. One thing you can try is randomly choosing two tasks and sticking them together so the execution is consistent for each one.
Anton
So this is for generating a data set, right?
Alex
Generating, yeah. And then there are cases like the hotel plus flight booking, for instance, which is interesting because often the data bleeds between the two. And so what you really want is labelers to go through and accomplish these complex tasks. And then as you create more samples, you're learning how to pay attention to what's important.
Anton
Right. So let's zoom out a little bit, because all along the way there, we've talked about fine-tuning, we've talked about labeling data sets. Obviously, that means at some point, you've started using your own model. Are you running your own inference, or are you offloading to a fine-tuning API? When you say fine-tuning, what's actually going on?
Alex
We would if we could. It's really crazy. From my point of view, a lot of choices that the larger labs do make total sense to me now. Whereas before they were more opaque.
Anton
Such as?
Alex
Well, it makes sense to have a very crisp idea of what is pre-training and what is post-training. Because your post-training is a lot more malleable, it's a lot faster iteration, your pre-training is, you want to have it as general as possible so that you can fine-tune for different scenarios as well. We did a bunch of reasoning work to improve the reasoning, so part of this is that you're thinking about what to do next. And so a bunch of techniques for how to reason about the state of the page, how to reason about the state of the chat, how to reason-
Anton
What do you mean by that? When you say how to reason, are you showing the model examples of how a human would reason about the content of a page? What does that mean?
Alex
More or less, yeah. It's like-
Anton
What does that data look like in practice?
Alex
So a common technique is that you're using an LLM to ask a bunch of questions and get the answers as an offline step, where you're basically doing a whole bunch of reasoning: is this the right page? Do I have enough data? Is everything in an expected state? What are my next steps? What's the next logical thing to do? And each of those are variables, and there's many more. And so what you can do is ask all those questions offline and then decide what's relevant, and then train the model on that output. So now you're mirroring an elaborate offline reasoning process online, where you're able to say, "Okay, well, the page is not right," or, "I have the information I need, I know what to do next." Reasoning is an overloaded term. I would say it's more thinking about what to do with regard to the information that's available to you.
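A rough sketch of that offline reasoning step; the question list and the `llm` callable are illustrative assumptions:

```python
# Ask a fixed set of questions about the state offline, then fold the
# answers into the training target alongside the chosen action.
from typing import Callable

QUESTIONS = [
    "Is this the page I expected to be on?",
    "Do I have all the information I need from the user?",
    "Is everything on the page in the expected state?",
    "What is the next logical thing to do?",
]

def build_reasoning_target(page: str, chat: str, action: dict,
                           llm: Callable[[str], str]) -> dict:
    context = chat + "\n\n" + page
    answers = []
    for question in QUESTIONS:
        answers.append(question + " " + llm(context + "\n\n" + question))
    return {
        "input": context,
        "output": {"reasoning": " ".join(answers), "action": action},
    }
```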
Anton
That sounds a lot like reasoning.
Alex
Yeah, I guess it's a messy distinction.
Anton
Let's continue down the fine-tuning road for a minute. So you mentioned LoRA, which is something that you unlock if you have an open source model, if you have access to the weights. LoRA requires you to have access to the weights, unless someone's providing you with a LoRA API. You're basically only tuning a subset of the parameters, and the subset is chosen in this particular special way. It's like an approximation to the full space of the model. And we fine-tune that approximation, and then we somehow modify the full weights of the model, as I understand it. I think that's a pretty reasonable [inaudible 00:27:08].
Alex
Yeah. Okay. Maybe the easiest way to say it is, instead of changing all the weights of the model, you're making a diff. And then that diff is much more succinct, and so now you can move it around as a file, as opposed to multiple gigabytes that you need to worry about.
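A minimal sketch of that "diff on the weights" idea using the Hugging Face PEFT library; the base model name and hyperparameters are illustrative, not what Minion uses:

```python
# Freeze the base model and train only small low-rank adapter matrices,
# which can then be saved and shipped as a small file (the "diff").
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # a tiny fraction of the full model
# after training: model.save_pretrained("my-adapter") saves just the diff
```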
Anton
Yeah, that's good.
Alex
And so there's been lots of cool work with swapping them out dynamically, or keeping a bunch of them in memory with lots of reuse. All very similar to how OSs work, by the way.
Anton
Yes. Yeah, things are converging in that direction. I think about that a lot right now, actually. And of course I think about especially the role of memory and storage subsystems when we start talking about these things as general purpose computers, which is my business.
Alex
Sure. Just to be clear, LoRAs are not as good. But the other aspect of it is, we're a small company. It's much more time- and cost-effective to be doing LoRAs. And so in practice, we ended up using it for everything because it saves time and money.
Anton
Right. Gotcha.
Alex
It was especially useful when GPUs were harder to come by.
Anton
So let's talk about some of the data. So fine-tuning, LoRA, whichever way you go about this, you need data sets, and you need labeled data sets, whether you're using supervised training where you have positive labels, like, "Hey, this is more of the stuff that I want," or contrastive, where you're able to do positive and negative labels. And we talked about two data sets: you mentioned human labeling, and you mentioned labels coming from GPT-4 or GPT-3.5. Let's start with human labeling. How do you create this data set for something like Minion, which is taking consecutive actions? It's a very causal sort of thing that you want the model to do.
Alex
It might be useful for builders. Some of the intuition that I learned on Copilot, and I guess I was one of the earlier people working with these things by chance, and the intuition is like, it gets better. You can always make it better. And so the progression from Copilot was, we got this artifact from OpenAI, it was-
Anton
That was Codex, right?
Alex
No, it was a midsize model. It was trained on code that they had, and they thought, "Hey, maybe we can do something with this." And we tried a bunch of stuff with it, and sometimes it was right. About 6, 8% of the time, you would ask it to generate a function with a test, and it would generate the function correctly. We would have to generate like 10 times, then run the test, then see which one was right, and blah, blah, blah. And it's like, oh, this is pointless. We tried to make a UI around it where you'd pick the right one, or it would show you all the options, all this kind of stuff. And then the next version of the model would do the right thing 15% of the time. And it's just these little jumps all along the way, 15 to 30, 35, 40, 50, 60. And by the time we shipped Copilot about 12 months later, we had probably the craziest eval you can imagine, which is, I can get into it if you want, but it started at some sub-10%, and then up around 60%, 70% would be-
Anton
Well, I mean, what was that eval? It sounds crazy, I'm very curious about it.
Alex
So we would download a random GitHub repository that was in Python and had pytests. We'd take a random pytest that we could execute successfully. We would blank out the body, have the LLM write the body, rerun the test, and now you've got a true/false. And so in the beginning, it always failed, and then incrementally over time it got better, with better prompting and better models. At that time we weren't fine-tuning; they were producing better models, and we were improving prompting a lot. And towards the end, 60, 70% of the time, we were generating a function body that passed. And that's insane.
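A hedged sketch of that eval harness; the regex, paths, and `llm` callable are illustrative, and a real harness would need more care (multi-line signatures, docstrings, and so on):

```python
# Blank out a function body, have the model rewrite the file, rerun the
# test, and record pass/fail. Illustrative only.
import re
import subprocess
from pathlib import Path
from typing import Callable

def eval_function_completion(source_file: Path, func_name: str,
                             test_path: Path, llm: Callable[[str], str]) -> bool:
    original = source_file.read_text()
    # Replace the function body with a placeholder the model must fill.
    blanked = re.sub(
        rf"(def {func_name}\(.*\n)(?:[ \t]+.*\n)+",
        r"\1    pass  # TODO: implement\n",
        original,
    )
    completion = llm("Complete the function body in this file:\n\n" + blanked)
    source_file.write_text(completion)
    result = subprocess.run(["pytest", str(test_path), "-q"],
                            capture_output=True)
    source_file.write_text(original)     # restore the repo afterwards
    return result.returncode == 0        # test passed means success
```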
Anton
To go from nothing to 70% is pretty remarkable.
Alex
Yeah. That's the function approximator idea. You start out with a very coarse understanding of what the real function is, and over time you're refining it, you're showing more relevant examples so that it knows how to interpolate at every point, to the point where you can get very, very accurate. And so there's this whole, I'm sure in a year or so people will be talking about it, but there's an emotional curve to developing these kinds of things, where you're stuck at 15% and it's like, this is never going to work, there's no way out, and you just have to... That's when you try random stuff, right?
Anton
Yep. And again, everyone we've spoken to says the same thing.
Alex
Really?
Anton
Yeah, absolutely. When we spoke to Flo, and with Rahul, it's the same thing, it's like we couldn't get it to do something, so we just started trying things, and eventually one of them stuck and we doubled down on it, and it worked out.
Alex
Yeah, it's crazy.
Anton
And that really seems like a pattern in developing with these things.
Alex
Yeah, it's crazy because it's also a fundamentally different way of writing software. And it's interesting because there's some people that are able to make that transition and some people that aren't, from a deterministic system that you can reason about to something that is much more trial and error.
Anton
Yeah, and I also think that, look, some of my favorite work in this whole space right now, just working with LLMs in general, even though it's not a practical application, is from some of the folks who are really pushing the edges of what the models do, having Claude talk to itself and seeing when it finally wigs out and does something crazy, or simulating a fake web, stuff like that. Because I agree with you, this is a very different way of developing software, but I also think that it has this kind of pernicious thing. All software has this problem, and it's the reason why you write bugs: you have a mental model of what you expect the computer to do, and when your mental model doesn't match what the computer's actually going to do, that's when you write a bug. In LLM land... at least computers are deterministic. With a computer, if you actually sit down and reason about it, because you have a spec of how it works, you're going to get to the right answer. With LLMs, no guarantees. All bets are off.
Alex
Yeah, sure.
Anton
Although it's good to develop an intuition like yours, where you're saying, "Okay, well, clearly I need such and such an example so that it can interpolate between them and actually work it out." Let's go back to fine-tuning for a minute, because I'm very curious about that. So we've got human labelers, and don't feel the need to reveal any secret sauce, but you've got human labelers, presumably they're performing these tasks, you have some way of recording what they're doing on these websites, and then you're asking the model to basically make predictions in the same way that we make predictions-
Alex
That's gone through a bunch of iterations. Okay, so I guess the other crazy thing to think about, and it's like magic, basically, which is that the models help you make the models better.
Anton
Talk about that.
Alex
So for instance, for the first several iterations of labeling, what we would do is we would use the existing model, or the best model that we had, maybe not the production inference model, but whatever the best model we had was, and we'd ask it for a generation. The labeler would then say yes or no, and keep generating until the right thing happens. And then we get into a world where, okay, well, the labeler knows what it's supposed to do, but maybe the labeler can hint it a little bit in the right direction. So we did that for a while, where we'd give the labelers the suggestion, they'd say that's wrong, then they'd tell us what to do instead, then we'd regenerate and then use it. And eventually the model knows how to reason about these cases well enough that you can literally just watch the labeler act, and then the reasoning makes sense.
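A small sketch of that accept-or-hint labeling loop, with hypothetical `propose` and `ask_labeler` callables:

```python
# The current best model proposes an action, a human accepts it or gives
# a hint, and the loop regenerates until something acceptable comes out.
from typing import Callable, Optional

def collect_label(state: str,
                  propose: Callable[[str, Optional[str]], dict],
                  ask_labeler: Callable[[str, dict], tuple[bool, Optional[str]]],
                  max_tries: int = 5) -> Optional[dict]:
    hint = None
    for _ in range(max_tries):
        action = propose(state, hint)            # best model plus optional hint
        accepted, hint = ask_labeler(state, action)
        if accepted:
            return action                        # keep as a training sample
    return None                                  # discard the hard cases
```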
Anton
Gotcha.
Alex
You're able to backfill the reasoning because you've got a model that's seen a bunch of stuff.
Anton
So for the label datasets, for the fine-tuning and the LoRas, what's the scale of these things, roughly? How much label data do you need?
Alex
Oh, not much. Way less than you think.
Anton
Order millions, order hundreds?
Alex
No, no, no, no. I'd say probably, for us right now, it's always in flux because you're changing data, you're shrinking datasets, you're rewriting datasets, all this kind of stuff. So I think probably we have, I'm going to guess maybe 140,000 samples, of which we use 30.
Anton
Thousand or 30?
Alex
30,000.
Anton
Gotcha.
Alex
30,000. I mean, that's like a refining process.
Anton
So you generate 140,000 samples, you refine them to 30,000.
Alex
Yeah.
Anton
How do you do that?
Alex
And that's always in flux.
Anton
Yeah. What's the refining step?
Alex
All sorts of stuff. Sometimes it's humans going through and being like, "This is wrong. Do this better." And then we'll toss it back to labelers with that human advice on what to do better. Then you train a model to do that, and now you've got a human doing it and an LLM doing it, and you calibrate the two, and you can do that at the... We care a lot about trajectories. Most LLM work these days is just single steps, so for us, we need both: there's correctness at the step level and correctness at the trajectory level. And so we have graders for all of that, there's hallucination checkers, there's reasoning checkers, there's, what else? Yeah, other stuff. There's lots of checkers.
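A minimal sketch of grading at both of those levels; the checker functions themselves (hallucination, reasoning, task completion) are hypothetical:

```python
# Run per-step checkers over each (state, action) record and
# whole-trajectory checkers over the full episode, then aggregate.
from typing import Callable, Sequence

StepChecker = Callable[[dict], bool]
TrajectoryChecker = Callable[[Sequence[dict]], bool]

def grade_trajectory(steps: Sequence[dict],
                     step_checkers: Sequence[StepChecker],
                     trajectory_checkers: Sequence[TrajectoryChecker]) -> dict:
    step_results = [
        all(check(step) for check in step_checkers) for step in steps
    ]
    return {
        "step_pass_rate": sum(step_results) / max(len(step_results), 1),
        "trajectory_ok": all(check(steps) for check in trajectory_checkers),
    }
```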
Anton
Gotcha. So after generating the data set, you go through this refinement process, multi-step, got a bunch of things, you've got roughly 30K samples, and then, is that for a fine-tune or a LoRa, or-
Alex
That's both. We tend to use the same data sets for both.
Anton
Which is more expensive? I won't ask you directly what the cost, but which is more expensive, the compute or the labeling? GPU hours cost money, labeling hours cost money. Which one ends up being more expensive to generate the status and then fine-tune on it?
Alex
I mean, at the scale that we're at now, the quality is still so high from humans that it's worth spending the money. As the model gets better, for instance, we're able to use self... We haven't talked about self-play at all, but that's a big part of it. So that's when you have the model steering itself, and you have a user model that's acting as a user would, and you're synthesizing tasks, because another part of this is synthesizing tasks. It's like, okay, let's say, not so much now because we're starting to see generalization effects, but let's say, okay, we want to handle Amazon. Let's use an LLM to generate some tasks. So an easy thing to do is to generate a very full description of a task up front and then have a labeler go and execute that task. But that actually introduces a bias, because most people don't know what they want ahead of time. You have to extract it from them. And so then another refinement step is to introduce a user model, where you give it a blob of information, which is a task description, and then you say, "Hey, act like a human would and answer questions as they're asked." We tried all sorts of stuff. We tried different bios to introduce randomness. I'd still like to introduce a schizo bio that just completely changes its mind all the time.
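A rough sketch of that self-play setup, where `agent_turn` and `user_model` are hypothetical wrappers around model calls:

```python
# A "user model" holds a task description and answers the agent in
# character, so the agent can practice full conversations without a
# human in the loop. Illustrative only.
from typing import Callable

def self_play_episode(agent_turn: Callable[[list[str]], str],
                      user_model: Callable[[str, list[str]], str],
                      task_blob: str, max_turns: int = 10) -> list[str]:
    transcript = [f"user: {user_model(task_blob, [])}"]   # opening request
    for _ in range(max_turns):
        agent_msg = agent_turn(transcript)                # act until it asks or replies
        transcript.append(f"assistant: {agent_msg}")
        user_msg = user_model(task_blob, transcript)      # answer in character
        if not user_msg:                                  # nothing left to say
            break
        transcript.append(f"user: {user_msg}")
    return transcript
```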
Anton
I mean, there's a recent data set of the web-generated personas, there's like a million personas.
Alex
I saw that, yeah.
Anton
There's probably something like that in there, if I had to guess.
Alex
Yeah, it's interesting. In here, you might run into a limitation of what a black box model might be able to do. Often, they're not creative enough to approximate what a human would do.
Anton
Yep. Well, if they could, they would, right? But-
Alex
Yeah, exactly. They just may not be-
Anton
The limitations are there. They have to be.
Alex
Yeah. There's an implicit bias in training sets, which is, people publish stuff that tends to make sense.
Anton
Yes.
Alex
Right? The stuff that's on the internet kind of makes sense.
Anton
Yes.
Alex
Humans don't always make sense.
Anton
Not even to themselves. People don't necessarily even hold stable preferences from day to day.
Alex
This is a fundamental belief of mine. In Copilot days, we tried a lot of things around "tell us what you want and then we'll generate it," and we found that, I think, generally people don't really think ahead about what they're doing. I think people are much more like a React model where it's like you're-
Anton
I actually think that's one of the more powerful user interface paradigms that come out of AI. And I don't think you could do that with traditional software very well before. With traditional software, you have to guide people through this very specific flow, first, do X, then do Y, then do Z. And the human using that software has to know what they want, not just overall, but at the time where you're presenting them with your option. With AI, you can drill down in whatever direction you want to go.
Alex
Yeah, it's true there's a lot more variability. I mean, there's a lot more [inaudible 00:39:45].
Anton
But even think about using something maybe Midjourney, wherever you're generating an image and the first image it spits out may not be anything like what you actually imagined, but it's different in ways that you can now get a handle on and you can move around in your own conceptual space, because it's this iterative interface. I think that's really, really interesting and under explored.
Alex
Yeah, for sure. For sure. I mean, it's the same thing with code generation, right?
Anton
Yes.
Alex
It doesn't always do what you want, but sometimes it's right.
Anton
Yeah. And sometimes it does help you reason about, even when it's wrong, it helps you reason about the thing [inaudible 00:40:12].
Alex
Yeah. I mean, there's a very deep question here about, I'm able to evaluate code, or an image, as to whether it's correct or not, much faster than I can search the space of all possible things, and so-
Anton
It's the generator-discriminator gap, it's classic, right? In ML, and it turns out probably cognitively, it's classic.
Alex
Yeah, it's interesting.
Anton
Stepping back and just thinking about this space in general, one of the things that I think is really important to think about as an AI application developer is, what's your plan for as the models get smarter? So I'll put that question to you. The models are going to get generically smarter over time. What does that mean for Minion?
Alex
Oh, yeah, that's a good question. What does that mean for Minion? I think if we do it right, we can ride the generalization, which I'm most excited about. For instance, one of the early findings is that we have some Amazon data, we don't have any Etsy, eBay, Shopify, Flipkart, any data on that. But the Amazon data improves the performance on all those other sites. And so that's really exciting. That means that you're generalizing. And so my hunch is that the web generalizes, which is extremely exciting, because that means that you can imagine an agent that is reliable for scenarios that are in the long tail.
Anton
Yep. But one can also imagine a future here where the web just goes away.
Alex
Yep. I think they're related. In some sense, the function-calling approach that Apple is trying to do is something that's more... It's a constrained space. I know these apps that I have installed, I know these APIs that they have, pick the right one. The web is much broader. Here I find myself on a random lawyer's website and I need to fill out a form, or I need to find the form to fill out, or I need to decide I've looked in all the obvious places for the hours that this restaurant is open and I can't find them, so what do I do? One of the early positions we took with Minion is that most of the things people call agents are API-calling: there's a bunch of APIs, you have the LLM call them, you insert the output into the prompt, and then you regenerate. And my hunch early on was that these APIs are so disparate. You've got one API that does a chunk of behavior, and then you've got another API that does another chunk of behavior. So what's to say there's not a gap between those two APIs where I can't do certain things? Whereas the web solves that, because the web is what's being... Maybe another way to put it is that the APIs that are exposed in the world are less functional than what you can do on the open web.
Anton
Yep. That's true.
Alex
If you can do it, it's on the web, more or less, whereas it's unclear with APIs. And so even if there's a little bit of a discord there, that means you've got to fill in all the gaps by manually writing APIs and figuring out where those gaps are. So I guess for me, with the web in general it's easier to interpolate between what's possible.
Anton
Right. Because there are more possibilities.
Alex
Yeah, yeah. It's like I'm not trying to bucket into these two APIs that I have, that both make sense. Instead, there's always alternate paths I can go down that maybe accomplish the same thing.
Anton
All right, two final questions. First one is, you've been working on this for pretty much as long as anyone, and when I say this, I mean AI applications.
Alex
Sure.
Anton
Probably as long as anyone. 2016 is a very early start to be trying to build something like this.
Alex
There's people who have been doing it much longer.
Anton
Yeah, of course, but the ecosystem is tiny and young, and it's growing now. So as a relative proportion, there are fewer people who've been at it for this long.
Alex
Oh yeah. And I have a total imposter syndrome because when I got into it in 2016, everyone had a PhD.
Anton
You don't need one anymore, thank God.
Alex
Yeah, I don't think you need one anymore. Maybe you never did. In fact, the reason I stopped working on it right before Transformers came and only picked it up in 2020 was because I'm like, "Well, I can't make progress on this without a PhD. This is stupid. I need a math PhD to make progress here." And that was wrong. I should have just stuck with it.
Anton
The world has changed very much since then.
Alex
Yeah.
Anton
Look, that's really the big explosion here is, you no longer have to be a part of a large AI research lab to pick up and work with this technology. It's as general, as ubiquitous as the web itself. I can hit an API or I can run even a local model these days and build something. So my question is something like, in hindsight, from where you are now, what decision would you have made differently or earlier, given your current knowledge? Specifically with respect to how you're building Minion.
Alex
The number of people that will look at 5% performance and say, "It doesn't matter, we'll get to 70 or 80 or 90," is very small. So stick through it. Stick with it. Unlike other software, where you make it and then it either works or it doesn't, there's this element of goodness. And in some sense, that's why you've seen AI be mostly true believers. Because you have to pound at it, think creatively, through lots of setbacks. I try a thing that I'm sure is going to work, and it doesn't work.
Anton
But I think part of that is also the fact that you do see it get better. And that's how you get that kind of true belief in the first place, is like, "Wow, this actually does improve if I keep hacking at it."
Alex
No, it's crazy, when you hear the old, crazy theories from OpenAI people, and you're like, "Oh wait, I get it now." It's like the idea that-
Anton
Well, especially when you're doing a pre-training run for very large models and you can literally see its capabilities improving, it gets much easier to believe that their capabilities will continue to improve.
Alex
Yeah, yeah. Or even the crazier stuff, like it's the future sending some pathway through the past to align these things. Because that's what it feels like, is like you're watching these things generalize, where generalization means I haven't shown it what to do, but it does the right thing, and that starts small, but I think it snowballs.
Anton
So last one, let's assume that I've decided I want to build something, what's the giant sticky-outy piece of advice that you should give that person? What should I know from industry veteran?
Alex
Ah, okay. No one knows what the answer is.
Anton
Yeah, I think that that's right.
Alex
No one knows, so-
Anton
It's too early.
Alex
It's too early, but it's also, again, you're dealing with language, which is infinitely complex, and now more and more we're dealing with the real world, which is even more infinitely complex. And so it's unclear to me that the answer is knowable. There's no one way to be, there's no one way to live, there's no one way to believe. And so I don't think there's a single way to build AI software. It's really very contextual.
Anton
Right. Makes sense. Cool. Thanks a lot.
Alex
Yeah, sure. Cool. No, good discussion.