Flo Crivello on Building Lindy.AI
AI Agents are a new category of software, built on top of large language models (LLMs). While the potential applications are extremely varied, best practices for building practical, robust AI agents are still being worked out.
Florian Crivello, founder and CEO of Lindy.ai, joins Anton to discuss the practical challenges, strategies, and tools Lindy is using to build AI agents. Flo shares insights on how to get LLMs to do what you want, how to detect and correct errors, and how to build trust with users. They also discuss the importance of experimentation, the future of AI agents, and the potential of AI in its most valuable and exciting domain: business process automation.
Released June 10, 2024, Recorded May 20, 2024
Transcript

Flo on X (Twitter): @Altimor

Anton on X (Twitter): @atroyn


Anton

I think AI is in this really interesting state, it's both overrated and underrated. I think people discount just how weird that is.

Flo Crivello

Yeah. You are going to be in a world where you can talk to your computer and your computer does stuff for you. How insane is that?

Anton

And then, overnight, it's like, oh yeah, of course I need to yell at my computer to get it to perform... It's crazy. It's insane. Building with AI is basically a whole new category of software. A lot of the things that we're used to doing, especially for people like us, who have a lot of previous experience in building software, just don't apply anymore. And things work very differently, and we're basically grappling with how to build real, solid products around these LLM systems. And you're building Lindy, which is a product built on top of LLMs, and we're building Chroma, which is something that you use when you're building applications for LLMs. And there's a lot of folk knowledge that people learn along the way, I think, whenever you try to build something, you gather a bunch of knowledge. And I think it's really useful for the community in general, especially as people just start figuring this stuff out for the first time, to learn a little bit about that, like what have other people learned? And that's what we're interested in. So, a good starting point is you should introduce yourself, and talk about Lindy just a bit.

Flo Crivello

Well, Flo Crivello, founder and CEO of Lindy, we're building a no-code platform for people to build their own AI agents to automate various business workflows. So, we think LLMs are awesome, ChatGPT is awesome, but it doesn't really do much, it just talks to you. We want, effectively, a ChatGPT that can actually do stuff for you, and automate parts of your business. And over the very long term, this turns into an AI employee. So, you train it, and it is a lifelong learner, it keeps learning and getting better, as it gets feedback from humans, from other AIs, from its environment. It can collaborate with humans and with AIs, it can delegate to humans and to AIs, it can escalate when it's unsure, and it can integrate with all of your applications. So, it's just like, you can go online... Just today, I actually used Lindy to kick off a search for a designer. So, it's like, hey, go online, go on Twitter, find a bunch of designers, I'll tell you yes or no, and then draft an email cadence for them.

Anton

Yeah. So, I really like this whole category of products. I've said it elsewhere, but I think I have one of the most boring opinions about AI anywhere in San Francisco or Silicon Valley, which is I think that AI is really for business process automation, that's where all the value is. And historically, most of the value created by software, or at least in terms of revenue, has been in business process automation. This is the thing that software is really for, it's the bread and butter. There's a lot to unpack here already. So, we're talking about an agent, we're talking about tool use. The first thing that I think most people start with is just wrangling the model to get it to do what you want at all. And I think we've all faced this issue of just prompt wrangling. So, let's start there. What are you doing with Lindy? And of course, you don't have to reveal your secret sauce, but I'm curious about how you manage even that first part, how do you get the model to do what you want?

Flo Crivello

Well, it is a lot of prompting, and this is actually relevant to you guys, because this is actually one area where we use Chroma. Anywhere... Well, in as many places as possible, whenever AI is invoked in our product, the user has a chance to chime in. So, you can put Lindy in what we call safe mode, and so we can be like, hey, you are going to, for example, just now, again, reach out to designers, and draft them an email cadence. But before you send them the email, it's like, show me the email that you're about to send. And so, Lindy drafts the email, and shows you that action card with the email drafted, and then you can review the email, send it if it's good to go, or edit it and then send it. Regardless of which you do, we get a really high-quality signal from you, because now we know for sure this was the right thing to do. And so then we do, I think it's Jay Hack, I believe you know him, who called it "the poor man's RLHF." But it works super well. So, what you do is that you take that signal, you embed it in the vector database, Chroma is a good one, you...

Anton

It'll be better even soon.

Flo Crivello

Hopefully. And then, as future requests roll in, you retrieve the closest examples, and you inject them in context.
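
As a rough sketch of the loop Flo describes, here is what the store-and-retrieve step could look like with Chroma's Python client. The collection name, metadata fields, and prompt format are illustrative assumptions, not Lindy's actual implementation:

```python
# A sketch of the "poor man's RLHF" loop: store user-approved actions,
# then retrieve the closest ones and inject them into the prompt.
import chromadb

client = chromadb.Client()
examples = client.get_or_create_collection("approved_actions")  # hypothetical name

def record_approval(task_description: str, approved_output: str, example_id: str):
    """Called when the user approves (or edits and sends) a drafted action."""
    examples.add(
        ids=[example_id],
        documents=[task_description],                       # what the agent was asked to do
        metadatas=[{"approved_output": approved_output}],   # what the user signed off on
    )

def build_prompt(new_task: str, k: int = 5) -> str:
    """Retrieve the k closest approved examples and inject them in context."""
    results = examples.query(query_texts=[new_task], n_results=k)
    shots = ""
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        shots += f"Task: {doc}\nApproved action: {meta['approved_output']}\n\n"
    return f"{shots}Task: {new_task}\nApproved action:"
```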

Anton

Yes.

Flo Crivello

And so, what we found is that by far, this has been the most effective method across everything-

Anton

That's super interesting.

Flo Crivello

Forget about prompting, basically, just do this, this is going to work on that level.

Anton

So, basically, have the human user say what they want, then review the generation, and if the generation is accurate, capture the user's input as the prompt for the next time.

Flo Crivello

That's exactly right.

Anton

That's really, really cool. One of the things that I think about, and I say a lot, is the way that we use LLMs today seems really primitive, in that typically we take instructions and data and we put them in the context of the model, the input field, and then it processes it, and then you have some output, you take that output, you do something else... It's exactly how you would use a punch card computer. And what you're describing is stored instructions, which is one of the first things required to turn these things into a general purpose computing system.

Flo Crivello

Right.

Anton

So, I think this is really, really clever. You have these instructions stored, which come from user prompts, which are the best possible thing because the user has approved the generation. Really, really great, that's a really neat workflow. So, the two things I immediately think of, one is, what if the model in future does something wrong? So, this user-generated instruction is great for a long time and for performing a particular task, but one day the model does something wrong. Because their strength and their weakness is their flexibility. The strength of the flexibility is that they can do anything with common sense; the weakness is that they might do anything, and it's hard to constrain. So, first of all, what happens when something goes wrong? How do you detect it, and what do you do about it?

Flo Crivello

Yeah, I think you can... There's two different strategies that you can adopt here. One of them that I really love is this idea of having agents check on each other. There was this tweet recently from this guy who was like, oh, agents... I set up a society of agents, and then when they came back they were just patting each other on the back, and going down weird ADD rabbit holes. And it turns out, and I was tweeting this screenshot, which is a real screenshot of Lindy, it's a software engineer at Lindy and an engineering manager at Lindy talking. And the engineering manager at Lindy is delegating tasks to the software engineer at Lindy, asking it like, hey, can you please implement this thing? And then, the software engineer at Lindy is implementing the thing. And then, the engineering manager at Lindy is like, "Hey, there is a syntax error, can you please fix it?" So, right here, this is the first example where it's like they can actually check on each other's work, and keep going back and forth until it's solved, until you have effectively the higher order ... loop implemented. Which, by the way, just a quick parenthesis, goes counter to what you very often hear from folks like Yann LeCun, that the errors compound.

Anton

Compound, yeah.

Flo Crivello

But I actually think that that's not the case, I actually think that you can have such a thing as a negative error in a given token. You can actually have the subsequent tokens catch the previous error and correct it. And you see that all the time when you actually touch these LLMs. So, the engineering manager was like, "Hey, can you please fix that?" And then it was at the time when GPT-4 Turbo was lazy. I think we're going to tell our grandkids about it, it's like, oh yeah, those were these weird models where the LLM would refuse to do what we told it to.

Anton

I look forward to that. I look forward to the stories because one day we're going to be the gray beards, like, yeah, back then we used to have to prompt our models ourselves.

Flo Crivello

That's right.

Anton

Anyway.

Flo Crivello

And it's like, "Oh, it's beyond my abilities to fix the syntax error." And so, the engineering manager goes something like, "Oh, you can do it, I believe in you." And it's like, "Sorry, I really cannot, it's beyond my abilities." And then it's like, "Do it or you're fired," and then it's like, "I fixed it."

Anton

It's incredible how you almost have to develop a theory of mind for working with these things, but the thing is, the theory of mind that you're developing is for a fundamentally alien system. So, it is surprising to me that the agents can self-correct. Is it the same model underneath-

Flo Crivello

It is the same-

Anton

... for both of them?

Flo Crivello

It is, absolutely.

Anton

Sure.

Flo Crivello

In a way, it's like a GAN-like architecture, where you're having these two-

Anton

Yeah, it's actor-critic almost. Right?

Flo Crivello

Absolutely. My mental model of that is that these large language models are just so large, they're so huge, that they contain the superposition of very many different personas. And so, when you prompt one of these models to tell it, for example, you are a software engineer, or you are a QA engineer, you effectively get another model altogether. And so, some interesting emergent properties can arise from the interaction between these different models.

Anton

Yes. Yeah, that's very interesting. Definitely it would be interesting to go deep on that, for someone to actually do the research work, and really model this thing as a conditional actor-critic, or some sort of GAN environment. I'm big on that hypothesis by the way, I keep going back and forth about what is the right mental model of these systems. And I think it's really important, and again, we're recording this now in 2024, in 20 years, we'll look at this and be like, of course, we should have been thinking this other way. But today, it's like, which mental model should we even use? What applies here? And I think this is why experimentation is so important. I think that's why often the result is counterintuitive. Yann LeCun is no dummy, but again, the errors, like you say, they don't seem to compound, and then there's this simulators idea. Okay. But that's the case where a model can self-correct in this way, talking to itself. What happens when it really doesn't self-correct? How do you detect that, and what do you do?

Flo Crivello

Yeah. So, this is still in the realm of experimentation, to your point.

Anton

Of course.

Flo Crivello

But some things that we're playing with is a conditional safe mode. So, today's safe mode ..., it is either in safe mode or it's not in safe mode. We are working on making safe mode work on a per-action basis. So, we're like, hey, you can check my email, but you cannot send emails in that safe mode. And then, we're considering making safe mode either yes, no, or sometimes-when, and so on. And the way you check that, when it's uncertain, there are multiple strategies, but the most naive one, which so far is surprisingly promising, is you can just look at the similarity score, the semantic similarity score between-

Anton

For the instruction and the...

Flo Crivello

Yes, for the instruction, for the history of what just happened, and the closest retrieved examples.

Anton

Gotcha.

Flo Crivello

Which I just mentioned. So, it's like, hey, this is a scenario, it's the first time I see it, I don't really have examples for it, so I'm out of my depth on this one. And then, there is more research coming out on, basically, how to get LLMs to act as classifiers, and how to make them the best classifiers possible with K-Shot examples. And more advanced strategies than K-Shot examples include, say, labeling each data point and explaining why you labeled it this way. And then the LLM tests its understanding of the classification by asking you questions like, hey, this is a new data point, am I getting it right that it's going to be classified as class A, because of this and that explanation that you provided me before? And if not, then it's going to keep updating, effectively, its world model, that's encoded in tokens and in words. So-
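
A minimal sketch of what that similarity-score check could look like, reusing the `examples` collection from the earlier sketch. The threshold value and the `ask_user_to_review` / `execute` helpers are assumptions for illustration only:

```python
# Sketch: decide whether to act autonomously or escalate to the user,
# based on how close the current situation is to previously approved examples.
# Chroma's default metric returns distances, where lower means more similar.
DISTANCE_THRESHOLD = 0.35  # illustrative value; would need tuning per task

def ask_user_to_review(action: str):
    print(f"[safe mode] please review before sending:\n{action}")  # stand-in for the action card UI

def execute(action: str):
    print(f"[autonomous] executing:\n{action}")  # stand-in for actually performing the action

def should_escalate(situation: str) -> bool:
    results = examples.query(query_texts=[situation], n_results=1)
    distances = results["distances"][0]
    if not distances:
        return True                               # no examples at all: first time seeing this
    return distances[0] > DISTANCE_THRESHOLD      # nearest example is still too far away

def run_action(situation: str, proposed_action: str):
    if should_escalate(situation):
        ask_user_to_review(proposed_action)       # safe mode: out of my depth, ask the human
    else:
        execute(proposed_action)                  # confident: act without review
```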

Anton

But it's in the context, it's still in-context learning, right?

Flo Crivello

Yes.

Anton

Ultimately, you're passing... So, let's talk a little bit about the way that you're thinking about using LLMs as classifiers. Because that exists also in retrieval land, you can use an LLM as a pretty expensive reranker, where you can ask it to, given this query and these results, which ones are relevant? And you can use it as a classifier, or a filter, or what's called a reranker in retrieval. So, how do you get this thing to act as a classifier? For example, you've given me this example where it justifies its choice, and then presumably you say, yes, that's right, but this is wrong. The next time you want it to act as a classifier, are you putting that feedback back in the context window? How are you doing that?

Flo Crivello

So, we haven't implemented that research yet-

Anton

Gotcha.

Flo Crivello

But there's a paper that I can send you afterwards.

Anton

Okay.

Flo Crivello

Yes. As far as I understand, it's something like, hey, this is the new data point that you have, these are the previous explanations in your classification, which of these explanations apply? So, for example, in the case of a job candidate, it's like, oh, no, because he doesn't have much experience, and that maps to no, right?

Anton

Gotcha.

Flo Crivello

Yes, because he worked at Google. No, because he worked at Twitter, however. And so you have all of these explanations, and so you end up actually encoding the latent space instead of having this totally unsupervised learning classifier.
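
A hedged sketch of the K-Shot-with-explanations part of that setup, where each labeled example carries the reason behind its label. The prompt format, model name, and toy data are assumptions for illustration, not the paper's or Lindy's actual approach:

```python
# Sketch: K-shot classification where each labeled example carries an explanation,
# so the model can map a new data point onto the reasons behind past labels.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

labeled_examples = [  # illustrative data, not a real dataset
    {"candidate": "8 years backend at Google", "label": "yes", "why": "strong, relevant experience"},
    {"candidate": "Fresh bootcamp graduate",   "label": "no",  "why": "not much experience yet"},
]

def classify(new_candidate: str) -> str:
    shots = "\n".join(
        f"Candidate: {ex['candidate']}\nLabel: {ex['label']} (because: {ex['why']})"
        for ex in labeled_examples
    )
    response = client.chat.completions.create(
        model="gpt-4o",          # placeholder: any capable chat model
        temperature=0,           # deterministic, per the approach discussed later
        messages=[
            {"role": "system", "content": "You are a resume screening classifier. "
                                          "Answer with exactly 'yes' or 'no'."},
            {"role": "user", "content": f"{shots}\n\nCandidate: {new_candidate}\nLabel:"},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```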

Anton

Right. It becomes explicit ...-

Flo Crivello

That's exactly right.

Anton

Right? It becomes explicit, and then maybe because you have even a retrieval system driving the classifier examples, it's almost going backwards in time in a way.

Flo Crivello

It's almost an expert system in a way.

Anton

Yeah, it's almost an expert system now, again, which is very funny. But it speaks to just how much experimentation there is still yet to do. One of the things that's very true about computing is that there are ideas people at one time wanted to work and couldn't get to work, like expert systems. And in your case, the idea of a browser agent or a user agent is like ancient history. It's been around for as long as we've had networked computers. You want the computer to go off and do something for you in the world. It's never been made to work, but now maybe it does. And there's probably millions of things like that, that we're just blind to. So it's so exciting to hear ideas like that come back around. So, okay, Lindy has safe mode, and you're looking at doing per-action safe mode, right? And so the flip side of that is, of course, the user learns to trust the system over time, right?

Flo Crivello

Yeah.

Anton

So, it's interesting to think about this from a user perspective. It's like, when do you expect people to develop that trust? What is the point that really flips them over? Is there one, or is it just a gradual buildup of confidence? What do you do? Because ultimately, if I have to check all of its output, I'm in a worse position in a way.

Flo Crivello

Yeah, absolutely. I think the current safe mode, which is just a blanket safe mode, solves that, because you do have to check all of its output. I don't think you're totally in a worse position, because now you just have to review and click, click, click, click. So you can still 10X, 100X your speed. This question, by the way, which is a great one and super top of mind for everyone, is a UX question, and I think nailing the UX is one of the most important parts of agents. We are experimenting with a few different things. One of them is what I just talked about, the example database. What is a way to hydrate that example database ahead of time, right? The way you do that is that you retrieve previous examples of times when the agent would've had to do work. And so it's like you set up your agent, and it's a recruiting agent. And suppose you ask your recruiting agent like, hey, I'm receiving a bunch of resumes in my inbox, please help me make sense of all this chaos, and these are the rough heuristics you use. And then you send them to this classifier that I mentioned. And then the agent will be like, all right, let me go into your inbox and retrieve the last 50 resumes that you've received, and I'm going to do my job in safe mode, and you're going to tell me yes or no, yes or no, yes or no. And because that training loop that I just mentioned is so fast, you see it get better in real time, effectively. And as you know, K-Shot prompting is very, very effective.
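
A sketch of what that hydration loop could look like, reusing `classify` and `record_approval` from the earlier sketches. `fetch_recent_resumes` and `get_user_verdict` are hypothetical stand-ins for the product's own integrations and review UI:

```python
# Sketch: warm up the example database at build time by replaying history in safe mode.
def fetch_recent_resumes(limit: int) -> list[str]:
    # Hypothetical integration: pull the last `limit` resumes from the user's inbox.
    return ["8 years backend at Google", "Fresh bootcamp graduate"][:limit]

def get_user_verdict(resume: str, prediction: str) -> str:
    # Hypothetical review step: show the prediction on an action card, get yes / no back.
    return input(f"{resume} -> predicted {prediction}. Correct label (yes/no)? ").strip()

def hydrate_examples(n_items: int = 50):
    for i, resume in enumerate(fetch_recent_resumes(limit=n_items)):
        prediction = classify(resume)                     # classifier sketched above
        verdict = get_user_verdict(resume, prediction)    # user clicks yes / no
        record_approval(
            task_description=f"Screen resume: {resume}",
            approved_output=verdict,
            example_id=f"resume-{i}",
        )
        # Each verdict immediately becomes a retrievable K-shot example,
        # so the agent visibly improves as the user reviews.
```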

Anton

It's very powerful.

Flo Crivello

It doesn't take a lot of examples, right? After five examples.

Anton

Surprisingly a few.

Flo Crivello

It is really surprising. And so, it shows you like, hey, this resume is a pass, and you're like, no, that wasn't right, this was a good one. And you do that more and more and more and more, and the hope is very quickly you'd be like, holy cow, after five or 10 examples, it sort of got it. And very quickly you're like, yeah, you got it, you got it, you got it, you got it, you got it. And then you're like, all right, you can go and do your job.

Anton

But are you keeping those K-Shot examples in the context the whole time? So from then on, once the agent is running, it always has the K-Shot examples. As long as it's executing the same task, you're retrieving the same examples? Or...

Flo Crivello

Well, it retrieves the closest examples semantically.

Anton

Got you. But as long as it's performing the same task?

Flo Crivello

That's right.

Anton

So conditional on the task that you're asking it to perform, it has the same K-Shot examples in the context?

Flo Crivello

Today it's conditional to the agent, so you can create as many Lindys as you want.

Anton

Got you.

Flo Crivello

And then on a per task basis, it's just based on the task and the instructions of the task, and we do embedding on the examples themselves and we'll retrieve the closest examples.

Anton

Very cool. It's interesting to think about this in a fleet deployed mode. You have Lindys that are today performing all of these different recruiting tasks, but ultimately businesses have a lot of functions that resemble one another. And in fact, in terms of business process automation, we've built organizations and companies in such a way that we push all these tasks to the edges and tried to make them as uniform as possible. Because they're the parts that nobody really wants to do and we don't want to spend any money or resources on. And so there's probably some sort of future here where a Lindy agent can bring back what it's learned and make the whole fleet, across everyone's Lindys, better. Right? That's the idea.

Flo Crivello

100%. That's exactly the dream. I think there is this dichotomy between build time and run time in computer science, and I think very often today with LLMs, people succumb to the temptation of merging the two altogether. But it's like-

Anton

Tell me about that. So unpack that a bit for me. What do you mean by build and run time in this case?

Flo Crivello

So run time is when your agent is running and build time is when you're creating your agent.

Anton

So when you're creating your agent is when you are doing, for example, the K-Shot prompting. Where you're keeping it in safe mode and you're evaluating all... you would say that's the build part?

Flo Crivello

That's right.

Anton

Okay.

Flo Crivello

That's right. It's very tempting to conflate the two modes in agents because... I'll give you a very concrete example. The first way people start to build AI agents is that they prompt the AI agent and they tell it what to do, and then they run that prompt of what they told it to do. And again, this is the most naive way to do things, because when people tell the AI agent what to do, they're just expressing an intention. But the prompt is not just an intention. The prompt is functional, it is a functional component in your architecture. And so what are the odds that the way you phrased your intention happens to be exactly the right set of tokens that the LLM needs in order to do this job? That's actually pretty low. It's a miracle that it works at all with this approach.

Anton

It's very surprising.

Flo Crivello

And so more and more, as you build the agents, you end up building more and more distance between this first string that you gave the agent, which encapsulates the intention.

Anton

And the actual execution of the task.

Flo Crivello

Exactly, right. Eventually you end up having this huge layer in between both, which is basically a compiler.

Anton

That's a really important insight, right? Because all the sort of early efforts at LLM backed agents were really focused on finding this perfect prompt. I think that you're right, I think that they were discarding or compacting away what you're calling this build stage. Building up the agent's actual ability to perform the task. Because as you said, even a few tokens of examples matter a lot. I mean, all of this stuff is really cool because it's all been done for the first time. I never get tired of thinking about this stuff. Do you do anything clever with the output side? So for example, what I'm talking about, a fairly commonly used technique is beam search during decoding. What that allows you to do is you get basically, rather than asking for just one generation, you can ask for several and you can actually do, for example, conditional decoding, you could say, no, I like this branch more than this other branch. I can do things stochastically a bit. Do you do anything like that? Or just take the straight output?

Flo Crivello

We pretty much just take the straight output.

Anton

You find that that's robust enough?

Flo Crivello

Yeah. Yeah. We've not really done the whole tree search or what's called self-consistency or anything like that in our experience, because it's just so much more expensive. Right?

Anton

Yeah. I was about to ask about cost next, actually. Right? Are you doing anything to reduce the cost while, for example, an agent is running or in the build stage that you described? Are there any techniques, tricks?

Flo Crivello

What have we done to reduce cost? No, we've not done a ton. And I think a lot of that, I'm sort of cavalier when it comes to cost because, look, how often do you get to build on a new technology that gets literally 10X cheaper per year, 100X every two years, 1000X every three, roughly speaking? That's been GPT-3.5's cost curve, and GPT-4's, roughly.

Anton

With zero installation cost, I might add. That's actually very unique in the history of computing, that we have something here that just gets better for free and gets faster for free. And I think that we'll keep seeing that, because everybody who is training these LLMs really wants them to be a platform ultimately. And so, I think we'll get to a point where it's like they're giving it away for free, and it's the developer affordances and the speed around it that really matters.

Flo Crivello

Right, right.

Anton

And the product affordances around it, which really matter. I mean, again, at the very start of this AI applications wave, which I would put somewhere in November '22, is when people really started building on top of the LLM APIs, instead of just interacting with them directly. This concept of the GPT wrapper emerged, and it's been very derided and talked down to. And it's like, well, really, you're just wrapping the model and getting it to do something. And I think that that's a really mind-killing lens on how to build and what to build, because it's sort of like, unless you're training a better model, you have no business building in AI, which is just foolish. Somebody, I think it was... yeah, Nat Friedman tweeted the other day, you don't hire a random person, you hire a janitor. And the reason for that is because of comparative advantage and division of labor; division of labor emerges from comparative advantage. And when you have this giant general purpose system, people want to understand how to apply it to their task.

Flo Crivello

Yeah. I think of it as concentric circles, where the deeper down you go into the stack, the less differentiated the layers are. The LLM is not so differentiated, the compute that the LLM is running on is even less differentiated, the electricity that's running the compute is not differentiated at all. So you get less and less differentiated the closer you get to the core, and the closer you get to the user in these outer layers, the more differentiated you have to be, because every workflow is just so unique.

Anton

Let's talk about actually developing Lindy, right? So at the start we said this isn't really traditional software development, and we've talked about things like retrieval to do instruction, storage, and things like that. What, if any, tooling are you using or have you adopted internally to help you develop this AI application? Beyond the sort of, everybody by now is probably using things like Copilot or other coding assistants, and certainly I use the various chat LLMs to help me debug and read documentation. All that stuff is great, and I think that's very generic. What I'm really asking about is specifically in developing an AI application, is there a tooling or process that you've adopted that you find particularly helpful?

Flo Crivello

Evals are great. Obviously you just have as many evals as you can.

Anton

How do you construct evals? Let's talk about that. I think that's a very important question that nobody's really quite cracked.

Flo Crivello

We've built our own framework with a bunch of different strategies. It's almost like a unit testing/integration testing framework that allows us to... it's very purpose-made for the kind of agent we're building, which interacts with the world through APIs and API calls. And so it lets us mock APIs, and set up expectations on what API call will be performed with parameters taking this shape, and then mock the answer back from, for example, the Google Calendar API, and also simulate the end user. So we have another LLM simulating the end user. Sometimes you have your unit tests go rogue and do crazy stuff. They stop talking to each other and just go rogue.

Anton

Presumably you have some sort of kill switch or timeout that just says, after this many tokens, you have to stop.

Flo Crivello

You failed. Yeah.

Anton

Yeah. You failed.

Flo Crivello

Exactly.
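
A toy version of the kind of harness described here, with a mocked calendar API, an expectation on the shape of the call, a simulated end user, and a token budget as the kill switch. Every name in it is made up for illustration; it is not Lindy's framework:

```python
# Sketch: an eval that mocks a calendar API, asserts on the call the agent makes,
# simulates the end user, and fails the run past a token budget (the kill switch).
MAX_TOKENS_PER_RUN = 20_000  # beyond this, the test simply fails

class MockCalendarAPI:
    def __init__(self):
        self.calls = []

    def create_event(self, **params):
        self.calls.append(params)
        return {"id": "evt_123", "status": "confirmed"}  # canned calendar-like reply

def simulated_user(conversation: list[str]) -> str:
    # In the real thing this would be another LLM playing the end user; here a scripted stub.
    return "Yes, 3pm on Friday works for me."

def test_schedules_meeting(agent):
    # `agent` is a hypothetical object whose .step() returns (reply_text, tokens_used).
    calendar = MockCalendarAPI()
    tokens_used = 0
    conversation = ["Please book a 30 minute intro call with Alex next week."]
    while not calendar.calls:
        reply, used = agent.step(conversation, tools={"calendar": calendar})
        tokens_used += used
        assert tokens_used <= MAX_TOKENS_PER_RUN, "agent went rogue / never called the API"
        conversation += [reply, simulated_user(conversation)]
    # Expectation on the shape of the API call, not its exact wording.
    params = calendar.calls[0]
    assert "start" in params and "attendees" in params
```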

Anton

Yeah. So what do you do when a test fails? What's the step?

Flo Crivello

We look into it. We've got a Slack channel, and we run daily evals, and we're like, hey, this eval dipped by X percent since yesterday. And the on-call looks into it and triages the failures.

Anton

What are the main sources of failure so far? Because we know that when we access a model over an API, we're reliant on the provider of that API to sort of maintain capabilities inside certain bounds. And of course, like you said earlier, this year the models got lazy, which was very funny, but not very useful. Is that mostly what you're mitigating for?

Flo Crivello

No, the LLMs have been surprisingly reliable. Actually, I'd love to throw them under the bus, but they've been great so far. One of the main sources of failure, frankly, is on our side. It's just like...

Anton

Really?

Flo Crivello

... there are so many layers of engineering on top of it that it's like, we change one thing or we change one part of the prompt, and it touches everything else. So we run them at temperature zero, so it doesn't tend to be the LLM,

Anton

Okay. So it's fairly deterministic.

Flo Crivello

Yeah. And yeah, it's just stuff we do, most of the time.

Anton

Given that some of the function that Lindys perform is communicating with humans, do you find keeping them at temperature zero sort of stultifies their language, makes them sound very stiff if they're drafting an email or anything?

Flo Crivello

That's not been my experience at all. No.

Anton

Okay. You can prompt around that.

Flo Crivello

Yeah. Yeah. Yeah, I know that running at temperature zero is sort of counter-indicated for certain use cases. We run almost all our LLMs at temperature zero, and it's been fine.

Anton

Does that give you a fairly high amount of determinism, you find?

Flo Crivello

Yeah.

Anton

And that allows the same unit test to basically-

Flo Crivello

Yeah.

Anton

You can expect by running a temperature zero, that the unit test should produce the same output. Right?

Flo Crivello

That's right. And it makes it easier for us to debug, and to reproduce bugs and so forth. Yeah. Otherwise, I mean, we've built our own debugger and our own playground, especially for function calling. We haven't found a really good playground for function calling. And so in our debugger, you can interlace system messages with user messages, with agent messages. And then you can look at all the functions that were available, and you can change the signature and so forth.
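
A small illustration of the temperature-zero point: the same function-calling request should produce (near-)identical output across runs, which is what makes diffing runs and reproducing bugs practical. The model name and the tool signature below are placeholders, assuming OpenAI's Python SDK:

```python
# Sketch: run the same function-calling prompt twice at temperature 0 and diff the results.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "send_email",            # placeholder tool signature, not Lindy's
        "description": "Send an email to a recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are an email assistant. Use the tools provided."},
    {"role": "user", "content": "Email alex@example.com to confirm Friday's call."},
]

def run_once() -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", temperature=0, messages=messages, tools=tools
    )
    # Assumes the model elects to call the tool; a fuller harness would check for that.
    return resp.choices[0].message.tool_calls[0].function.arguments

# Temperature 0 is not a hard determinism guarantee, but in practice repeated runs
# should match closely enough to reproduce and diff failures.
assert run_once() == run_once()
```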

Anton

Would you ever think about releasing any of those tools to the community?

Flo Crivello

No.

Anton

No?

Flo Crivello

I would think about it, probably.

Anton

It's your competitive advantage.

Flo Crivello

No, I wouldn't say it's... It's just like, it's not how we're going to succeed as a business by releasing these things as open source. And so it is like a distraction to release them.

Anton

Yeah, understandable. You can't of course just dump them. But I really think that we ought to be, as a community, talking much more about this type of tooling. Right?

Flo Crivello

Yeah.

Anton

People are still stuck even on how to build good evals. But once you go one step past that, it's like, okay, evals are great, evals will give you the equivalent of an integration test for an existing software system. But what you're really looking for is, what happens when I change parts of this system? So tooling like that I think is actually increasingly important. And I actually think just our developer workflows haven't been worked out very well. Sort of to change tack: in this conversation, we've been talking about accessing LLMs through an API, through a provider. Have you experimented much with the sort of open weights LLMs yet?

Flo Crivello

Yet? We used to do it more, because the alternatives were just so cost prohibitive. And so we were just very hopeful that we could get an open weight LLM to-

Anton

So it was cost that pushed you in that direction?

Flo Crivello

Yeah. And in recent months really, and recent days: Gemini 1.5 Flash was just released, and before that, Claude 3 Haiku. Haiku is incredible. In my mind, OpenAI has really let this small class of models slip out from under them over the last few months. But the recent small models, again, the Haikus and 1.5 Flashes, are amazing. Really, really good.

Anton

Yeah, 1.5 Flash, I've really gotten great performance out of. It's, again, very surprising, but it's at the point where we're seriously finding ways to include, especially the lightweight LLMs, directly in the retrieval system. Because you can just put a lot of intelligence into how you retrieve, if you have access to a model call.

Flo Crivello

100%

Anton

It's just exciting to see. I guess, okay, so because the costs have dropped, you're not looking at open weights models so much anymore.

Flo Crivello

Yeah.

Anton

What about the control that you get over those models? For example, Lindys are, again, business process focused. Most businesses have very similar tasks, so presumably they'll be spinning up Lindys to perform similar tasks. That opens the question of fine-tuning.

Flo Crivello

Yeah.

Anton

Right? Do you think, or have you tried getting more juice or better performance, by fine-tuning?

Flo Crivello

We have. Yeah, yeah, yeah. We have two folks inside the team, they're just fine-tuning models all the time.

Anton

Are they using the hosted fine-tuning or are they fine-tuning local models? What does fine-tuning look like for ...?

Flo Crivello

They're fine-tuning local models. Yeah.

Anton

Okay, great.

Flo Crivello

So we have GPUs, and... Yeah. It's not a huge amount, but I more and more am of the opinion that that's not what's going to drive the business. It feels like a late stage optimization, more and more. Yeah. If you'd asked me three months ago, six months ago, my answer would've been different, as I've seen these models get better and better, number one, cheaper and cheaper, number two. And almost most importantly, as I've seen competition go up for all of these models, now there's like five credible players making these models, right?

Anton

It's a big change.

Flo Crivello

There's Anthropic, OpenAI, Microsoft, Mistral, Meta, Google. Six credible players. That's a really good position to be in. And so there isn't that existential risk anymore for us of having an input over which a given player has a monopoly. We know that we're going to have six potential providers for that input. So that de-stresses things for us quite a bit.

Anton

What are the switching costs like between the model providers?

Flo Crivello

It used to be higher.

Anton

What made it higher in the past?

Flo Crivello

It's the fact that each model used to handle the same prompt very differently. But as models get more mature... Realistically, there is one correct answer. There is the best answer to a given prompt. And so with the most mature models, it's kind of like that saying: every happy family is the same, and every unhappy family is unhappy in its own way.

Anton

... Tolstoy, yes.

Flo Crivello

Yeah. So it's the same. Every small model is cheating in its own way, but all the big models are-

Anton

Are good in the same way.

Flo Crivello

They're good in the same way.

Anton

Which kind of is almost to be expected, right? If you're converging on this, let's call it a general purpose simulator of human behavior, you would expect there to be one. At least at temperature zero. So actually, it is that kind of convergence. If what you're pushing for is human-level intelligence out of a machine, then we should expect all the human-level intelligences to behave roughly the same way in most cases. That is very interesting to hear. I also kind of keep hearing, and agree, that switching costs are relatively lower, which of course, as we see with GPT-4o becoming free, and we'll keep seeing this, I think that increasingly the LLM companies will want to build platforms around the models themselves. But look, burning GPU hours on this, it makes the entire ecosystem better. I actually think in many ways, right now we live in a state where the limit is no longer really the model capability. I actually think right now it's exploration of the use cases. It's exploration of the state of what we can build.

Flo Crivello

100%.

Anton

And this is what I'm trying to encourage people to do, and part of the reason why we do these talks, is I want to show people that you can just try building stuff. And Lindy is a really interesting concept, because, as you said, it's this no-code automation stuff, which again has been a dream basically since computers were invented, but now we maybe have it. Building with this stuff, because you're conversing with almost like a human-like character, it's almost easier to build with than regular programming. It's more flexible. Do you find that at least? Or do you find that the intermediate layers of engineering are still very heavyweight?

Flo Crivello

No. Well, do you mean as a builder, or as a user of this thing?

Anton

As a builder.

Flo Crivello

No, I actually find it harder, because it is non-deterministic, or rather because there is no instruction manual.

Anton

We can't predict it.

Flo Crivello

Exactly.

Anton

We can't predict it. Yeah.

Flo Crivello

Exactly.

Anton

That's a really good way of putting it.

Flo Crivello

So it's like, we all went into computer science because it's deterministic, we know what to do. If it doesn't work, it's not because it's in a bad mood or whatever, it's because you screwed up somewhere. And here it's like, no, actually, the model really did get lazy overnight, right?

Anton

Yes.

Flo Crivello

And it's like, fuck, that sucks, right? And I have found this to be true: it's really more of an art than a science, to prompt these things.

Anton

Yes. Yes. Which is kind of fun.

Flo Crivello

Sure.

Anton

I mean, it's kind of fun. I like the fact that things break. And things in LLM land, unlike in... So this is one of the personal reason why I find this less frustrating Today. I've been struggling with Python environments. And I've been struggling with Python environments because some human programmer somewhere has decided something should work differently from how I think it should work. And now my goal in fixing up my Python environment is now trying to understand what that programmer meant for me to do. And that's frustrating, because I would've not done it this way. I would've chosen something else, I would've chosen the metaphor that I'm obviously applying wrongly right now. With the LLM, it's different. It's more like, oh, I'm talking to the wrong person. If I can find the right way to talk to you, I'll get it to work the way I want it to work. And that's more fun. It feels much more exploratory, even though it's frustrating.

Flo Crivello

That's fair. Yeah. Yeah.

Anton

But maybe I'll change my mind in six months. We'll see.

Flo Crivello

I do sometimes miss the determinism of the old systems, where it's like, we sometimes just bang our heads on a problem, and we come up with these complicated solutions, all this stuff, and then someone tries a dumb prompt, and it just works, and we're like, fuck, we spent four hours on this damn thing.

Anton

Yeah. I've actually kept up this mantra in my head, where whenever I start to apply some heuristic or whenever I've started to try to filter the output too much, I'm like, "Just ask the model. Just remember to just ask the model."

Flo Crivello

Same here.

Anton

I have to keep that in my head.

Flo Crivello

100%. It's like we always say, exhaust the prompting solutions first, before we start moving to fancy stuff.

Anton

Are you doing anything just simply from the developer environment, or logistics side? How many engineers are working directly on the product right now?

Flo Crivello

We are 14 now.

Anton

14?

Flo Crivello

Mm-hmm.

Anton

So you have 14 engineers. How are you sharing the developer environment? How is each engineer accessing the LLM? Do you have different API tokens, is it one organization? Practically, how is this set up?

Flo Crivello

I think it's just run of the mill. I think it's probably a bad practice. I think we're using the same key for probably... No, every engineer's got their own key for development. And then we have one key that's used between staging and production.

Anton

Gotcha.

Flo Crivello

It's just the same key for everything.

Anton

And I guess I already asked you the cost question, but you're not really looking at the development expenses in terms of tokens.

Flo Crivello

No, it's not. It's not major.

Anton

Yeah. I mean, it's interesting, right? Because on the one hand, if Lindy is successful, there's going to be many, many more user tokens than developer tokens.

Flo Crivello

That is true.

Anton

Yeah.

Flo Crivello

Yeah, exactly.

Anton

So, that makes sense. One issue currently, and it's always improving, and again, we've got Gemini 1.5 Flash and GPT-4o, which are pretty quick, and the turbo models were the first demonstration that you could actually make LLMs fairly fast. Is inference time proving to be a bottleneck anywhere in development, in deployment, for users?

Flo Crivello

It used to be more of one when the models... I don't know if you remember, but GPT-4 was slow.

Anton

Oh yeah, I remember.

Flo Crivello

The first few months of GPT-4 were bad. It was so fucking slow, that thing. And it got faster and faster. Just a week ago, it got 2x faster again. So speed is no longer an issue. Then at some point, the issue was the quotas, and the RPMs or TPMs.

Anton

Rate limiting.

Flo Crivello

The rate limiting was a huge issue. No longer really much of one.

Anton

What changed? Is it that the providers raise their rate limits? I still hit rate limits fairly often for some of my experiments.

Flo Crivello

We don't really anymore. Yeah, I think they just increased the capacity. I think they both got more GPUs, and I suspect the latest models that they're running are probably smaller, and they've had a bunch of inference optimizations. Yeah.

Anton

Yeah. I think a lot of the platforms have been investing very heavily in inference optimization.

Flo Crivello

They have to.

Anton

Yeah, they have to. One thing again that I think about as well, is we're a little bit limited in the way that we run these models. And there's an efficiency ceiling. And earlier, what you said about this kind of, like, the build and run dichotomy is applicable here. Which is, we run training and inference on the same hardware, right? And we run them on GPUs. And GPUs are these massively parallel computing systems, single instruction, multi-data, right? But in order to get good efficiency from them, you have to batch everything, right? The whole way to use GPU infra efficiently is to keep the pipelines full. You never ever want any warp to not be processing something. So you're batching as much data as possible, but, as we talked about earlier, as we start pushing these things to be more dynamic computing elements, we're going to have to move away from batching somehow. I think about that a lot. I don't know what the answer is, but it's really, really hard to load and unload data from a GPU. So, I don't know what that's going to look like.

Flo Crivello

It feels, to me, like the batching of data happens at a sufficiently low level, so even though the high level-

Anton

Yes. You don't see it. Sure.

Flo Crivello

Right. Not only that, but the GPU doesn't see it, even as the applications get more and more unique. It just doesn't really matter, I think.

Anton

Because it's just fast enough that even if the pipeline is kind of empty, it's just going to burn, pretty much. Yeah. I can see that world, too. I think it's an interesting one. What's missing between what Lindy wants to be today as an application and where it is? I don't necessarily mean from your side, because there's always more to build, but in terms of capabilities, affordances, tooling, APIs, anything, what's the wish?

Flo Crivello

Are you asking what we're hoping the ecosystem would come up with? Or-

Anton

Yeah. Or something that you find yourself constantly missing. Evals is that for a lot of people. A good replicable way to do evals for a given project is something that I think many people in the ecosystem are looking for. It sounds like you've solved that, but I'm curious about what's next beyond that.

Flo Crivello

Nothing that's out of our hands. I can't really say we feel super bottlenecked by any outside forces apart, obviously, from the model. Sure, we'll always take the better, cheaper, faster model that we can get, but apart from that, I think what's holding back agents in general today is still reliability, right? And so, I think there's multiple schools of thought here. Some of them are like, "Oh, we're just going to wait for the model to get better," and some of those schools of thought are more like, "We are going to keep iterating on the cognitive architecture in order to increase this reliability." Our experience is that the same model with different cognitive architectures can have reliability that's 10X better. And so, we never struggle, throw our hands up, and go like, "Oh, well, we'll just go wait for the GPT-5s to come out," because we keep seeing that curve move up and to the right for the same model: as we keep making the cognitive architecture better and better, the same model keeps getting better and better. And we're not running out of ideas. We have a huge roadmap, and we know what to do next.

Anton

Has there ever been a change in the model where you're like, "Actually, we don't need this extra stuff anymore. The model can just handle this"? Has that happened for you yet?

Flo Crivello

Yeah, at the context window, for sure. I mean, we-

Anton

Of course. Yeah, yeah, yeah.

Flo Crivello

Yeah, yeah. So we used to have a huge part of the app that was called the Pruner, which pruned the context window using very-

Anton

And summarized and all this other stuff. Yeah.

Flo Crivello

Yeah. It was very convoluted logic, and we just killed that thing as the context windows basically became infinite. And all of that stuff was really just pruning... A lot of our issues used to come from the Pruner. So for example, the model used to have a first pass where it would select the tools that it wants, and then a second pass where we would inject these tool signatures into its context window. We've removed that entirely. Now it just has all the tools. Yeah, a bunch of stuff like that related to the context window.

Anton

Nice. Yeah, it's interesting. It's interesting to imagine. It's almost like a Red Queen race, right? The model improves, which means that a lot of the stuff that you did previously for cognitive architecture is obsoleted, but now you're like, "Okay. Well, how do we make this model perform that much better?" Right?

Flo Crivello

Yeah.

Anton

Generally, actually, I think I'm on your side when it comes to, what do we do? Do we wait for the model to get better, or do we build today? One of the reasons that I'm on your side is because I actually think that direct experiences with how people actually use these things is extraordinarily valuable, and the sooner you get it, the sooner... When the new model does come out and does have more capabilities, you understand better how to deploy them because you understand what people want out of it. And yeah, okay, maybe we will one day, maybe any given product will one day get to a point where it's like, okay, well the model capability is now at a point where even the way we're asking people to use it no longer makes sense. But I'm not sure how long that really is, and I think that, again, encouraging more experimentation is what we ought to be doing.

Flo Crivello

Yeah. The way I think about it, and what I always tell the team, is we should optimize for the medium term, which is like an 18-month time frame, which I think is the right-

Anton

That's how I think, too.

Flo Crivello

... sweet spot for start-ups to optimize over. I also think we will always keep that in the back of our mind whenever we're building something, we're like, "How is this impacted by the march of models getting better? Is this made obsolete, or is this just an evergreen thing that we know, whatever the model capabilities, all this is just going to make them better?" Right? And we try, obviously, as much as possible to just work only on these things, except when, as was the case of the context window, this is a blocker and we just can't do anything here.

Anton

Yeah, can't do it.

Flo Crivello

And I find nowadays, let me think, I think almost every single thing we're doing now, we're like, "Yeah, we just need this, we want this, because it's just going to make any model better over the long term." Yeah.

Anton

I guess just to wrap up, and the final thing, what are you most excited about for the next six to 12 months in building in AI specifically, right? What future do you see as someone building with these systems that excites you the most?

Flo Crivello

I'm just so excited to see these agents become real, whether we are the ones making them real or someone else completely will. Look, it is going to be real. It's going to exist. It is going to be crazy. You are going to be in a world where you can talk to your computer, and your computer does stuff for you. How insane is that?

Anton

It's very cool.

Flo Crivello

You and I grew up at a time, you probably saw floppy disks just the same as me, right?

Anton

Yeah.

Flo Crivello

And it's like, and now we can talk... The very thing that I was just saying about the engineering manager and the software engineer just talking together on a... I was looking at this, and I was like, "I can't believe this is where we are in 2024, where I have to look at my computer talk to itself and yell at itself to do stuff." Right? And that's just what's so exciting to me. We're going to see this tech be real.

Anton

It's so interesting that, first of all, I think AI is in this really interesting state. It's both overrated and underrated. I think people discount just how weird that is. It's weird that you can talk to your computer. People have just completely accepted it as normal, but if you showed people five years ago the same capability, they wouldn't believe you.

Flo Crivello

No.

Anton

There's no way. And then overnight it's like, "Oh yeah, of course I need to yell at my computer to get it to perform." It's crazy. It's insane. So I think it's underrated how unusual these things are. And I actually think one of our limiting factors currently, as I mentioned earlier, is we're not exploring how weird this is. We're not looking at how weird it is. There's a few people doing great work in this direction, and I really support them. I think they're going kind of slowly insane from interacting too much with these computers, but that's okay. I'm glad they're doing it. Somebody has to. It's like industrial hazards. And at the same time, AI is overrated, and AI is overrated because of these reliability issues that we face, and the applications aren't totally clear. Also, frankly, we've marketed it wrong for the last three or four years. I think it was a huge mistake ever calling any of this generative AI. I think that ultimately these are general purpose information processing systems, right? Calling it just generative puts the wrong thing in people's heads.

Flo Crivello

I agree. I agree. I think to your point, I forgot who it is, I think it's Bill Gates who said people always overestimate the impact of technology over the short term and underestimate it over the long term. I think it's impossible to overestimate AI over the long term. I think, look, AGI is coming. It is just going to happen.

Anton

Yeah. One question that I like to think about is, okay, we have AGI, now what do we do? What do we do with it?

Flo Crivello

Right.

Anton

That's-

Flo Crivello

It's like the dog that caught the car.

Anton

Yeah, exactly. And it's like, okay, imagine you could have any employee that you wanted for any task, a thing that could just go off and perform arbitrary tasks. What's the limit? And the limit is, again, our ability to figure out what to do. It's no longer the capability, in many ways. And of course, there's open questions about, okay, well, what's it going to take to get there, and what's it going to look like when we arrive? If we have one AGI and it's super expensive to use and run, not economically viable, how long is it going to take to get that cost down to a point where we can just deploy them to every computer, you know? So there are dynamics in that, but I do like to sit around thinking about the maximalist case from time to time. I'm like, "Okay, what if we have these things?"

Flo Crivello

Yeah, I spend a lot of time thinking about that. What does it look like once people are no longer limited by bandwidth, effectively, by time, by bandwidth, by money, by team, by network, by all of that stuff? And this was the subject of a talk I gave last year, which is exactly that. It's like I think we have seen that happen before, right?

Anton

We have.

Flo Crivello

It's like, look, we now have that with our phones and computers. You can create-

Anton

We have with the web.

Flo Crivello

100%, right? You can create anything and you can put it online. And if it's any good, people are going to find out about it. And I think we're going to find the same thing where it's like, today, certain entities in the world, the very big companies, have a lot of resources, they have a lot of time, they have a lot of horsepower in terms of engineering and marketing and all of that stuff. Soon we're going to find ourselves in a world where any random 15-year-old has the same capabilities as these companies and is going to be able to have the same impact on the world, and is going to be bottlenecked purely by their own ideas and their own ability to deliver them with the technology.

Anton

And we've seen that in waves, and this is a new one, and it's part of the reason why I love working in AI so much: there's just so much potential. I really do often compare it to the early web. And for me right now, it's like we're in the GeoCities era. We're in the static website era. As many chat-with-your-documents applications as Chroma currently helps serve, for me, it's pretty obvious that this is very primitive, because what it's doing is exactly the same thing as a webpage did. It takes some data and presents it to you in a new way, right? So, obviously, there are applications like process automation, like what you're doing, and agents. And I like to think that we're going to get these weird things that we can't even think of yet, and that'll take people exploring. Last question: what advice would you give to somebody who's really curious about it, they have some ideas, they want to build an application? Where would you tell them to start? Let's say that they know a little bit about programming. Maybe even they've done some serious software engineering in the past. What should they do?

Flo Crivello

Just get started. Just do it. I find the mistake I see people make most often is they do work about the work, or they plan the work. They do a Coursera course, or they go networking. It's just, build it. Just do the fucking thing, right? It's like, it's going to be dull. It's going to be buggy. You're going to use the worst technology. Just take every possible shortcut that you can to get some momentum going, to get something that runs in people's hands, ideally, right? And don't overthink it, just build shit.

Anton

Yeah, exactly. No, I would agree with that. Great. Well, thanks a lot, Flo.

Flo Crivello

Yeah, thanks for having me, man.

Anton

Thanks for coming.
