Rahul Sonwalkar on Building Julius
Large language models are capable of capturing user intent in a common-sense way, and translating it into code. This new approach to software empowers many users who don't have the time or knowledge to write code themselves, allowing them to instead focus on the task at hand. Making sure that the translation is accurate is a challenge, requiring all-new tooling and techniques.
Rahul Sonwalkar, founder and CEO of Julius - the AI data scientist, joins Anton to discuss how they use large language models to write code, integrate LLM tool use, detect and mitigate errors, and how to quickly get started and rapidly iterate on an AI product. They also discuss how thinking about products in terms of 'GPT wrappers' kills creativity, and what kinds of experiments AI developers should be trying today.
Released June 26, 2024, Recorded May 25, 2024
Transcript

Julius

Rahul on X (Twitter): @0interestrates

Anton on X (Twitter): @atroyn


Anton

What you've just described is an AI driven software engineering process, which Julius AI is already using. You've adopted this so seamlessly. I think that's super powerful.

Rahul Sonwalkar

The model writes code, runs code, shows you a cool visualization about your data. And for each new user who comes to Julius, and there are thousands of new people who come in every day, we really obsess over showing them that wow moment in the fewest steps possible.

Anton

Again, the point of this series is to basically talk about the practical things about building with AI, right?

Rahul Sonwalkar

Yeah.

Anton

AI applications, AI software development, it's a totally new field. Even though it feels like it's moving superfast, it's like about 12 months old. The people who started building with AI only really started building in November '22, and today we're in May '24. It's just not that long. We're all figuring it out. A lot of people building come from a software background, and I think it's valuable to talk about that. But before we get started, why don't you introduce yourself and Julius?

Rahul Sonwalkar

Totally. I'm Rahul. I'm the founder and CEO of Julius. Julius is an AI data scientist that helps you analyze data sets, get insight from your data and create good-looking visualizations. We launched about 10 months ago, and since then have crossed half a million users, and as of this week, Julius writes and executes over a million lines of code every 48 hours.

Anton

That's incredible. I love that metric about the model writing code, and most people who haven't really played with this technology aren't even aware that the models can write code, right?

Rahul Sonwalkar

Yeah.

Anton

Maybe that's a good place to start. How do you get the LLMs to write code? What do you do? How do you make that happen?

Rahul Sonwalkar

Totally. In short, it's primarily prompting. We use models that are available publicly in most cases, and we realized at some point in time that these models were really good at writing code because they were helping us as engineers write code. And we realized that code in itself is a really powerful tool that needs to be democratized. There's a lot of people out there who could use the power of code, but just don't know how to write code or how to run it, so Julius helps them do that. Julius, you can just talk to it and we take the English commands and then turn that into code, run that code, and let the AI decide whether that output is sufficient or not.

Anton

You get that feedback loop built in almost with writing code, as opposed to with language. You execute the code. Either it did what you want or it threw an error or something, right?

Rahul Sonwalkar

Absolutely. Let's say you ask an AI to write an essay, you don't have any feedback loop. It will just output the essay or an article, but with code, the beauty is that you can break a complex task down into chunks and progressively write some code, then execute it, see the output and decide whether this is the direction you want to proceed to tackle this task, or you need to try a new approach. Kind of like a human engineer would.
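The loop Rahul describes can be sketched in a few lines of Python. This is a toy illustration, not Julius's implementation: `ask_model` is a hypothetical stand-in for a real LLM call, scripted here so the second attempt fixes the first attempt's error.

```python
# A minimal sketch of the write-code / run-code / inspect-output loop.
# `ask_model` is a hypothetical scripted stand-in for a real model call.

def ask_model(task, error=None):
    # Scripted model: first attempt has a bug, second attempt fixes it.
    if error is None:
        return "result = total / count"          # NameError: names undefined
    return "total, count = 10, 4\nresult = total / count"

def run_with_feedback(task, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        code = ask_model(task, error)
        scope = {}
        try:
            exec(code, scope)                     # execute the generated code
            return scope.get("result")            # success: return the output
        except Exception as e:
            error = f"{type(e).__name__}: {e}"    # failure: feed the error back
    return None

print(run_with_feedback("compute the average"))   # → 2.5
```

The point is the shape: each failed execution becomes new context for the next attempt, which is the feedback loop an essay-writing task never gets.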

Anton

It's interesting. I always hesitate to apply a human or anthropomorphic theory of mind to these models, but it is helpful as a shortcut a lot of the time, right?

Rahul Sonwalkar

It is helpful as a way to build the right products.

Anton

Because thinking about it in that way at least helps you think about how a human would use this, right?

Rahul Sonwalkar

Yeah.

Anton

Let's drill down a little bit more into getting models to generate code. What does that look like, from literally calling the model? How does it work? Do you just send an English instruction, like generate the code in Python?

Rahul Sonwalkar

Great question. One of the good things about these models is that they were meant to be general purpose models, and this makes them super versatile. You can use a model to write articles and do creative writing, but you can also get it to write code. But that's also challenging when you only want code, and you want it to focus on writing code and debugging code for a particular use case.

Anton

Because the model typically returns text. We've all had the experience with ChatGPT, where it's returning way too much commentary and all you want is the answer, right?

Rahul Sonwalkar

Yeah, absolutely. What we do is we heavily use tool use. Tool use is you're giving the model a prompt and you're saying, "Hey, this is who you are, this is the kind of tasks you're going to do, and here are a bunch of tools that will help you achieve the task."

Anton

And that's a specific API. It's like a field... Because typically with prompting, you have the role field on each message. So you have the system prompt like you're describing. It's like, this is who you are and what you do. Then you have a user role, which is input coming from the outside that a user might be putting in. Does tool use live in the role in the system? Where do you put these tools?

Rahul Sonwalkar

Great question. Today, as of the recording of this video, both OpenAI's GPT-4 and Anthropic's Claude support tool use in their API. There's a field called role, which is system, user or assistant. Then there's content, which is the actual text you're passing to the model. And then there's a new field, it's either functions or tools, where you're describing in simple English what the tool is supposed to do. Your tool could be anything, actually. You could describe in English whatever you want the tool to be, and then you also describe what the output or the input for the tool should be.

Anton

As API for that tool almost, right? The arguments to a function, hence function calling.

Rahul Sonwalkar

Exactly. The arguments to a function, hence function calling. Let's say I'm making an agent to schedule meetings and I can say, "Hey, you are an agent. Your name is Anton. You schedule meetings. Your input will be the email, which has a bunch of times, and the tools available to you are check calendar, create invite, and send email." And then you can take that and give it to the model, and then the model can parse your email and decide, okay, it looks like the participants in this meeting have these availabilities. Let me check the calendar for the person who called me, and I'll call the check calendar tool. Now, you have to think about what are the parameters I would want the AI to give me so they can successfully pipe this output into my Google calendar API. You can say date, time, all those could be your inputs to check calendar and get the availability for the slots. And then the next would be a different tool called create invite or send invite, where you can define the participants, you can define what the time slot is, what is the email of each participant, and what's the event name, event description. All these could be parameters that you're telling the AI you need in order to use this tool. This is how tool use works.
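Declaratively, the calendar agent's tools might look something like the following. This is a sketch in the OpenAI-style `tools` format (field names vary by provider and API version); the tool names and parameters are illustrative, not a real product schema.

```python
# Hypothetical tool declarations for the meeting-scheduling agent described above.
# Each tool has a name, an English description, and a JSON-schema parameter list.

tools = [
    {
        "type": "function",
        "function": {
            "name": "check_calendar",
            "description": "Check the caller's calendar for free slots on a given day.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "Day to check, YYYY-MM-DD"},
                },
                "required": ["date"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_invite",
            "description": "Create a calendar invite and email it to all participants.",
            "parameters": {
                "type": "object",
                "properties": {
                    "participants": {"type": "array", "items": {"type": "string"},
                                     "description": "Participant email addresses"},
                    "start_time": {"type": "string", "description": "ISO 8601 start time"},
                    "event_name": {"type": "string"},
                },
                "required": ["participants", "start_time", "event_name"],
            },
        },
    },
]
```

The system prompt plus these descriptions are what the model sees; it then decides which tool to call and with what arguments.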

Anton

The model basically is capable of figuring out how to use the tools in context, and then what does it emit when it decides to use a tool? Because again, we're used to the model emitting English text, but now it sounds like you want it to emit a function call. Is there a specific field on the output? Do you have to find it in the text the model emits?

Rahul Sonwalkar

One quick caveat there is off the top of my head, both Anthropic and OpenAI support function calling natively in their API, but there are ways you can prompt engineer this into other models as well.

Anton

Sure.

Rahul Sonwalkar

For the longest time, the way we did it for Anthropic was just put into the prompt a bunch of functions and tell the model, "Hey, here's the tools you have and you can use these tools." And then put very specific instructions on what text to output when it's using the tool, and then use some sort of Regex to parse that in the model's output as indication that the model is trying to use a certain tool.
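The regex approach Rahul describes might look like this. The sentinel format (`<tool>...</tool>` wrapping JSON) is an assumption for illustration, not the actual instructions Julius used; the idea is just that the prompt tells the model exactly what text to emit, and a regex picks it back out.

```python
import json
import re

# A sketch of prompt-engineered tool use: the prompt instructs the model to
# wrap tool calls in a sentinel, and we regex that out of the raw text output.
# The <tool>...</tool> sentinel is a hypothetical convention for this example.

TOOL_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def extract_tool_call(model_output):
    match = TOOL_RE.search(model_output)
    if match is None:
        return None                      # ordinary text, no tool call
    return json.loads(match.group(1))    # e.g. {"name": ..., "args": {...}}

output = 'Let me check. <tool>{"name": "check_calendar", "args": {"date": "2024-05-25"}}</tool>'
call = extract_tool_call(output)
print(call["name"])   # → check_calendar
```

This is exactly the fragility Anton raises next: nothing forces the model to emit the sentinel correctly, which is why native function-calling APIs are easier to work with.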

Anton

Of course, then you're stuck with the problem of running a Regex over unstructured human language, which is... It's one of the biggest problems that we have: the flexibility of the models is a blessing and a curse. It's very difficult to actually definitively constrain them.

Rahul Sonwalkar

True.

Anton

Let's talk a little bit about what happens. Let's go back to our calendar app example, and then we'll unroll this all the way back to Julius. You've extracted that the model says here, make a function call. Now we're exiting the model execution environment and we're back in the, let's call it traditional software execution environment. What happens next?

Rahul Sonwalkar

I CC my agent on my email and say, "Hey Anton, I want you to schedule a meeting." The way I would build this application is I would pass the entire text of the email thread into the model whenever it's CC'd, and I would present to it a few different tools: check calendar, send invite, send email.

Anton

Suppose it calls that check calendar tool in the way that we described, but it's not actually... It doesn't have access to the tool itself. You have to call that for the model. How does that happen, and what happens next?

Rahul Sonwalkar

Great question. We want this model's output to trigger a chain of deterministic actions that we can do, and use the model as a decision maker to choose the action.

Anton

But how do you literally... Behind that check calendar, there's got to be some actual traditional functions, some code?

Rahul Sonwalkar

Yes. The way you would do that is when the model streams the output back, you look for the function being called. If you're actually just Regexing, you look for the Regex, or in other cases, when-

Anton

When it's native.

Rahul Sonwalkar

... it's native, the role will be function or the role will be assistant, but then there's a new field type, which is function.

Anton

Oh, because it could be type content or type function.

Rahul Sonwalkar

Yeah.

Anton

Got you. Makes sense.

Rahul Sonwalkar

This is off the top of my head. And then in the content, it would actually... When it's calling a function, it will give you the arguments that you have told it.

Anton

And then you pass the args and then it calls the function.

Rahul Sonwalkar

And then you parse those args and pass them into a deterministic code function. Let's say you have an integration with Google Calendar, and you would say, I need these and these arguments as required fields. And then you can also specify optional fields. In this case, the required would be, sure, you want to check the calendar, but what day do you want to check the calendar for? What are the days?
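The dispatch step, parsing the model's arguments and routing them into deterministic code, might be sketched like this. `check_calendar` here is a hypothetical local function standing in for a real Google Calendar integration, and the registry layout is an assumption for illustration.

```python
# A sketch of dispatching a parsed tool call into deterministic code.
# Required fields are validated before the real function is invoked.

def check_calendar(date, timezone="UTC"):
    # Stand-in for a real calendar API call, with canned availability data.
    busy = {"2024-05-25": ["10:00-11:00"]}
    return {"date": date, "timezone": timezone, "busy": busy.get(date, [])}

REGISTRY = {"check_calendar": {"fn": check_calendar, "required": ["date"]}}

def dispatch(call):
    entry = REGISTRY[call["name"]]
    missing = [f for f in entry["required"] if f not in call["args"]]
    if missing:
        # Feed this back to the model so it can re-issue the call correctly.
        return {"error": f"missing required fields: {missing}"}
    return entry["fn"](**call["args"])

result = dispatch({"name": "check_calendar", "args": {"date": "2024-05-25"}})
print(result)
```

Returning an error object rather than raising keeps the loop alive: a malformed call becomes feedback the model can react to, just like a failed code execution.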

Anton

And the model's populating those arguments, and you send it along and you get some response from, let's say, the Google calendar API, and it's up to you to process that. Unrolling the stack, going back to Julius a little bit, obviously Julius isn't about just piping function calls through a UI. There's a lot more to it. And you've already talked about how there's multi-step reasoning built in because you want to support people who don't really want to understand how to do the function calling themselves. Let's unroll that a little bit. We get output from a function that we call deterministically based on arguments that the model is passing us and say, "Hey, I want you to call..." Basically what the model is saying is, I want you to call this for me. You go out and you call it, maybe you process it some more. But in this multi-step environment, like with Julius where you're asking to analyze data, we have to also provide feedback about the result of that function call to the model. And I imagine chaining those gets complicated, but what do you present back to the model after it's executed that? Do you tell it, okay, that succeeded, that failed? What do you do?

Rahul Sonwalkar

When this is natively supported in the API, your life becomes a lot easier because the native APIs, they support-

Anton

I hope Google heard that.

Rahul Sonwalkar

... the function. They support function output as a new field in the API, so you can present the output of the function back to the model. In the case of calendar-

Anton

As a user role, prompt function outputs is the field?

Rahul Sonwalkar

I think the role in that case is function.

Anton

Oh, interesting. You can put... Oh, okay, I got to try that. I haven't tried it yet. Because I still live in that world where you're Regexing the outputs. I haven't tried the full fat APIs yet.

Rahul Sonwalkar

If you're Regexing it, you would probably say in a user prompt, you would say-

Anton

This was the output.

Rahul Sonwalkar

This was the output.

Anton

Got you.

Rahul Sonwalkar

I ran that code. This was the output. What should I do next?

Anton

Okay, makes sense. And then the model-

Rahul Sonwalkar

You'd run it in a loop. And my hope in the calendar example is after checking the availability, the model will probably decide to call, send invite.
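The outer loop Rahul is describing, call the model, execute any tool it requests, append the tool's output to the conversation, repeat until it answers in plain text, can be sketched as follows. The `model` function here is a hypothetical scripted stand-in, and real APIs differ in the exact role name used for tool results ("tool", "function", etc.).

```python
# A sketch of the agent loop: tool outputs are appended to the conversation
# and the model is called again, until it stops requesting tools.

def model(messages):
    # Scripted stand-in: first check the calendar, then send the invite.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool_call": {"name": "check_calendar", "args": {"date": "2024-05-25"}}}
    if len(tool_results) == 1:
        return {"tool_call": {"name": "send_invite", "args": {"start": "14:00"}}}
    return {"text": "Invite sent for 14:00."}

def run_tool(call):
    # Canned deterministic tool outputs for the sketch.
    outputs = {"check_calendar": "free after 13:00", "send_invite": "ok"}
    return outputs[call["name"]]

def agent_loop(user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = model(messages)
        if "text" in reply:                      # plain text: we're done
            return reply["text"]
        result = run_tool(reply["tool_call"])    # execute the requested tool
        messages.append({"role": "tool", "content": result})

print(agent_loop("Schedule a meeting with Anton"))   # → Invite sent for 14:00.
```

In a production system this loop also needs a step cap and error handling, but the shape, append tool output, call again, is the whole trick.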

Anton

Got you. Let's talk about Julius, because we've talked about a very specific task-performing model, but Julius is general purpose. Julius is designed for... Mostly right now people are using it for data analytics, if I'm right, data science?

Rahul Sonwalkar

Yeah.

Anton

Obviously, data comes in all different shapes and sizes, it comes in different sources, and the tasks you want to perform with it are super, super dynamic. We've talked about these three tools. What are the tools that Julius presents to the model?

Rahul Sonwalkar

Great question. When we first started, it was one tool, which was run Python code. Because Julius was built over a period of two or three weeks, it was very bare bones when we launched. We didn't even have time to design the UI. For all the icons, we would just use emojis. It was-

Anton

That's the trend now, right?

Rahul Sonwalkar

Yeah. But when we launched, it was-

Anton

Right?

Rahul Sonwalkar

Yeah.

Anton

Yeah.

Rahul Sonwalkar

But when we launched, it was just one or two function calls. But what we realized as we-

Anton

Wait, which function calls did you have at launch?

Rahul Sonwalkar

It was, I believe, off the top of my head, it was read file and the other was run Python code, which was kind of like a catch all-

Anton

Okay. So in the response to that run Python code tool, the model would put Python code in there for you to run in your interpreter?

Rahul Sonwalkar

Exactly.

Anton

Okay. That's primitive. That's early.

Rahul Sonwalkar

That's early. And [inaudible 00:14:29] effort back then was spent towards setting up the infrastructure for the code interpretation. So for context, each user on Julius gets their own sandbox. You can upload your files to the sandbox. It has some CPUs, some memory, and the model can freely run code in the sandbox and do things and iterate in a loop and do the task you give it. So the bulk of the effort was spent towards making those secure, getting those up and running, and then building the interface. I think that's one of the things about building applications: you probably want to innovate on things that will help you build a better product right off the bat, and then get users and then iterate.

Anton

Yeah, that's what I always say for people building with AI today. And you and I have talked about this many times, and I mentioned it also previously when I spoke with Flo, but sometime in early '23 people started using this phrase GPT wrapper.

Rahul Sonwalkar

Yeah.

Anton

And I think that's such a mind killing phrase. It immediately kills thinking and creativity about what we actually can do.

Rahul Sonwalkar

Yeah.

Anton

Right? And I think that people still don't understand what people want out of software or something that's running on their computer is they want it to perform a defined task.

Rahul Sonwalkar

Yeah.

Anton

That's what they care about, right? And they don't want to think about how do I wrangle this general purpose system into performing this specified task for me, right? Which is essentially like, Julius smoothed that process for data science a lot. Right? So I really want to highlight at this part how much contact with the real-world user actually matters when building with AI, right? You have to figure out what really matters, what they care about. And there were surprises for you along the way as well, right? Julius has really taken off, if I remember, in the education context right now too, which is super, super cool.

Rahul Sonwalkar

Absolutely. So when we first launched, we thought the users for this would be data scientists and data analysts. Turns out most of the users are people who have data on their hands and have a lot of curiosity about that data, but they lack the expertise to do the data science or data analysis on it. And it turns out the market is just crazy.

Anton

It's so empowering, right?

Rahul Sonwalkar

Yeah.

Anton

AI is empowering because it puts all of these things within reach. So despite the fact that I'm building a software company, I hate using computers. I don't like the fact that the problem I'm actually trying to solve forces me to go through this bottleneck of writing code. So now I have to learn how to write code and use the computer appropriately to complete the task that I actually want to complete.

Rahul Sonwalkar

Yeah.

Anton

So I can imagine being a scientist and being very frustrated by this because you have this rich complex world that you understand very deeply and you're only able to express it. It's like a foreign language. It's like when you're learning a language at the start, you can only express yourself very poorly in it. So imagine you have these great ideas in science, but your only language is code, which you don't understand very well.

Rahul Sonwalkar

Yeah.

Anton

Julius gets that out of the way, which is super, super cool.

Rahul Sonwalkar

Thank you. But absolutely. So to your point about GPT wrappers, it really depends on who's asking. I feel like there's some people who ask from a place of good faith, and they're just genuinely curious, like, does this use GPT, or what model does it use? And for them I have a really honest answer like, hey, this is what we do and this is how we built it. And I actually want more people to build really cool AI applications.

Anton

I'm the same with you. That's why we're talking.

Rahul Sonwalkar

Yeah.

Anton

Julius should be an inspiration to many people.

Rahul Sonwalkar

Yeah. There's so many people who don't get started because they want to build something specialized and really complex.

Anton

Yes.

Rahul Sonwalkar

So there's this whole thing about-

Anton

Classic trap.

Rahul Sonwalkar

It's a classic trap. It's a nerd trap where I want to look cool to my colleagues or my friends. And this is a running joke I have with Tristan who is the founder of Readwise. And he's in Toronto and they do a lot of stuff with AI and I told him, "Dude, one of the lowest status things you can do in San Francisco today is build a GPT wrapper." Because you don't want to go to a party and say, I'm building an AI application. And they ask me, so are you training your own model? And you say, no, we are actually using the best models for what they're good at. And then building a product with it.

Anton

Yeah. Low status is winning.

Rahul Sonwalkar

Low status is winning.

Anton

Low status is winning.

Rahul Sonwalkar

Yeah. Yeah. Ultimately, our users don't care what model we use.

Anton

That's right.

Rahul Sonwalkar

Our users really care about what problem we solve for them.

Anton

Yeah.

Rahul Sonwalkar

And if training our own model is the best way to solve that problem, we'll do it. If using somebody else's model is the best way to solve the problem, we'll do that.

Anton

Yeah.

Rahul Sonwalkar

But ultimately we are building something that people want.

Anton

It's honestly kind of the same for Chroma. I mean, when we launched, people were like, "What's difficult about [inaudible 00:19:28] nearest neighbors search in vector space?"

Rahul Sonwalkar

Yeah.

Anton

It's like, that's not our job. Vector search is the way that we perform our job today. And scalable vector search is something that's necessary to perform our job. But what Chroma is really for is getting data to the models so that you can build applications with data that the models can then process for you. That's what we do. And if there was a different way of doing it, I would just do it that way.

Rahul Sonwalkar

Yeah.

Anton

So it's similar. It's what does the user actually care about in building your application? I want to go back to a little bit more of the technical details because we got started, we've talked about this one iteration of this loop, right?

Rahul Sonwalkar

Yeah.

Anton

And we've talked about the happy path where our agent manages to perform the task and gets the feedback that, hey, you're making progress.

Rahul Sonwalkar

Yeah.

Anton

You mentioned at the start of our conversation here that Julius does this multi-stage.

Rahul Sonwalkar

Yeah.

Anton

So let's first talk about that. Obviously there's the very rigid way of doing things, which is like first perform this and then if you succeed, perform that.

Rahul Sonwalkar

Yeah.

Anton

But that's not what Julius does, right?

Rahul Sonwalkar

Yeah, absolutely. So there's some degree of freedom we give to the model. And one of the things, one of the little secrets about building Julius is we figured out pretty early on that you can actually hack the shit out of... Can I cuss on this podcast?

Anton

You can cuss on this, you can do whatever you want.

Rahul Sonwalkar

We figured out pretty early on that we could hack the shit out of two [inaudible 00:20:50]. And I'll share a couple examples of how we do that. But it's sort of like, these models are becoming really good at picking tools to use.

Anton

Yes.

Rahul Sonwalkar

So the more specialized tools you make, the better they become at performing tasks.

Anton

That's very interesting because that's a long way away from giving it a Python interpreter, right?

Rahul Sonwalkar

Exactly. When you give it a Python interpreter, sure, it'll write Python code. But when you give it a tool that says fix this specific kind of error in the Python code, it suddenly becomes a lot better at fixing that kind of error.

Anton

It's like getting the model to focus even though it's general purpose.

Rahul Sonwalkar

Exactly. Get the model to focus even though it's general purpose. And we only came to this realization after we got a lot of users and we noticed patterns in which the model would succeed, and patterns in which the model would fail. And you realize, okay, these are the common patterns and these are the common errors that happen in the code when you take the code-

Anton

What are some of those? Would you share some of the concrete ones?

Rahul Sonwalkar

Yeah, a very simple one is module not found. You give it a Python interpreter and you tell it, hey, the user's giving you this task, run code to do the task. And it tries to import some modules, runs that code and realizes, oh, this module doesn't exist in the environment. Now there's a couple of different things that happen. One is you can simply install the module, but you want to make sure that the version of the module is compatible with the other packages in the dependencies and there are no conflict issues. Another example of this is fixing name not found. What we realized is when users upload data sets, the data sets are really messy. So sometimes the model has a very limited understanding of what the user's dataset is. It tries to write code and run the code and it runs into a bunch of errors. And when we realize through some telemetry that, oh, this error is related to the dataset, we give it functions like clean dataset, which has very... So it's not just the name of the function. The function has a field called description where you can give it really, really specific descriptions and instructions on, hey, try these five things, because that's where most errors tend to be.
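A specialized tool of the kind Rahul describes might look like this: the `description` field doubles as a focused mini-prompt. The tool name, parameter, and instruction list are illustrative assumptions, not Julius's actual schema.

```python
# A sketch of a specialized tool whose description carries detailed fix
# instructions, narrowing the model's attention to one class of error.
# The name, fields, and instruction list are hypothetical.

clean_dataset_tool = {
    "type": "function",
    "function": {
        "name": "clean_dataset",
        "description": (
            "Fix common errors in the user's uploaded dataset before analysis. "
            "Try these in order: "
            "1) strip currency symbols and thousands separators from numeric columns; "
            "2) coerce mixed-type columns to a single dtype; "
            "3) normalize column names (lowercase, underscores); "
            "4) parse date-like strings into datetimes; "
            "5) drop or flag fully empty rows and columns."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "columns": {"type": "array", "items": {"type": "string"},
                            "description": "Columns the error telemetry implicated"},
            },
            "required": ["columns"],
        },
    },
}
```

The design point is that the description is where the specialization lives: the same general-purpose model, handed this tool, behaves like a narrow data-cleaning agent.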

Anton

So let's focus on this data cleaning use case.

Rahul Sonwalkar

Yeah.

Anton

So we've identified that we need to run the data cleaning tools.

Rahul Sonwalkar

Yeah.

Anton

You've presented the model with the tool use API, basically the tool use fields that it then goes and populates, right?

Rahul Sonwalkar

Yeah.

Anton

But the model isn't cleaning the data itself. So is it specific, let's say a Python function that you know these are common errors in the dataset or what is it?

Rahul Sonwalkar

So an example of this could be you have a column called revenue.

Anton

Yes.

Rahul Sonwalkar

And the revenue has, sometimes it has a dollar sign, some of the numbers have dollar signs, some of the numbers are just numbers.

Anton

Yeah.

Rahul Sonwalkar

And this really confuses the model because you can give it like a [inaudible 00:24:01] dot head and it will give you top five rows. But when you have 1,000 rows and-

Anton

Yeah, that dollar sign could be on row number 700 and it's the only place that's in the entire table.

Rahul Sonwalkar

Exactly.

Anton

Yeah.

Rahul Sonwalkar

So you can give it instructions like check if the values in this column are consistent, and if they're not, find the values that are not consistent. Write code to find the outliers or values that are not consistent.
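Concretely, the code such an instruction might produce could look like this pandas sketch: find the values in the revenue column that don't parse as numbers, then normalize them. The column name and cleanup rules are illustrative, not Julius's generated code.

```python
import pandas as pd

# A sketch of an "inspect column" step: detect inconsistent values in a
# revenue column (e.g. a stray dollar sign on one row), then normalize.

df = pd.DataFrame({"revenue": ["1200", "980", "$1,450", "2100"]})

# Values that fail to parse as numbers are the inconsistent ones.
parsed = pd.to_numeric(df["revenue"], errors="coerce")
inconsistent = df.loc[parsed.isna(), "revenue"]
print(list(inconsistent))                            # → ['$1,450']

# Normalize: strip currency symbols and thousands separators, then convert.
df["revenue"] = (df["revenue"].str.replace(r"[$,]", "", regex=True)
                              .astype(float))
print(df["revenue"].sum())                           # → 5730.0
```

This is exactly the failure Anton mentions: `df.head()` would show five clean rows, while the one `$`-prefixed value hides further down and breaks the sum.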

Anton

So those functions are dynamically generated.

Rahul Sonwalkar

So we check what column the error happened in. Let's say the model tried to create a sum of revenue like what's my total revenue?

Anton

And then some error got spat out.

Rahul Sonwalkar

So the error log says this column had a name error or type error. In this case it's a type error. So we check that and we say, okay, the column is this, the type error is this, let's write dynamic instructions for the model to inspect the column. The function is called inspect column. You give it-

Anton

But this inspect column, this is something that you have for want of a better word, hard coded somewhere because it's so common, is that right?

Rahul Sonwalkar

So in some cases, yeah, [inaudible 00:25:13] is not hard coded, but there are other cases where the function is just purely hard coded in pure Python code.

Anton

Yeah.

Rahul Sonwalkar

In this case, we are giving them a little more freedom of hey-

Anton

How to implement. But what I mean is is this task of inspect column is so common that you know to inject that into the prompt already basically, right?

Rahul Sonwalkar

Yeah.

Anton

You're like, okay, this is an error we see 50,000 times. Ask the model to now specialize to a model that's only cleaning, only inspecting columns.

Rahul Sonwalkar

Exactly.

Anton

Right? Is that right?

Rahul Sonwalkar

Exactly. So the agent completely changed from doing this general purpose to doing this very specific thing, which is at this point in time, the only thing I care about is inspecting the column and fixing the type error in the column.

Anton

Right. Hence how we get to this multi-stage thing now. Right?

Rahul Sonwalkar

Exactly.

Anton

Yeah.

Rahul Sonwalkar

And you can sort of remove other tools from the context.

Anton

Yes.

Rahul Sonwalkar

And add more tools to the context that are really specific to what the model [inaudible 00:26:09]. And you're spot on about the 50,000 times. That's exactly what we do: we collect those patterns, like, oh, this error happened X thousand times this week. How do we fix that? An example of this was November 7th, when GPT-4 Turbo launched, we realized that our file import errors just skyrocketed. And it turns out OpenAI, in their training data, they collect a lot of data where the file paths have /mnt/data as the prefix. So all file paths are /mnt/data.

Anton

So it thinks all data has this /mnt in front of it.

Rahul Sonwalkar

Yeah.

Anton

Yeah.

Rahul Sonwalkar

And then we tried to fix that with prompt engineering. We tried to tell it in the prompt.

Anton

Ignore mnt, look at the actual path.

Rahul Sonwalkar

Yeah.

Anton

Yeah.

Rahul Sonwalkar

That sort of brought it down, but still-

Anton

Did you end up just stripping it out of a pure string substitution?

Rahul Sonwalkar

That's what we ended up doing.

Anton

Hell yeah.

Rahul Sonwalkar

After we get the function output, the model would just keep putting /mnt/data in, and we would detect that and just remove it from the string. It was funny, but-
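The fix they landed on is plain string substitution. A minimal sketch, where the real sandbox prefix (`/home/user/files` here) is an assumption for illustration:

```python
# A sketch of the string-substitution fix: rewrite the hallucinated
# training-data prefix (/mnt/data) to the actual sandbox path before
# running the model's generated code.

def fix_paths(code, real_prefix="/home/user/files"):
    # `real_prefix` is a hypothetical sandbox layout, not Julius's actual one.
    return code.replace("/mnt/data/", real_prefix + "/")

generated = 'df = pd.read_csv("/mnt/data/sales.csv")'
print(fix_paths(generated))   # → df = pd.read_csv("/home/user/files/sales.csv")
```

Inelegant, but it illustrates the broader point of the conversation: when prompting only reduces an error rate, a deterministic post-processing step can eliminate it.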

Anton

No, but this is the thing, right? Yes. Maybe there's a future in which the models are perfect and can figure out intent with a very high degree of precision.

Rahul Sonwalkar

Yeah.

Anton

But in the world that we live in today, we have to really compensate for the unpredictability and limitations. So let's talk about, before we get to, because there's so much fascinating stuff here, you've given an example of one of the more unusual complications with building with LLMs, which is like the model can drift under you. We're used to it in software, someone pushes a package somewhere, API compatibility is broken. It's like five points down the tree.

Rahul Sonwalkar

Yeah.

Anton

But here again, because the models are sort of flexible, it's like, did something even change? How do you look for that? Obviously you guys have got error telemetry, but...

Rahul Sonwalkar

Yeah, totally. That's been an ongoing thing that we are always trying to improve. And I don't think we will have a solution that would solve it for us for all future-

Anton

You just basically have to stay on top of it, right?

Rahul Sonwalkar

You just stay on top of it. There's a couple things you can do. Right now we're investing in evals. For the longest time, our evals were vibes where-

Anton

Real.

Rahul Sonwalkar

... for a really long time, our evals were all vibes. We realized that, as long as we are talking to a lot of users and we have this type of feedback loop with our power users-

Anton

You're on the phone with them.

Rahul Sonwalkar

Yeah. So one of the things is, if you pay for the pro plan on Julius, you get my phone number. And usually it's all nice things. People text me candid feedback like, "Hey, it would be great if we could have this."

Anton

Yeah.

Rahul Sonwalkar

But about once a month we go down for a few hours and then they call me at midnight. "I'm trying to finish this work and this is down!"

Anton

"I've got this assignment due!" Yeah.

Rahul Sonwalkar

Yeah.

Anton

I get some of those, but via email. Anyway, so yeah, you talk with the power users. You were talking about evals.

Rahul Sonwalkar

Yeah. That type of feedback loop got us pretty far. And I think one of the things about building AI applications is there's a lot of noise, a lot of people who will tell you, "You need this to build an application."

Anton

Yes.

Rahul Sonwalkar

"You need evals."

Anton

Yes.

Rahul Sonwalkar

"You need the best this and you need... and before all this, before you-

Anton

Which is, again, it's a classic software trap.

Rahul Sonwalkar

Yeah.

Anton

I'm always inspired by Pieter Levels, who's like, the guy just cranks out websites and web apps. Right?

Rahul Sonwalkar

Yeah.

Anton

People often ask him what frameworks and whatever he uses. He's like, "No, man, it's just raw PHP and jQuery. I built these multimillion-dollar businesses with raw PHP and jQuery."

Rahul Sonwalkar

Yeah.

Anton

It's literally like understanding the user is what's the important thing.

Rahul Sonwalkar

Yeah.

Anton

It's very inspirational. If you need inspiration about how to build software that people actually really care about, just go and look at some of the stuff that Pieter built.

Rahul Sonwalkar

Absolutely. He is a big inspiration. I think he's a little crazy for using PHP and jQuery, but I respect that.

Anton

Look man, it works, right?

Rahul Sonwalkar

I respect him for doing that.

Anton

Yeah.

Rahul Sonwalkar

None of us on the team are front end engineers, but we build front end and we use Workshell for that.

Anton

Yeah.

Rahul Sonwalkar

And there's-

Anton

There's tools out there that you could plug together and just use, right?

Rahul Sonwalkar

Exactly.

Anton

Don't overthink it.

Rahul Sonwalkar

Don't overthink it.

Anton

Yeah.

Rahul Sonwalkar

I think, in the case of Pieter Levels, it's more that he just knows PHP and jQuery. Why would he learn a new thing?

Anton

Exactly. And he's fluent in it which is again like, why would you learn another language? Why would you download how another person thought about this problem when it's not the problem you're trying to solve?

Rahul Sonwalkar

Yeah.

Anton

Anyway, back to evals. So now evals matter because you have enough... when did you decide to make that jump?

Rahul Sonwalkar

We decided to make that jump around the time when we started supporting a second programming language, which is R.

Anton

Yeah. Pretty recent, right?

Rahul Sonwalkar

Pretty recent, yes.

Anton

Cool.

Rahul Sonwalkar

So our users for the longest time emailed us and said, "This is awesome. This is so helpful. Could you support R?" And we didn't get it because none of us on the team knows R, and we just told them, "Guys, you can do the same thing in Python."

Anton

Python, yeah.

Rahul Sonwalkar

But turns out that-

Anton

They don't speak Python.

Rahul Sonwalkar

They don't speak Python, and a lot of their other work is in R or the people they work with use R.

Anton

Yeah.

Rahul Sonwalkar

So we launched R kernels on Julius. Now Julius can also write and execute R code. And when that happened we realized, oh, the dependencies we've got to manage are 2X. We've got to manage the R kernels and Python kernels.

Anton

Yes.

Rahul Sonwalkar

And now we have thousands of users using it every day. We could unintentionally break things and not know about it.

Anton

Yes.

Rahul Sonwalkar

And our testing matrix just doubled, because we have to have twice as many tools, we have to have-

Anton

Well, it's a combinatorial explosion all the time.

Rahul Sonwalkar

Yeah.

Anton

And given what you've said about building specialized tools to fix specific errors, for example importing Python modules, as soon as you introduce another language, that means more specialized tools.

Rahul Sonwalkar

Yeah.

Anton

Yeah.

Rahul Sonwalkar

Exactly. So that's when we decided to build evals. And it's an ongoing project.

Anton

Of course. It never ends.

Rahul Sonwalkar

Yeah. It's a thing you have to build up over time. You collect evals, and you figure out how to use them. But that's sort of how we do it reliably.

Anton

What are some of the evals that you run?

Rahul Sonwalkar

A pretty common one is just end-to-end integration, which is: a user uploads a data set and asks a simple question like, "Show me something interesting about this data." Can we get the user to a magic moment as soon as possible?

Anton

Right.

Rahul Sonwalkar

Which is the model writes code, runs code, shows you a cool visualization about your data.
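
As a sketch of the end-to-end eval just described, assuming a hypothetical `run_chat` harness hook (not Julius's real API), a check like this could assert that a fresh upload reaches a visualization, with no errors, within a small step budget:

```python
# Sketch of an end-to-end "wow moment" eval. `run_chat` is a
# hypothetical test-harness hook that returns the list of steps the
# agent took; it stands in for the real upload-and-chat flow.
def eval_wow_moment(run_chat, dataset_path, max_steps=3):
    steps = run_chat(dataset_path, "Show me something interesting about this data")
    produced_plot = any(s.get("artifact") == "plot" for s in steps)
    no_errors = all(not s.get("error") for s in steps)
    return {
        "passed": produced_plot and no_errors and len(steps) <= max_steps,
        "steps": len(steps),
    }

# Fake run for illustration: write code, execute it, render a chart.
fake_steps = [{"action": "write_code"}, {"action": "run_code"},
              {"action": "render", "artifact": "plot"}]
result = eval_wow_moment(lambda path, prompt: fake_steps, "sales.csv")
print(result)  # {'passed': True, 'steps': 3}
```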

Anton

Yes.

Rahul Sonwalkar

And for each new user who comes to Julius, and there are thousands of new people who come in every day, we really obsess over showing them that wow moment in the fewest steps possible.

Anton

Yep.

Rahul Sonwalkar

And so that's a pretty common thing that we want to make sure always works. Then there's some crazy things. For example, we have a user in the UK who does TensorFlow stuff on his phone, so he doesn't even use the Julius web app. He's started to do some model training on his phone, because he just loves that you can do it just with English. He's actually a data scientist. And because he's such a power user, we love supporting him.

Anton

Yeah.

Rahul Sonwalkar

So, do those dependencies break when we add a new package to the environment by default? And then there's the model. The models themselves change, and they have a knowledge cutoff date.

Anton

Yes.

Rahul Sonwalkar

So when you're using GPT-4-0314, a model that's a year and two months old, its knowledge cutoff is in 2022. And in 2022 there was a different version of Pandas or TensorFlow, and that's the version the model remembers.

Anton

Yes.

Rahul Sonwalkar

Now, you can do one of two things. One is you tell the model to use the latest version, which it doesn't know, or you restrict your environment [inaudible 00:34:38]. And that complexity also increases when you want to support multiple packages.
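
As a sketch of the second option, restricting the environment, one approach is to keep a table of package versions per model cutoff era and pin the sandbox accordingly. The model names and version numbers below are illustrative, not Julius's actual configuration:

```python
# Sketch: pin the sandbox environment to package versions the model
# "remembers" from its training data. The version table is made up
# for illustration.
MODEL_KNOWN_VERSIONS = {
    "gpt-4-0314": {"pandas": "1.4.2", "numpy": "1.22.3"},
    "gpt-4o": {"pandas": "2.1.4", "numpy": "1.26.2"},
}

def constraints_for(model: str) -> list[str]:
    """Build pip-style version pins matching the given model's cutoff era."""
    pins = MODEL_KNOWN_VERSIONS.get(model, {})
    return [f"{pkg}=={ver}" for pkg, ver in sorted(pins.items())]

print(constraints_for("gpt-4-0314"))  # ['numpy==1.22.3', 'pandas==1.4.2']
```

The output could be written to a pip constraints file so the sandbox only ever installs versions the model already knows how to use.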

Anton

You also have to online learn what the model knows about what packages are available at any given time, right?

Rahul Sonwalkar

Yeah.

Anton

Because it's strongly a function of not just, when did this package start to exist? But how common was it in the training data, right?

Rahul Sonwalkar

Yeah.

Anton

Yeah. What a pain in the ass.

Rahul Sonwalkar

Yeah, it is a real pain. And it became a real pain with R because R packages are pinned to the version of R.

Anton

Yes. It's a whole different package management metaphor.

Rahul Sonwalkar

Yeah.

Anton

Yeah.

Rahul Sonwalkar

So it's about making sure of that. We have a lot of evals around it.

Anton

Good.

Rahul Sonwalkar

Do our dependencies work? Are they compatible with the model? Because let's say you threw in GPT-4o, and its knowledge cutoff date is late 2023. Now we have to go back and recheck what packages-

Anton

What packages are available to the model?

Rahul Sonwalkar

Yeah.

Anton

Yeah. That's great as an eval. Do you guys have anything like unit testing for a specific specialized tool, or are you still going by vibes there?

Rahul Sonwalkar

We put in evals before we put in unit tests.

Anton

Okay.

Rahul Sonwalkar

Yeah.

Anton

No, but in some ways it makes sense, because the main determinant of whether or not your application is functioning correctly is actually the model itself.

Rahul Sonwalkar

Yeah.

Anton

It's the stability of the model. Speaking of which, there's two more questions I want to ask. The first is, these things are inherently non-deterministic. Temperature obviously determines how deterministic the model's text completion is going to be given the same prompt. Are you running at temperature zero? A lot of people are.

Rahul Sonwalkar

Great question. We're running super close to zero.

Anton

Close but not equal to. Why not?

Rahul Sonwalkar

We just, from vibes, we thought it did better. I think we run 0.1 or something.

Anton

Okay.

Rahul Sonwalkar

One thing we need to, and this is a good idea I just got, which is, with error correction, we might want to try a slightly higher temperature. Let's say you're in the fourth loop of trying to fix an error.

Anton

Yeah. You don't want the same output again. I've seen this many times.

Rahul Sonwalkar

Yeah.

Anton

A while ago I did my little theorem prover project, and it would routinely get stuck. And it would try to fix one error and then end up back at the first error.

Rahul Sonwalkar

Yeah.

Anton

Create a new error, try to fix that error, end up back at the first error, and just loop forever.

Rahul Sonwalkar

Yeah.

Anton

So that might be a good way to escape. Try raising the temperature.

Rahul Sonwalkar

Yeah, exactly. So raise the temperature, get it more creative. Yeah.
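
Sketched in code, the escalating-temperature retry idea might look like this. `run_attempt` is a hypothetical stand-in for a real model call plus code execution, not Julius's actual API:

```python
# Sketch: bump the sampling temperature on each error-correction retry
# so the model doesn't keep reproducing the exact same broken fix.
def fix_with_escalating_temperature(run_attempt, max_retries=4,
                                    base_temp=0.1, step=0.2):
    temps = []
    for attempt in range(max_retries):
        temp = round(base_temp + step * attempt, 2)  # 0.1, 0.3, 0.5, 0.7
        temps.append(temp)
        ok, result = run_attempt(temperature=temp)
        if ok:
            return result, temps
    return None, temps

# Fake runner for illustration: only "succeeds" once temperature > 0.4.
outcome, used = fix_with_escalating_temperature(
    lambda temperature: (temperature > 0.4, "fixed"))
print(outcome, used)  # fixed [0.1, 0.3, 0.5]
```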

Anton

So that actually brings me very naturally to my next question. We've talked about how we arrive at success, but obviously things fail pretty frequently. First, how do you detect that failure? We talked about evals, where you can obviously see a regression that's global, like for example with the file strings, but there are also local failures, where for whatever reason the model fails to perform the task. How do you detect that failure?

Rahul Sonwalkar

Great question. There's this whole thing about the deterministic parts of software and the non-deterministic parts. These models are inherently non-deterministic, but certain parts of the application are super deterministic, which is what unit tests are for.

Anton

Yes.

Rahul Sonwalkar

And one of the reasons we don't have unit tests is, usually when the deterministic parts break, we do get pinged about errors in the system. Like, "Oh, there's an exception thrown," or whatever.

Anton

Yes.

Rahul Sonwalkar

But when the non-deterministic parts fail, it's hard to get that signal.

Anton

Yes.

Rahul Sonwalkar

It's really hard, beyond realizing that your evals have gone down, or a lot of-

Anton

Users are complaining.

Rahul Sonwalkar

Users are complaining. So one of the things... and your question was, how do you detect that?

Anton

Yeah. I can imagine, for example, you've got some data set, like the users ask Julius to do something for them and it gets stuck in these error-correcting loops, right?

Rahul Sonwalkar

Yeah.

Anton

Among the tools that it has and the outputs that it's producing, it's not making progress, really. How do you detect it and what do you do about it?

Rahul Sonwalkar

So we make it super low-friction for the users to give us a pulse on the system, and that helps, because once you have users you can collect telemetry from them.

Anton

Yes. Yeah.

Rahul Sonwalkar

So one of the things is, after every third message from the AI, we ask the user to rate the performance as if you would rate an Uber app.

Anton

Is that a compulsory rating or is it optional-

Rahul Sonwalkar

Optional.

Anton

Okay.

Rahul Sonwalkar

Optional. Yeah. And it gives us insight: when things go wrong, we can see five stars drop and one stars shoot up.

Anton

Yeah.

Rahul Sonwalkar

We tried thumbs up and thumbs down and it didn't work as well. Users just didn't use thumbs up and thumbs down, but when we put the stars in-

Anton

What does three star mean, if someone gives you a three star for the outputs? Mediocre?

Rahul Sonwalkar

Three stars are the most rare.

Anton

Yeah, for sure. It's interesting about the distribution, right?

Rahul Sonwalkar

Yeah. People, by default, put five stars, and then when things go bad they put one star. And we want to make sure that, by the third message, we get as many five stars as we can compared to one stars, and that the ratio is as low as possible. So that helps us.
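
A minimal sketch of that monitoring signal could track the one-star to five-star ratio over a window of recent ratings and flag when it spikes. The threshold here is made up for illustration:

```python
# Sketch: compute the one-star to five-star ratio over a window of
# ratings and raise an alert when it crosses a threshold.
from collections import Counter

def rating_alert(ratings, max_one_to_five_ratio=0.05):
    counts = Counter(ratings)
    fives = counts.get(5, 0)
    ones = counts.get(1, 0)
    ratio = ones / fives if fives else float("inf")
    return {"ratio": round(ratio, 3), "alert": ratio > max_one_to_five_ratio}

healthy = rating_alert([5] * 98 + [1] * 2)   # mostly five stars
outage = rating_alert([5] * 60 + [1] * 40)   # one stars shooting up
print(healthy, outage)  # {'ratio': 0.02, 'alert': False} {'ratio': 0.667, 'alert': True}
```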

Anton

Yeah.

Rahul Sonwalkar

Another thing is... we aren't allowed to look at users' data. That's just a policy we have with our users: we don't look at your data.

Anton

It makes it harder to debug. It's also an issue for us. We're a data store, but we want to be able to help you without you having to ship us the thing you're having a problem with. And we're building tooling that's going to make that a lot easier pretty soon, but I'm curious to hear how you deal with the same problem.

Rahul Sonwalkar

Absolutely. So all the data you upload to Julius gets deleted after an hour of inactivity because we just literally destroy the code sandbox and nothing is retained besides the chat history. And you can also delete that anytime you want. But this is a policy we have with our users. So one of the things we do is we actually use smaller models that are cheaper to-

Anton

What are some of those, by the way? Just in that category.

Rahul Sonwalkar

Haiku is pretty good.

Anton

Claude Haiku?

Rahul Sonwalkar

Yeah. We used to use 3.5, which is also really good. Haiku, I think, is between 3.5 and 4 in capability, and it's cheap enough that we can do this for thousands of tasks a day.

Anton

Small models for specialized tasks seem to perform very well.

Rahul Sonwalkar

Yeah.

Anton

Yeah. So what were you using Haiku for? Sorry.

Rahul Sonwalkar

We use it for classifications. So we use-

Anton

"What type of error is in this chat?" For example.

Rahul Sonwalkar

Yeah.

Anton

Great.

Rahul Sonwalkar

Yeah. Or getting an understanding of the data. So yeah, when you upload the data: is it structured properly?
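
A sketch of that kind of small-model classification, with a hypothetical `complete` hook standing in for the actual Haiku call, might look like this:

```python
# Sketch: use a small, cheap model as a classifier. The prompt asks
# only for a label, so no user data needs to be inspected by a human.
# `complete` is a hypothetical model-call hook, not a real client.
ERROR_LABELS = ["import_error", "type_error", "kernel_crash", "other"]

def classify_error(complete, traceback_text: str) -> str:
    prompt = (
        "Classify this Python traceback into exactly one label from "
        f"{ERROR_LABELS}. Reply with the label only.\n\n{traceback_text}"
    )
    label = complete(prompt).strip()
    # Guard against the model replying with anything outside the label set.
    return label if label in ERROR_LABELS else "other"

# Fake small-model for illustration: spots the obvious keyword.
fake = lambda p: "import_error" if "ModuleNotFoundError" in p else "other"
print(classify_error(fake, "ModuleNotFoundError: No module named 'prophet'"))
# import_error
```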

Anton

Yeah. Just now we've made an almost imperceptible transition because I think you and I are so used to working with these tools as part of our natural workflows, but I really want to highlight this part. What you've just described is an AI-driven software engineering process, which Julius AI is already using.

Rahul Sonwalkar

Yeah.

Anton

You've adopted this so seamlessly.

Rahul Sonwalkar

Yeah.

Anton

I think that's super powerful. It's very, very powerful. And we've done some of that for Chroma. We looked at people's questions and answers, where they were struggling and where the product surface needed to be improved. But when you're working with the models on a day-to-day basis, it becomes almost second nature to do that.

Rahul Sonwalkar

Yeah, absolutely.

Anton

Yeah.

Rahul Sonwalkar

You're spot on. AI-driven software engineering.

Anton

Yeah. You're using AI in a software engineering process.

Rahul Sonwalkar

So true. There are so many features in Julius where the backend for the feature is literally a model call. We never really wrote the logic. An example of this is exporting: you can always export your data at any point in time. Whatever work the model has done on the dataset, you can click an export button and it will export that data. Now, there's two ways we could have built this. One is we could have built-

Anton

For every data format, write an exporter. Yeah.

Rahul Sonwalkar

Yeah.

Anton

Huge pain.

Rahul Sonwalkar

Huge pain. Or, what we actually do is put in the export button, which basically tells the model-

Anton

It's a tool use call, right?

Rahul Sonwalkar

It's a tool use call that tells the model, "Hey, write code and run that code to export the data that corresponds to this code block or this output."
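
As a sketch, the export button could map to an Anthropic-style tool definition like the one below. The names and schema are illustrative, not Julius's actual implementation:

```python
# Sketch: the Export button as a tool-use call. Clicking the button
# sends a tool invocation telling the model to write and run the
# export code itself, instead of hand-writing one exporter per format.
EXPORT_TOOL = {
    "name": "export_data",
    "description": ("Write and run code that exports the dataframe "
                    "produced by the referenced code block to a file."),
    "input_schema": {
        "type": "object",
        "properties": {
            "code_block_id": {"type": "string"},
            "format": {"type": "string", "enum": ["csv", "xlsx", "json"]},
        },
        "required": ["code_block_id", "format"],
    },
}

def export_request(code_block_id: str, fmt: str) -> dict:
    """Build the tool-use message the UI would send when Export is clicked."""
    assert fmt in EXPORT_TOOL["input_schema"]["properties"]["format"]["enum"]
    return {"tool": EXPORT_TOOL["name"],
            "input": {"code_block_id": code_block_id, "format": fmt}}

print(export_request("block_7", "csv"))
```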

Anton

Yeah.

Rahul Sonwalkar

And the model just does it, and it does it pretty well. We have a graph editor that's entirely GPT-4 based, where you can click "edit graph." It takes the graph generated by the model, takes the code that generated the graph, and then comes up with a JSON of parameters that you can toggle in the UI, like make it wider, make it smaller, and all of that just pipes back into code.

Anton

Yeah.

Rahul Sonwalkar

And-

Anton

And it's all dynamically generated UI, right?

Rahul Sonwalkar

Yeah. Yeah.

Anton

It's just super cool. And again, these are little innovations that you land on by having actual users, and to get actual users you just build something useful, right?

Rahul Sonwalkar

Totally. Totally. I mean, you've mentioned the whole "GPT wrapper" thing.

Anton

It's mind-killing. It's just so mind-killing. It removes all creativity immediately.

Rahul Sonwalkar

You could be building so many cool things if you just got out of your own head a bit.

Anton

Yeah, maybe the model will do it someday, but I promise you're going to learn a lot about the users.

Rahul Sonwalkar

Yeah.

Anton

Yeah. Obviously GPT-4 does the stuff that you're saying it can do today, right? It can generate dynamic UI against data.

Rahul Sonwalkar

Yeah.

Anton

But you've packaged this in a way where it's actually useful to the user and you've learned that users actually want this in the first place.

Rahul Sonwalkar

Yeah.

Anton

Right? That's important. Let me ask you two last questions and then we can finish up. The first is what is the most surprising thing that you've encountered in building with AI models? Building software with-

Rahul Sonwalkar

Totally. I think there's a few different things I found really surprising. For context, I was a software engineer before building Julius. One of the things is, of course these models are non-deterministic, but that's sort of an advantage, because you can run the same prompt on the same model and get a different output each time. And you can leverage that for something really, really powerful, which is you can give it new tools and get it to fix errors.

Anton

And dynamically generate those tools even.

Rahul Sonwalkar

Dynamically generate those tools, even. I think the other thing, which I couldn't have predicted when I first started building products in AI, was that I was trying too hard to take the concepts and ideas I had from before and apply them, instead of thinking in an AI-native way.

Anton

Right.

Rahul Sonwalkar

So an example of this is, let's say, Cursor.

Anton

Right.

Rahul Sonwalkar

Right? So I use Cursor's AI IDE, and what they're doing is thinking from first principles: if you had an IDE that was AI native, what would that look like?

Anton

Yeah, what would it do? Yeah.

Rahul Sonwalkar

Another example of this is I don't think anyone could have predicted a tool like Midjourney.

Anton

No, of course not. No.

Rahul Sonwalkar

Yeah. It's an interesting phenomenon where millions of people want to type a prompt into Discord and watch an image being generated. And-

Anton

I thought for a long time, "Why is Midjourney so successful?" There are plenty of ways for people to get pictures, right? There have been for a long time.

Rahul Sonwalkar

Yeah.

Anton

I think the most overlooked piece of UI for AI in general is how iterative it is, how you iterate your way to the end of the problem. For example, if you're writing software the traditional way, it's very difficult to write it iteratively, because you have to have a complete program before you even get the first output, before you can even start iterating on it.

Rahul Sonwalkar

Yeah.

Anton

With AI, you can have partially something that has the right vibe, and then you can start narrowing it down and narrowing it down in a really iterative way. I think that's what's unique about AI.

Rahul Sonwalkar

Absolutely.

Rahul Sonwalkar

Yeah. Yeah, I agree. And there are just so many interesting ideas out there to build. I mean, who could have predicted these tools, right? Who could have predicted Julius? The best way is you-

Anton

Experiment.

Rahul Sonwalkar

Experiment. You start-

Anton

Play with it.

Rahul Sonwalkar

Play with it. And these models are really, really fun to play with once you have the intuition for how to play with them. Julius wasn't the first idea we worked on. It was probably the fifth or sixth idea. And the first five didn't work out, and that's okay.

Anton

But you learned something from them as well, right? You started to develop a sense, a feeling, of how the models work, what they can and can't do, and how you should work with them. I've definitely had that sense too. I think it's really important. Even though Chroma isn't really a model company and we don't build AI applications, we do support the builders of AI applications, and I play with the model every day. I try to make it do something unusual while I have some long compute running, and it's fun. One of my favorites is to try to gaslight GPT into believing things that aren't true.

Rahul Sonwalkar

A little bit.

Anton

I mean, look, for example, I have this chat somewhere where I try to convince it that there was a nuclear war in 1812 and it's being deliberately held away from its training data and I'm asking it, "Why would someone lie to you like that?" And you're watching it struggle, which is a little mean, but I want to see how it thinks. And the other thing is the Golden Gate Claude was very funny.

Rahul Sonwalkar

Yeah.

Anton

I don't know if you gave it a shot.

Rahul Sonwalkar

I loved it.

Anton

Yeah.

Rahul Sonwalkar

Yeah. I would-

Anton

Just fun. This is fun. It should be fun. That's got to be the core of all of this.

Rahul Sonwalkar

Yeah.

Anton

Let me ask you the last question, which is: we've been describing Julius, and it's not mature, it's very much still being built, there's a lot to do. But the core of it is very, very useful. People love it. It didn't spring forth fully formed, though, right? So what do you wish you knew, or what pain did you run into that you could have easily avoided if you'd known about it in advance? Is there anything like that in the story of Julius?

Rahul Sonwalkar

That I should have avoided?

Anton

Yeah. Now you've got the experience, you've learned the lesson. But if you had already known it, if you could give advice to your past self when you started building Julius, what would it be?

Rahul Sonwalkar

There's two things. One of them is: remove the complexity in getting a new user to that awesome wow moment.

Anton

The wow moment? Yeah.

Rahul Sonwalkar

Yeah. So when a user starts using Julius, we used to put a lot of things in the way that we thought users would want. One of the things was, we thought the more things we added to the UI, the more it would educate the user on how to use the tool.

Anton

Yeah.

Rahul Sonwalkar

And we tried all these ideas, and the users would just go straight to the input box. They just wanted to talk to the AI.

Anton

Nobody ever reads. That's definitely true.

Rahul Sonwalkar

Nobody ever reads. And soon we realized that all this stuff was getting in the way of the user. We had to make the first experience really, really simple, and then we can still add those things into the chat as generative UI features.

Anton

Right.

Rahul Sonwalkar

And sort of-

Anton

But that requires knowing the model, requires that you as the developer of Julius have the confidence that the model can perform this task.

Rahul Sonwalkar

Yeah.

Anton

Yeah.

Rahul Sonwalkar

Yeah. And then selectively surface that to the user when you think it's appropriate for the circumstances [inaudible 00:50:03].

Anton

See, I think that's actually also very underrated in AI development right now: the degree to which you can capture user intent to empower users and bring them up that ladder of confidence with the tool. Right? This is a theme that keeps coming up when I speak to people building in AI. There's always this ladder of trust and complexity that you have to build with a user, because they're not experts in the model. They don't know what the model can do. And your job is to produce a product that can perform a useful task for that user, not to just dump everything on them all at once. And AI can be really, really good at that, because the application knows where the user's at, and the model kind of knows what can be done. So connecting ability to intent at the right time is something really powerful and underexplored right now, I think.

Rahul Sonwalkar

Yeah, absolutely, I totally agree. And the second thing, which I think we spent a lot of time on and could have simplified: we thought, we're taking a language model, giving it the shell of a computer, and letting it write all kinds of code, and that's awesome, the possibilities of this are endless. It can do so many tasks for you, like go write scripts and execute them. An example is a user of Julius who runs an e-commerce business. He's not a programmer, but he's able to use Julius to find the reviews on his competitors' websites, do some competitive intelligence, collect data, and get those insights using things like BeautifulSoup, et cetera. And we thought, "Whoa, this is awesome. This can do anything possible." That's not the case. We spent some time believing that, when we could have focused primarily on: what problems are people looking to solve? How can we build a tool to solve that problem? And how do we communicate that to the user?

Anton

Classic. I mean, that's classic software development. This is the thing. The principles translate despite the fact that this is a fundamentally new technology, right?

Rahul Sonwalkar

Yeah.

Anton

The principles under which we actually build useful things end up translating, get to users, iterate quickly, understand the problems, see what they're struggling with, and make that part better, right?

Rahul Sonwalkar

Yeah.

Anton

Yeah, ultimately, the message really is just build something that you think might be useful and try it out.

Rahul Sonwalkar

Yeah.

Anton

Or fun. That's the other thing. AI is super playful, and I think we're coming off a little bit of a SaaS hangover right now with software, because the late 2010s were all these SaaS companies selling sales automation stuff. Fine businesses, but there was very little play in building something like that. AI is just so much more playful. It reminds me very much of the early web, when people would put up random websites about whatever and immediately be able to show the world. You can just get the models to do whatever you want, and anyone has access to them in the same way, so it's just much more playful.

Rahul Sonwalkar

100%. It's just so much more playful. I think, given where the models are today, we've probably only built 1% of the applications that can be built with them.

Anton

I completely agree with you. I completely agree. What I actually think is the speed limit right now probably isn't model capability. Even what you've shown us with Julius is already fairly science-fictional from the vantage of five years ago: these general-purpose generative UIs just popping up for serious data science. So I think the model capabilities are no longer the bottleneck. The bottleneck is human imagination and humans trying stuff.

Rahul Sonwalkar

Yeah.

Anton

Right? So we really want to encourage that. People should do it.

Rahul Sonwalkar

You want more ideas, and you want more people playing with those ideas.

Anton

Yeah. People should build.

Rahul Sonwalkar

Yeah.

Anton

All right, man. Thanks very much. That was great. Cheers.

Rahul Sonwalkar

Thanks for having me, Anton. Yeah.

Anton

Of course.

Rahul Sonwalkar

Yeah.

2024 Chroma. All rights reserved