Rahul on X (Twitter): @0interestrates
Anton
What you've just described is an AI driven software engineering process, which Julius AI is already using. You've adopted this so seamlessly. I think that's super powerful.
Rahul Sonwalkar
The model writes code, runs code, shows you a cool visualization about your data. And you want each new user who comes to Julius and there's thousands of new people who come in every day. We really obsess over showing them that wow moment in the fewest steps possible.
Anton
Again, the point of this series is to basically talk about the practical things about building with AI, right?
Rahul Sonwalkar
Yeah.
Anton
AI applications, AI software development, it's a totally new field. Even though it feels like it's moving super fast, it's only about 18 months old. The people who started building with AI only really started building in November '22, and today we're in May '24. It's just not that long. We're all figuring it out. A lot of people building come from a software background, and I think it's valuable to talk about that. But before we get started, why don't you introduce yourself and Julius?
Rahul Sonwalkar
Totally. I'm Rahul. I'm the founder and CEO of Julius. Julius is an AI data scientist that helps you analyze data sets, get insight from your data and create good-looking visualizations. We launched about 10 months ago, and since then have crossed half a million users, and as of this week, Julius writes and executes over a million lines of code every 48 hours.
Anton
That's incredible. I love that metric about the model writing code, and most people who haven't really played with this technology aren't even aware that the models can write code, right?
Rahul Sonwalkar
Yeah.
Anton
Maybe that's a good place to start. How do you get the LLMs to write code? What do you do? How do you make that happen?
Rahul Sonwalkar
Totally. In short, it's primarily prompting. We use models that are available publicly in most cases, and we realized at some point that these models were really good at writing code because they were helping us, as engineers, write code. And we realized that code in itself is a really powerful tool that needs to be democratized. There's a lot of people out there who could use the power of code, but just don't know how to write it or how to run it, so Julius helps them do that. With Julius, you can just talk to it: we take the English commands, turn them into code, run that code, and let the AI decide whether that output is sufficient or not.
Anton
You get that feedback loop built in, almost, by writing code as opposed to language. You execute the code. Either it did what you wanted or it threw an error or something, right?
Rahul Sonwalkar
Absolutely. Let's say you ask an AI to write an essay. You don't have any feedback loop; it will just output the essay or an article. But with code, the beauty is that you can break a complex task down into chunks and progressively write some code, then execute it, see the output, and decide whether this is the direction you want to proceed in to tackle the task, or whether you need to try a new approach. Kind of like a human engineer would.
Anton
It's interesting. I always hesitate to apply a human or anthropomorphic theory of mind to these models, but it is helpful as a shortcut a lot of the time, right?
Rahul Sonwalkar
It is helpful as a way to build the right products.
Anton
Because thinking it in that way at least helps you think about how would a human use this, right?
Rahul Sonwalkar
Yeah.
Anton
Let's drill down a little bit more into getting models to generate code. What does that look like, literally calling the model? How does it work? Do you just send an English instruction, like generate the code in Python?
Rahul Sonwalkar
Great question. One of the good things about these models is that they were meant to be general purpose models, and this makes them super versatile. You can use a model to write articles and do creative writing, but you can also get it to write code. That's also challenging when you only want it to write code, and you want it to focus on writing code and debugging code for a particular use case.
Anton
Because the model typically returns text. We've all had the experience with ChatGPT, where it's returning way too much commentary and all you want is the answer, right?
Rahul Sonwalkar
Yeah, absolutely. What we do is we heavily use tool use. Tool use is you're giving the model a prompt and you're saying, "Hey, this is who you are, this is the kind of tasks you're going to do, and here are a bunch of tools that will help you achieve the task."
Anton
And that's a specific API. Is it a field...? Because typically with prompting, you have the role field on each message. So you have the system prompt like you're describing, which is this is who you are and what you do. Then you have a user role, which is input coming from the outside that a user might be putting in. Does tool use live in the system role? Where do you put these tools?
Rahul Sonwalkar
Great question. Today, as of the recording of this video, both OpenAI's GPT-4 and Anthropic's Claude support tool use in their API. There's a field called role, which is system, user or assistant. Then there's content, which is the actual text you're passing to the model. And then there's a new field, either functions or tools, where you're describing in simple English what the tool is supposed to do. Your tool could actually be anything. You could describe in English whatever you want the tool to be, and then you also describe what the output or the input for the tool should be.
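To make the field layout concrete, here is a minimal sketch of such a request using the OpenAI Python SDK. The system and user messages and the single example tool are illustrative, not Julius's actual prompt or tool set.

```python
# Minimal sketch of the fields described above, using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any tool-use-capable model
    messages=[
        # "role" is system, user, or assistant; "content" is the text passed to the model
        {"role": "system", "content": "You are a data analyst. Use the tools provided."},
        {"role": "user", "content": "What's the total revenue in this dataset?"},
    ],
    # the newer field: a list of tools described in plain English plus a parameter schema
    tools=[
        {
            "type": "function",
            "function": {
                "name": "run_python_code",
                "description": "Execute Python code in the user's sandbox and return the output.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Python source to run"}
                    },
                    "required": ["code"],
                },
            },
        }
    ],
)
```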
Anton
It's an API for that tool, almost, right? The arguments to a function, hence function calling.
Rahul Sonwalkar
Exactly. The arguments to a function, hence function calling. Let's say I'm making an agent to schedule meetings, and I can say, "Hey, you are an agent. Your name is Anton. You schedule meetings. Your input will be the email, which has a bunch of times, and the tools available to you are check calendar, create invite, and send email." And then you can take that and give it to the model, and the model can parse your email and then decide, okay, it looks like the participants in this meeting have these availabilities. Let me check the calendar for the person who called me, and I'll call the check calendar tool. Now, you have to think about what are the parameters I would want the AI to give me so I can successfully pipe this output into my Google Calendar API. Date and time could be your inputs to check calendar and get the availability for the slots. And then the next would be a different tool called create invite or send invite, where you can define the participants, what the time slot is, what the email of each participant is, and what the event name and event description are. All of these could be parameters that you're telling the AI you need in order to use this tool. This is how tool use works.
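A rough sketch of how the scheduling agent's tools could be declared as JSON schemas. The tool names, parameters, and descriptions here are invented for illustration; whatever shape the downstream calendar API needs would go in the parameters object.

```python
# Hypothetical tool declarations for the meeting-scheduling agent described above.
calendar_tools = [
    {
        "type": "function",
        "function": {
            "name": "check_calendar",
            "description": "Check the caller's calendar for free slots on the given days.",
            "parameters": {
                "type": "object",
                "properties": {
                    "dates": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Days to check, e.g. 2024-05-21",
                    },
                },
                "required": ["dates"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_invite",
            "description": "Create a calendar invite and email it to all participants.",
            "parameters": {
                "type": "object",
                "properties": {
                    "participants": {"type": "array", "items": {"type": "string"}},
                    "start_time": {"type": "string"},
                    "event_name": {"type": "string"},
                    "event_description": {"type": "string"},
                },
                "required": ["participants", "start_time", "event_name"],
            },
        },
    },
]
```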
Anton
The model basically is capable of figuring out how to use the tools in context, and then what does it emit when it decides to use a tool? Because again, we're used to the model emitting English text, but now it sounds like you want it to emit a function call. Is there a specific field on the output? Do you have to find it in the text the model emits?
Rahul Sonwalkar
One quick caveat there is off the top of my head, both Anthropic and OpenAI support function calling natively in their API, but there are ways you can prompt engineer this into other models as well.
Anton
Sure.
Rahul Sonwalkar
For the longest time, the way we did it for Anthropic was to just put a bunch of functions into the prompt and tell the model, "Hey, here are the tools you have and you can use these tools." And then put very specific instructions on what text to output when it's using a tool, and then use some sort of regex to parse that in the model's output as an indication that the model is trying to use a certain tool.
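A minimal sketch of that pre-native-tool-use pattern: the tool instructions live in the prompt itself, and a regex picks the tool call out of the model's free-form output. The tag format shown is made up; it isn't Julius's actual marker syntax.

```python
import re

# Tool instructions embedded directly in the prompt; the exact wording and format are illustrative.
TOOL_PROMPT = """You have one tool available: run_python_code.
When you want to run code, reply with exactly:
<tool>run_python_code</tool>
<input>
...your Python code...
</input>
Otherwise, reply in plain English."""

def extract_tool_call(model_output: str):
    """Regex over the model's free-form text to detect an attempted tool call."""
    match = re.search(r"<tool>(\w+)</tool>\s*<input>(.*?)</input>", model_output, re.DOTALL)
    if match is None:
        return None  # the model answered in prose, no tool call
    return {"tool": match.group(1), "input": match.group(2).strip()}
```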
Anton
Of course, then you're stuck with the problem of running a regex over unstructured human language, which is... One of the biggest problems that we have is that the flexibility of the models is a blessing and a curse. It's very difficult to actually definitively constrain them.
Rahul Sonwalkar
True.
Anton
Let's talk a little bit about what happens. Let's go back to our calendar app example, and then we'll unroll this all the way back to Julius. You've extracted that the model says here, make a function call. Now we're exiting the model execution environment and we're back in the, let's call it traditional software execution environment. What happens next?
Rahul Sonwalkar
I CC my agent on my email and say, "Hey Anton, I want you to schedule a meeting." The way I would build this application is I would pass the entire text of the email thread into the model whenever it's CC'd, and I would present to it a few different tools: check calendar, send invite, send email.
Anton
Suppose it calls that check calendar tool in the way that we described, but it's not actually... It doesn't have access to the tool itself. You have to call that for the model. How does that happen, and what happens next?
Rahul Sonwalkar
Great question. We want this model's output to trigger a chain of deterministic actions that we can do, and use the model as a decision maker to choose the action.
Anton
But how do you literally... Behind that check calendar, there's got to be some actual traditional functions, some code?
Rahul Sonwalkar
Yes. The way you would do that is when the model streams the output back, you look for the function being called. If you're actually just Regexing, you look for the Regex, or in other cases, when-
Anton
When it's native.
Rahul Sonwalkar
... it's native, the role will be function or the role will be assistant, but then there's a new field type, which is function.
Anton
Oh, because it could be type content or type function.
Rahul Sonwalkar
Yeah.
Anton
Got you. Makes sense.
Rahul Sonwalkar
This is off the top of my head. And then in the content, it would actually... When it's calling a function, it will give you the arguments that you have told it.
Anton
And then you parse the args and then it calls the function.
Rahul Sonwalkar
And then you parse those args and pass them into a deterministic code function. Let's say you have an integration with Google Calendar, and you would say, I need these and these arguments as required fields. And then you can also specify optional fields. In this case, the required would be, sure, you want to check the calendar, but what day do you want to check the calendar for? What are the days?
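Continuing the calendar sketch from above, roughly what that round trip looks like with the native API: detect the tool call on the response, parse the JSON arguments, and hand them to deterministic code. The check_calendar stub is a stand-in for a real Google Calendar integration.

```python
import json
from openai import OpenAI

client = OpenAI()

def check_calendar(dates):
    """Deterministic code: in production this would hit the real Google Calendar API."""
    return {date: ["10:00", "14:30"] for date in dates}  # stubbed for illustration

# calendar_tools is the tool list from the sketch above (check_calendar, send_invite)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Find a slot for me, Anton, and Rahul next Tuesday."}],
    tools=calendar_tools,
)
message = response.choices[0].message

if message.tool_calls:  # the model chose a tool instead of answering in prose
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    if call.function.name == "check_calendar":
        result = check_calendar(args["dates"])
```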
Anton
And the model's populating those arguments, and you send it along and you get some response from, let's say, the Google Calendar API, and it's up to you to process that. Unrolling the stack, going back to Julius a little bit, obviously Julius isn't about just piping function calls through a UI. There's a lot more to it. And you've already talked about how there's multi-step reasoning built in, because you want to support people who don't really want to understand how to do the function calling themselves. Let's unroll that a little bit. We get output from a function that we call deterministically based on arguments that the model is passing us, saying, "Hey, I want you to call..." Basically what the model is saying is, I want you to call this for me. You go out and you call it, maybe you process it some more. But in this multi-step environment, like with Julius where you're asking it to analyze data, we have to also provide feedback about the result of that function call to the model. And I imagine chaining those gets complicated, but what do you present back to the model after it's executed that? Do you tell it, okay, that succeeded, that failed? What do you do?
Rahul Sonwalkar
When this is natively supported in the API, your life becomes a lot easier because the native APIs, they support-
Anton
I hope Google heard that.
Rahul Sonwalkar
... the function. They support function output as a new field in the API, so you can present the output of the function back to the model. In the case of calendar-
Anton
As a user role prompt? Or is function output the field?
Rahul Sonwalkar
I think the role in that case is function.
Anton
Oh, interesting. You can put... Oh, okay, I got to try that. I haven't tried it yet. Because I still live in that world where you're Regexing the outputs. I haven't tried the full fat APIs yet.
Rahul Sonwalkar
If you're Regexing it, you would probably say in a user prompt, you would say-
Anton
This was the output.
Rahul Sonwalkar
This was the output.
Anton
Got you.
Rahul Sonwalkar
I ran that code. This was the output. What should I do next?
Anton
Okay, makes sense. And then the model-
Rahul Sonwalkar
You'd run it in a loop. And my hope in the calendar example is after checking the availability, the model will probably decide to call, send invite.
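Put together, the loop might look like the following sketch. Note that current OpenAI SDKs use a "tool" role for returning function output, while older API versions used a "function" role, which is closer to what's described here. The dispatch callback is a placeholder for whatever deterministic code backs each tool.

```python
import json

def run_agent(client, messages, tools, dispatch, max_turns=10):
    """Loop: call the model, execute any tool it asks for, feed the output back, repeat."""
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, tools=tools
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # plain answer: the task is done
        messages.append(message)   # keep the assistant's tool-call turn in context
        for call in message.tool_calls:
            result = dispatch(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",                 # "function" in older API versions
                "tool_call_id": call.id,
                "content": json.dumps(result),  # "I ran that, this was the output"
            })
    return None  # gave up after max_turns
```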
Anton
Got you. Let's talk about Julius, because we've talked about a very specific task-performing model, but Julius is general purpose. Julius is designed for... Mostly right now people are using it for data analytics, if I'm right, data science?
Rahul Sonwalkar
Yeah.
Anton
Obviously, data comes in all different shapes and sizes, it comes in different sources, and the tasks you want to perform with it are super, super dynamic. We've talked about these three tools. What are the tools that Julius presents to the model?
Rahul Sonwalkar
Great question. When we first started, it was one tool, which was run Python code. Because Julius was built over a period of two or three weeks, it was very bare bones when we launched. We didn't have the time to do any design, even in the UI. For all the icons, we would just use emojis. It was-
Anton
That's the trend now, right?
Rahul Sonwalkar
Yeah. But when we launched, it was-
Anton
Right?
Rahul Sonwalkar
Yeah.
Anton
Yeah.
Rahul Sonwalkar
But when we launched, it was just one or two function calls. But what we realized as we-
Anton
Wait, which function calls did you have at launch?
Rahul Sonwalkar
It was, I believe, off the top of my head, it was read file and the other was run Python code, which was kind of like a catch all-
Anton
Okay. So in the response to that run Python code tool, the model would put Python code in there for you to run in your interpreter?
Rahul Sonwalkar
Exactly.
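A minimal sketch of what a catch-all run Python code tool can look like on the execution side. Julius runs this inside an isolated per-user sandbox; the local subprocess below is only a stand-in for illustration.

```python
import subprocess
import sys

def run_python_code(code: str) -> dict:
    """Catch-all tool: execute the model's Python and return whatever came out.
    The real system runs this in an isolated per-user sandbox; a local subprocess
    is used here purely for illustration."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=60,
    )
    return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}
```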
Anton
Okay. That's primitive. That's early.
Rahul Sonwalkar
That's early. And [inaudible 00:14:29] of the effort back then was spent towards setting up the infrastructure for the code interpretation. So for context, each user on Julius gets their own sandbox. You can upload your files to the sandbox. It has some CPUs, some memory, and the model can freely run code in the sandbox, iterate in a loop, and do the task you give it. So the bulk of the effort was spent towards making those secure, getting those up and running, and then building the interface. I think that's one of the things about building applications: you probably want to innovate on things that will help you build a better product right off the bat, and then get users and iterate.
Anton
Yeah, that's what I always say for people building with AI today. And you and I have talked about this many times, and I mentioned it also previously when I spoke with Flo, but sometime in early '23 people started using this phrase GPT wrapper.
Rahul Sonwalkar
Yeah.
Anton
And I think that's such a mind killing phrase. It immediately kills thinking and creativity about what we actually can do.
Rahul Sonwalkar
Yeah.
Anton
Right? And I think people still don't understand this: what people want out of software, or anything that's running on their computer, is for it to perform a defined task.
Rahul Sonwalkar
Yeah.
Anton
That's what they care about, right? And they don't want to think about how to wrangle this general purpose system into performing this specified task for them, right? Which is essentially what Julius did: it smoothed that process for data science for a lot of people. Right? So I really want to highlight at this point how much contact with the real-world user actually matters when building with AI, right? You have to figure out what really matters, what they care about. And there were surprises for you along the way as well, right? Julius has really taken off, if I remember, in the education context right now too, which is super, super cool.
Rahul Sonwalkar
Absolutely. So when we first launched, we thought the users for this would be data scientists and data analysts. Turns out most of the users are people who have data on their hands and have a lot of curiosity about that data, but lack the expertise to do the data science or data analysis on it. And it turns out that market is just crazy.
Anton
It's so empowering, right?
Rahul Sonwalkar
Yeah.
Anton
AI is empowering because it puts all of these things within reach. So despite the fact that I'm building a software company, I hate using computers. I don't like the fact that the problem I'm actually trying to solve forces me to go through this bottleneck of writing code. So now I have to learn how to write code and use the computer appropriately to complete the task that I actually want to complete.
Rahul Sonwalkar
Yeah.
Anton
So I can imagine being a scientist and being very frustrated by this because you have this rich complex world that you understand very deeply and you're only able to express it. It's like a foreign language. It's like when you're learning a language at the start, you can only express yourself very poorly in it. So imagine you have these great ideas in science, but your only language is code, which you don't understand very well.
Rahul Sonwalkar
Yeah.
Anton
Julius gets that out of the way, which is super, super cool.
Rahul Sonwalkar
Thank you. But absolutely. So to your point about GPT wrappers, it really depends on who's asking. I feel like some people ask from a place of good faith, and they're just genuinely curious, like, does this use GPT, or what model does it use? And for them I have a really honest answer: hey, this is what we do and this is how we built it. And I actually want more people to build really cool AI applications.
Anton
I'm the same with you. That's why we're talking.
Rahul Sonwalkar
Yeah.
Anton
Julius should be an inspiration to many people.
Rahul Sonwalkar
Yeah. There's so many people who don't get started because they want to build something specialized and really complex.
Anton
Yes.
Rahul Sonwalkar
So there's this whole thing about-
Anton
Classic trap.
Rahul Sonwalkar
It's a classic trap. It's a nerd trap, where I want to look cool to my colleagues or my friends. This is a running joke I have with Tristan, who is the founder of Readwise. He's in Toronto and they do a lot of stuff with AI, and I told him, "Dude, one of the lowest status things you can do in San Francisco today is build a GPT wrapper." Because you don't want to go to a party and say, I'm building an AI application, and they ask you, so are you training your own model? And you say, no, we are actually using the best models for what they're good at and building a product with them.
Anton
Yeah. Low status stays winning.
Rahul Sonwalkar
Low status stays winning.
Anton
Low status stays winning.
Rahul Sonwalkar
Yeah. Yeah. Ultimately, our users don't care what model we use.
Anton
That's right.
Rahul Sonwalkar
Our users really care about what problem we solve for them.
Anton
Yeah.
Rahul Sonwalkar
And if training our own model is the best way to solve that problem, we'll do it. If using somebody else's model is the best way to solve the problem, we'll do that.
Anton
Yeah.
Rahul Sonwalkar
But ultimately we are building something that people want.
Anton
It's honestly kind of the same for Chroma. I mean, when we launched, people were like, "What's difficult about [inaudible 00:19:28] nearest neighbors search in vector space?"
Rahul Sonwalkar
Yeah.
Anton
It's like, that's not our job. Vector search is the way that we perform our job today. And scalable vector search is something that's necessary to perform our job. But what Chroma is really for is getting data to the models so that you can build applications with data that the models can then process for you. That's what we do. And if there was a different way of doing it, I would just do it that way.
Rahul Sonwalkar
Yeah.
Anton
So it's similar. It's what does the user actually care about in building your application? I want to go back to a little bit more of the technical details because we got started, we've talked about this one iteration of this loop, right?
Rahul Sonwalkar
Yeah.
Anton
And we've talked about the happy path where our agent manages to perform the task and gets the feedback that, hey, you're making progress.
Rahul Sonwalkar
Yeah.
Anton
You mentioned at the start of our conversation here that Julius does this multi-stage.
Rahul Sonwalkar
Yeah.
Anton
So let's first talk about that. Obviously there's the very rigid way of doing things, which is like first perform this and then if you succeed, perform that.
Rahul Sonwalkar
Yeah.
Anton
But that's not what Julius does, right?
Rahul Sonwalkar
Yeah, absolutely. So there's some degree of freedom we give to the model. And one of the things, one of the little secrets about building Julius is we figured out pretty early on that you can actually hack the shit out of... Can I cuss on this podcast?
Anton
You can cuss on this, you can do whatever you want.
Rahul Sonwalkar
We figured out pretty early on that we could hack the shit out of two [inaudible 00:20:50]. And I'll share a couple examples of how we do that. But these models are really good at... they're becoming really good at picking tools to use.
Anton
Yes.
Rahul Sonwalkar
So the more specialized tools you make, the better they become at performing tasks.
Anton
That's very interesting because that's a long way away from giving it a Python interpreter, right?
Rahul Sonwalkar
Exactly. When you give it a Python interpreter, sure, it'll give you Python code. But when you give it a tool that says fix this specific kind of error in the Python code, it suddenly becomes a lot better at fixing that kind of error.
Anton
It's like getting the model to focus even though it's general purpose.
Rahul Sonwalkar
Exactly. Get the model to focus even though it's general purpose. And we only came to this realization after we got a lot of users and we noticed patterns in which the model would succeed, and patterns in which the model would fail. And you realize, okay, these are the common patterns and these are the common errors that happen in the code when you take the code-
Anton
What are some of those? Would you share some of the concrete ones?
Rahul Sonwalkar
Yeah, a very simple one is module not found. You give it a Python interpreter and you tell it, hey, the user's given you this task, run code to do the task. It tries to import some modules, runs the code, and realizes, oh, this module doesn't exist in the environment. Now there's a couple of different things that can happen. One is you can simply install the module, but you want to make sure that the version of the module is compatible with the other packages in the dependencies and there are no conflict issues. Another example of this is fixing name-not-found errors. What we realized is that when users upload datasets, the datasets are really messy. So sometimes the model has a very limited understanding of what the user's dataset is. It tries to write code and run the code and it runs into a bunch of errors. And when we realize through some telemetry that, oh, this error is related to the dataset, we give it functions like clean dataset. So it's not just the name of the function. The function has a field called description where you can give it really, really specific descriptions and instructions, like, hey, try these five things, because that's where most errors tend to be.
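As a sketch of that idea, a specialized tool is mostly an ordinary tool schema whose description field carries very specific, hard-won instructions. The wording and steps below are invented, not Julius's actual tool.

```python
# Illustrative specialized tool: the name is generic, the description does the heavy lifting.
clean_dataset_tool = {
    "type": "function",
    "function": {
        "name": "clean_dataset",
        "description": (
            "The last code run failed because of messy data. Try these, in order: "
            "1) strip currency symbols and thousands separators from numeric columns; "
            "2) coerce mixed-type columns with pd.to_numeric(errors='coerce'); "
            "3) normalize column names (lowercase, underscores); "
            "4) parse date-like columns with pd.to_datetime; "
            "5) drop or flag rows that still fail conversion."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Cleaning code to run in the sandbox"}
            },
            "required": ["code"],
        },
    },
}
```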
Anton
So let's focus on this data cleaning use case.
Rahul Sonwalkar
Yeah.
Anton
So we've identified that we need to run the data cleaning tools.
Rahul Sonwalkar
Yeah.
Anton
You've presented the model with the tool use API, basically the tool use fields that it then goes and populates, right?
Rahul Sonwalkar
Yeah.
Anton
But the model isn't cleaning the data itself. So is it specific, let's say a Python function that you know these are common errors in the dataset or what is it?
Rahul Sonwalkar
So an example of this could be you have a column called revenue.
Anton
Yes.
Rahul Sonwalkar
And the revenue column, sometimes it has a dollar sign: some of the numbers have dollar signs, some of the numbers are just numbers.
Anton
Yeah.
Rahul Sonwalkar
And this really confuses the model because you can give it like a [inaudible 00:24:01] dot head and it will give you top five rows. But when you have 1,000 rows and-
Anton
Yeah, that dollar sign could be on row number 700 and it's the only place that's in the entire table.
Rahul Sonwalkar
Exactly.
Anton
Yeah.
Rahul Sonwalkar
So you can give it instructions like check if the values in this column are consistent, and if they're not, find the values that are not consistent. Write code to find the outliers or values that are not consistent.
Anton
So those functions are dynamically generated.
Rahul Sonwalkar
So we check what column the error happened in. Let's say the model tried to create a sum of revenue like what's my total revenue?
Anton
And then some error got spat out.
Rahul Sonwalkar
So the error log says this column had a name error or type error. In this case it's a type error. So we check that and we say, okay, the column is this, the type error is this, let's write dynamic instructions for the model to inspect the column. The function is called inspect column. You give it-
Anton
But this inspect column, this is something that you have for want of a better word, hard coded somewhere because it's so common, is that right?
Rahul Sonwalkar
So in some cases, yeah, [inaudible 00:25:13] is not hard coded, but there are other cases where the function is just purely hard coded in pure Python code.
Anton
Yeah.
Rahul Sonwalkar
In this case, we are giving them a little more freedom of hey-
Anton
How to implement it. But what I mean is, this task of inspect column is so common that you know to inject that into the prompt already, basically, right?
Rahul Sonwalkar
Yeah.
Anton
You're like, okay, this is an error we see 50,000 times, so you ask the model to now specialize into a model that's only cleaning, only inspecting columns.
Rahul Sonwalkar
Exactly.
Anton
Right? Is that right?
Rahul Sonwalkar
Exactly. So the agent completely changed from doing this general purpose to doing this very specific thing, which is at this point in time, the only thing I care about is inspecting the column and fixing the type error in the column.
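A sketch of what dynamically generating that specialized tool could look like: the column name and error type pulled from the error log get baked into the tool's description before the next model call. Names and wording are illustrative.

```python
def build_inspect_column_tool(column: str, error_type: str) -> dict:
    """Build the inspect_column tool on the fly from whatever the error log reported.
    The description text is illustrative, not the product's actual wording."""
    return {
        "type": "function",
        "function": {
            "name": "inspect_column",
            "description": (
                f"A {error_type} occurred on column '{column}'. "
                f"Write code that checks whether the values in '{column}' are consistent, "
                "finds the outliers or inconsistent values, and reports them before retrying."
            ),
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }
```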
Anton
Right. Hence how we get to this multi-stage thing now. Right?
Rahul Sonwalkar
Exactly.
Anton
Yeah.
Rahul Sonwalkar
And you can sort of remove other tools from the context.
Anton
Yes.
Rahul Sonwalkar
And add more tools to the context that are really specific to what the model [inaudible 00:26:09]. And you're spot on about the 50,000 times. That's exactly what we do: we collect those patterns, like, oh, this error happened X thousand times this week, how do we fix that? An example of this was November 7th, when GPT-4 Turbo launched, and we realized that our file import errors just skyrocketed. Turns out that in OpenAI's training data, they collect a lot of data where the file paths have /mnt/data as the prefix. So it assumes all file paths are under /mnt/data.
Anton
So it thinks all data has this /mnt/data in front of it.
Rahul Sonwalkar
Yeah.
Anton
Yeah.
Rahul Sonwalkar
And then we tried to fix that with prompt engineering. We tried to tell it in the prompt.
Anton
Ignore mnt, look at the actual path.
Rahul Sonwalkar
Yeah.
Anton
Yeah.
Rahul Sonwalkar
That sort of brought it down, but still-
Anton
Did you end up just stripping it out with a pure string substitution?
Rahul Sonwalkar
That's what we ended up doing
Anton
Hell yeah.
Rahul Sonwalkar
Yes. After we get the function output, the model would just keep putting it in the function call, and we would detect that and just remove it from the string. It was funny, but-
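The eventual fix is about as plain as it sounds; something along these lines, with the path names being illustrative:

```python
# Plain string substitution on the model's output before the code hits the sandbox.
def fix_file_paths(code: str, upload_dir: str = "./uploads") -> str:
    """GPT-4 Turbo kept prefixing uploaded files with /mnt/data/; just rewrite it."""
    return code.replace("/mnt/data/", f"{upload_dir}/")
```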
Anton
No, but this is the thing, right? Yes. Maybe there's a future in which the models are perfect and can figure out intent with a very high degree of precision.
Rahul Sonwalkar
Yeah.
Anton
But in the world that we live in today, we have to really compensate for the unpredictability and limitations. So let's talk about, before we get to... because there's so much fascinating stuff here, you've given an example of one of the more unusual complications of building with LLMs, which is that the model can drift under you. We're used to it in software: someone pushes a package somewhere, API compatibility is broken five levels down the tree.
Rahul Sonwalkar
Yeah.
Anton
But here again, because the models are sort of flexible, it's like, did something even change? How do you look for that? Obviously you guys have got error telemetry, but...
Rahul Sonwalkar
Yeah, totally. That's been an ongoing thing that we are always trying to improve. And I don't think we will have a solution that would solve it for us for all future-
Anton
You just basically have to stay on top of it, right?
Rahul Sonwalkar
You just stay on top of it. There's a couple things you can do. Right now we're investing in evals. For the longest time, our evals were vibes where-
Anton
Real.
Rahul Sonwalkar
... for a really long time, where evals were all vibes. Where we realized that, as long as we are talking to a lot of users and we have this type of feedback loop with our power users-
Anton
You're on the phone with them.
Rahul Sonwalkar
Yeah. So one of the things is, if you pay for the pro plan on Julius, you get my phone number. And usually it's all nice things; people text me candid feedback like, "Hey, it would be great if we could have this."
Anton
Yeah.
Rahul Sonwalkar
But about once a month we go down for a few hours and then they call me at midnight. "I'm trying to finish this work and this is down!"
Anton
"I've got this assignment due!" Yeah.
Rahul Sonwalkar
Yeah.
Anton
I get some of those, but via email. Anyway, so yeah, you talk with the power users. You were talking about evals.
Rahul Sonwalkar
Yeah. That type of feedback loop got us pretty far. And I think one of the things about building applications, or AI applications, is there's a lot of noise from people who will tell you, "You need this to build an application."
Anton
Yes.
Rahul Sonwalkar
"You need evals."
Anton
Yes.
Rahul Sonwalkar
"You need the best this and you need... and before all this, before you-
Anton
Which is, again, it's a classic software trap.
Rahul Sonwalkar
Yeah.
Anton
I'm always inspired by Peter Levels, who's like, the guy just cranks out websites and web apps. Right?
Rahul Sonwalkar
Yeah.
Anton
People often ask him what frameworks and whatever he uses. He's like, "No, man, it's just a raw PHP and jQuery. I built these multimillion dollar businesses with raw PHP and jQuery."
Rahul Sonwalkar
Yeah.
Anton
It's literally like understanding the user is what's the important thing.
Rahul Sonwalkar
Yeah.
Anton
It's very inspirational. If you need inspiration about how you should build software that people actually really care about, just go and look at some of the stuff that Peter built.
Rahul Sonwalkar
Absolutely. He is a big inspiration. I think he's a little crazy for using PHP and jQuery, but I respect that.
Anton
Look man, it works, right?
Rahul Sonwalkar
I respect him for doing that.
Anton
Yeah.
Rahul Sonwalkar
None of us on the team are front end engineers, but we build front end and we use Workshell for that.
Anton
Yeah.
Rahul Sonwalkar
And there's-
Anton
There's tools out there that you could plug together and just use, right?
Rahul Sonwalkar
Exactly.
Anton
Don't overthink it.
Rahul Sonwalkar
Don't overthink it.
Anton
Yeah.
Rahul Sonwalkar
I think, in the case of Peter Levels, it's more like he just knows PHP and jQuery. Why would he learn a new thing?
Anton
Exactly. And he's fluent in it which is again like, why would you learn another language? Why would you download how another person thought about this problem when it's not the problem you're trying to solve?
Rahul Sonwalkar
Yeah.
Anton
Anyway, back to evals. So now evals matter because you have enough... when did you decide to make that jump?
Rahul Sonwalkar
We decided to make that jump around the time when we started supporting a second programming language, which is R.
Anton
Yeah. Pretty recent, right?
Rahul Sonwalkar
Pretty recent, yes.
Anton
Cool.
Rahul Sonwalkar
So our users for the longest time emailed us and said, "This is awesome. This is so helpful. Could you support R?" And we didn't get it because none of us on the team knows R, and we just told them, "Guys, you can do the same thing in Python."
Anton
Python, yeah.
Rahul Sonwalkar
But turns out that-
Anton
They don't speak Python.
Rahul Sonwalkar
They don't speak Python, and a lot of their other work is in R or the people they work with use R.
Anton
Yeah.
Rahul Sonwalkar
So we launched R kernels on Julius. Now Julius can also write and execute R code. And when that happened we realized, oh, the dependencies we've got to manage are 2X. We've got to manage the R kernels and Python kernels.
Anton
Yes.
Rahul Sonwalkar
And now we have thousands of users using it every day. We could unintentionally break things and not know about it.
Anton
Yes.
Rahul Sonwalkar
And our testing matrix is just doubled now because we have to have twice as many tools, we have to have-
Anton
Well, it's a combinatorial explosion all the time.
Rahul Sonwalkar
Yeah.
Anton
And it's like, given what you've said about building specialized tools to fix errors specific to, for example, importing Python modules, as soon as you introduce another language in your model, that means more specialized tools.
Rahul Sonwalkar
Yeah.
Anton
Yeah.
Rahul Sonwalkar
Exactly. So that's when we decided to build evals. And it's an ongoing project.
Anton
Of course. It never ends.
Rahul Sonwalkar
Yeah. It's a thing you have to build up over time: you just collect evals and figure out how to use them. But that's sort of how we keep it reliable.
Anton
What are some of the evals that you run?
Rahul Sonwalkar
A pretty common one is just end-to-end integration, which is, when a user uploads a data set and you ask a simple question like, "Show me something interesting about that data," can we get the user to a magic moment as soon as possible?
Anton
Right.
Rahul Sonwalkar
Which is the model writes code, runs code, shows you a cool visualization about your data.
Anton
Yes.
Rahul Sonwalkar
And for each new user who comes to Julius, and there's like thousands of new people who come in every day, we really obsess over showing them that wow moment in the fewest steps possible.
Anton
Yep.
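A minimal sketch of what such an end-to-end eval could look like. The julius_session harness and its methods are hypothetical names for a test client, not a real SDK; the point is the shape of the check: upload data, ask one question, and assert that code ran and a visualization came back within a few steps.

```python
# Hypothetical end-to-end "wow moment" eval: julius_session is a made-up test harness.
MAX_STEPS_TO_WOW = 4

def test_upload_to_visualization(julius_session, sample_csv="tests/fixtures/sales.csv"):
    julius_session.upload_file(sample_csv)
    reply = julius_session.send_message("Show me something interesting about this data")
    # Did the model write and run code, and did a chart come back within a few steps?
    assert any(step.ran_code for step in reply.steps)
    assert reply.has_visualization
    assert len(reply.steps) <= MAX_STEPS_TO_WOW
```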
Rahul Sonwalkar
And so that's a pretty common thing that we want to make sure always works. There's some crazy things. For example, we have a user in the UK who does TensorFlow stuff on his phone, so he doesn't even use the Julius web app. He uses the phone. And he's started to do some model training on his phone because he just loves how you can do it just with English. He's actually a data scientist. And one of the things is, because he's such a power user, we love supporting him.
Anton
Yeah.
Rahul Sonwalkar
So, do those dependencies break when we add a new package to the environment by default? And then there are the models. The models themselves change. They have a knowledge cutoff date.
Anton
Yes.
Rahul Sonwalkar
So when you're using GPT-4-0314, which is a year-and-two-months-old model, its knowledge cutoff is 2022. And in 2022, there was a different version of Pandas or TensorFlow, and that's the version the model remembers.
Anton
Yes.
Rahul Sonwalkar
Now, you can do one of two things. One is you tell the model to use the latest version, which it doesn't know, or you restrict your environment [inaudible 00:34:38]. And that complexity also increases when you want to support multiple packages.
Anton
You also have to online learn what the model knows about what packages are available at any given time, right?
Rahul Sonwalkar
Yeah.
Anton
Because it's strongly a function of not just, when did this package start to exist? But how common was it in the training data, right?
Rahul Sonwalkar
Yeah.
Anton
Yeah. What a pain in the ass.
Rahul Sonwalkar
Yeah, it is a real pain. And it became a real pain with R because R packages are pinned to the version of R.
Anton
Yes. It's a whole different package management metaphor.
Rahul Sonwalkar
Yeah.
Anton
Yeah.
Rahul Sonwalkar
So making sure. We have a lot of evals around that.
Anton
Good.
Rahul Sonwalkar
Do our dependencies work? Are they compatible with the model? Because let's say you threw in GPT-4o and the knowledge cutoff date is late 2023. Now we have to go back and recheck what packages-
Anton
What packages are available to the model?
Rahul Sonwalkar
Yeah.
Anton
Yeah. That's great as an eval. Do you guys have anything like unit testing for a specific specialized tool, or are you still going by vibes there?
Rahul Sonwalkar
We put in evals before we put in unit tests.
Anton
Okay.
Rahul Sonwalkar
Yeah.
Anton
No, but in some ways it makes sense, because the main determinant of whether or not your application is functioning correctly is actually the model itself.
Rahul Sonwalkar
Yeah.
Anton
It's the stability of the model. Speaking of which, there's two more questions I want to ask. The first is, these things are inherently non-deterministic. Are you running at temperature zero? Temperature obviously determines how deterministic the model's text completion is going to be given the same prompt. Are you usually running at temperature zero? A lot of people are.
Rahul Sonwalkar
Great question. We're running super close to zero.
Anton
Close but not equal to. Why not?
Rahul Sonwalkar
We just, from vibes, we thought it did better. I think we run 0.1 or something.
Anton
Okay.
Rahul Sonwalkar
One thing we need to try, and this is a good idea I just got, is that with error correction, we might want to try a slightly higher temperature. Let's say you're in the fourth loop of trying to fix an error.
Anton
Yeah. You don't want the same output again. I've seen this many times.
Rahul Sonwalkar
Yeah.
Anton
A while ago I did my little theorem prover project, and it would routinely get stuck. And it would try to fix one error and then end up back at the first error.
Rahul Sonwalkar
Yeah.
Anton
Create a new error, try to fix that error, end up back at the first error, and just loop forever.
Rahul Sonwalkar
Yeah.
Anton
So that might be a good way to escape. Try raising the temperature.
Rahul Sonwalkar
Yeah, exactly. Raise the temperature, get it more creative. Yeah.
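A sketch of that idea: keep the temperature near zero on the first attempt and nudge it up on each error-correction retry so the model stops reproducing the same broken output. The succeeded check is a caller-supplied placeholder, for example "the generated code ran without an exception."

```python
def completion_with_retries(client, messages, tools, succeeded, max_attempts=4):
    """Retry with escalating temperature; succeeded(response) is supplied by the caller."""
    for attempt in range(max_attempts):
        temperature = min(0.1 + 0.3 * attempt, 1.0)  # 0.1, 0.4, 0.7, 1.0
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=messages,
            tools=tools,
            temperature=temperature,
        )
        if succeeded(response):
            return response
    return None  # give up and surface the failure to the user / telemetry
```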
Anton
So that actually brings me very naturally to my next question, which was, we've talked about how we arrive at success, but obviously things fail pretty frequently. First is, how do you detect that failure, given that things are so... And we talked about evals, where you can obviously see a regression that's global, like for example with the file paths, but obviously there are local failures, like for whatever reason the model fails to perform the task. How do you detect that failure?
Rahul Sonwalkar
Great question. There's this whole thing about the deterministic part of software and the non-deterministic part of software. These models are inherently non-deterministic, but there are certain parts of the application that are super deterministic, which is what unit tests are for.
Anton
Yes.
Rahul Sonwalkar
And one of the reasons we don't have unit tests is, usually when the deterministic parts break, we do get pinged about errors in the system. Like, "Oh, there's an exception thrown," or whatever.
Anton
Yes.
Rahul Sonwalkar
But when the non-deterministic parts fail, it's hard to get pinged about that.
Anton
Yes.
Rahul Sonwalkar
It's really hard, besides realizing that your evals have gone down, or a lot of-
Anton
Users are complaining.
Rahul Sonwalkar
Users are complaining. So one of the things... and your question was, how do you detect that?
Anton
Yeah. I can imagine, for example, you've got some data set, like the users ask Julius to do something for them and it gets stuck in these error-correcting loops, right?
Rahul Sonwalkar
Yeah.
Anton
Among the tools that it has and the outputs that it's producing, it's not making progress, really. How do you detect it and what do you do about it?
Rahul Sonwalkar
So we make it super low-friction for the users to give us a pulse on the system, and that helps when you have users, because you can collect telemetry from that.
Anton
Yes. Yeah.
Rahul Sonwalkar
So one of the things is, after every third message from the AI, we ask the user to rate the performance as if you would rate an Uber app.
Anton
Is that a compulsory rating or is it optional-
Rahul Sonwalkar
Optional.
Anton
Okay.
Rahul Sonwalkar
Optional. Yeah, so it gives us insight into when things go wrong: we can see five-star ratings drop and one-star ratings shoot up.
Anton
Yeah.
Rahul Sonwalkar
And usually it is... we tried thumbs up and thumbs down and it didn't work as well. Users didn't use thumbs up and thumbs down, but when we put the stars in-
Anton
What does three star mean, if someone gives you a three star for the outputs? Mediocre?
Rahul Sonwalkar
Three stars are the most rare.
Anton
Yeah, for sure. It's interesting about the distribution, right?
Rahul Sonwalkar
Yeah. People, by default, put five stars, and then when things go bad they put one star. And we want to make sure that, by the third message, we get as many five stars as we can compared to one star and their ratio is as low as possible. So that helps us.
Anton
Yeah.
Rahul Sonwalkar
Another thing we do is when... we aren't allowed to look at users' data. That's just a policy we have with our users, is we don't look at your data.
Anton
It makes it harder to debug. It's also an issue for us. We're a data store, but we want to be able to help you without you having to ship us the thing you're having a problem with. And we're building tooling that's going to make that a lot easier pretty soon, but I'm curious to hear how you deal with the same problem.
Rahul Sonwalkar
Absolutely. So all the data you upload to Julius gets deleted after an hour of inactivity because we just literally destroy the code sandbox and nothing is retained besides the chat history. And you can also delete that anytime you want. But this is a policy we have with our users. So one of the things we do is we actually use smaller models that are cheaper to-
Anton
What are some of those, by the way? Just in that category.
Rahul Sonwalkar
Haiku is pretty good.
Anton
Claude Haiku?
Rahul Sonwalkar
Yeah. We used to use 3.5, which is also really good. Haiku I think is between 3.5 and 4, and it's cheap enough that we can do this for thousands of tasks a day.
Anton
Small models for specialized tasks seem to perform very well.
Rahul Sonwalkar
Yeah.
Anton
Yeah. So what were you using Haiku for? Sorry.
Rahul Sonwalkar
We use it for classifications. So we use-
Anton
"What type of error is in this chat?" For example.
Rahul Sonwalkar
Yeah.
Anton
Great.
Rahul Sonwalkar
Yeah. Or getting an understanding of... so yeah, when you just upload the data, is it structured properly?
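A sketch of that kind of cheap classification with a small model, using the Anthropic SDK. The categories and prompt are invented, and per the policy above this would run over error logs and telemetry rather than user data.

```python
import anthropic

# Illustrative categories; in practice these would come from the observed error patterns.
CATEGORIES = ["module_not_found", "type_error", "messy_dataset", "timeout", "other"]

def classify_error(error_log: str) -> str:
    """Use a small, cheap model to label an error log with one category."""
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        system=f"Classify the error into exactly one of: {', '.join(CATEGORIES)}. "
               "Reply with the label only.",
        messages=[{"role": "user", "content": error_log}],
    )
    label = message.content[0].text.strip()
    return label if label in CATEGORIES else "other"
```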
Anton
Yeah. Just now we've made an almost imperceptible transition because I think you and I are so used to working with these tools as part of our natural workflows, but I really want to highlight this part. What you've just described is an AI-driven software engineering process, which Julius AI is already using.
Rahul Sonwalkar
Yeah.
Anton
You've adopted this so seamlessly.
Rahul Sonwalkar
Yeah.
Anton
I think that's super powerful. It's very, very powerful. And we've done some of that for Chroma. We looked at people's questions and answers about where they were struggling and where the product surface needed to be improved. But when you're working with the models on a day-to-day basis, it becomes almost second nature to do that.
Rahul Sonwalkar
Yeah, absolutely.
Anton
Yeah.
Rahul Sonwalkar
You're still spot on. AI-driven software engineering.
Anton
Yeah. You're using AI in a software engineering process.
Rahul Sonwalkar
So true. There's so many features in Julius where the backend for the feature is literally a model call, so we never really wrote the logic. An example of this is that you can always export your data at any point in time. Whatever work the model's done on the dataset, you can click an export button and it will export that data. Now, there's two ways we could have built this. One is we could have built-
Anton
For every data format, write an exporter. Yeah.
Rahul Sonwalkar
Yeah.
Anton
Huge pain.
Rahul Sonwalkar
Huge pain. Or, what we do instead is we put in the export button, which basically tells the model-
Anton
It's a tool use call, right?
Rahul Sonwalkar
It's a tool use call that tells the model, "Hey, write code and run that code to export the data that corresponds to this code block or this output."
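A sketch of that pattern: the export button doesn't call a hand-written exporter per format, it just appends an instruction to the conversation and lets the normal tool-use loop write and run the export code. Wording and helper names are illustrative.

```python
def on_export_clicked(messages, code_block_id: str, file_format: str = "csv"):
    """Export button handler: no per-format exporter, just an instruction to the model."""
    messages.append({
        "role": "user",
        "content": (
            f"Write and run code to export the data produced by code block {code_block_id} "
            f"as a .{file_format} file, and return the path to the file."
        ),
    })
    return messages  # then hand back to the normal agent loop
```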
Anton
Yeah.
Rahul Sonwalkar
And the model just does it, and it does it pretty well. We have a graph editor that's entirely GPT-4 based, where you can click edit graph. It takes the graph generated by the model, takes the code that generated the graph, and then comes up with a JSON of parameters that you can toggle in the UI, like make it wider, make it smaller, and all that just pipes back to code.
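A sketch of how such a graph editor could be wired up: ask the model to turn the plotting code into a JSON object of editable parameters, render those as UI controls, and pipe the edited values back into a prompt that regenerates the plot code. The prompt and schema here are invented.

```python
import json
from openai import OpenAI

def extract_graph_params(plot_code: str) -> dict:
    """Turn plotting code into a JSON object of toggleable parameters for the UI."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},  # force a parseable JSON reply
        messages=[
            {"role": "system",
             "content": "Given matplotlib code, return a JSON object of editable parameters "
                        "(title, width, height, color, labels) with their current values."},
            {"role": "user", "content": plot_code},
        ],
    )
    return json.loads(response.choices[0].message.content)
# Edited values would go back into a prompt asking the model to regenerate the plot code.
```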
Anton
Yeah.
Rahul Sonwalkar
And-
Anton
And it's all dynamically generated UI, right?
Rahul Sonwalkar
Yeah. Yeah.
Anton
It's just super cool. And again, these are little innovations that you land on by having actual users, and to get the actual users, you just build something useful, right?
Rahul Sonwalkar
Totally. Totally. I mean, you mentioned the GPT wrapper thing.
Anton
It's mind-killing. It's just so mind-killing. It removes all creativity immediately.
Rahul Sonwalkar
You could be building so many cool things if you'd just get out of your own head a bit.
Anton
Yeah, maybe the model will do it someday, but I promise you're going to learn a lot about the users.
Rahul Sonwalkar
Yeah.
Anton
Yeah. Obviously GPT-4 does the stuff that you're saying it can do today, right? It can generate dynamic UI against data.
Rahul Sonwalkar
Yeah.
Anton
But you've packaged this in a way where it's actually useful to the user and you've learned that users actually want this in the first place.
Rahul Sonwalkar
Yeah.
Anton
Right? That's important. Let me ask you two last questions and then we can finish up. The first is what is the most surprising thing that you've encountered in building with AI models? Building software with-
Rahul Sonwalkar
Totally. I think there's a few different things that I found really surprising. So for context, I was a software engineer before building Julius, and one of the things is that of course these models are non-deterministic, but that's sort of an advantage, because you can run the same prompt on the same model and get a different output each time. And you can leverage that for something really, really powerful, which is you can give it new tools and get it to fix errors.
Anton
And dynamically generate those tools even.
Rahul Sonwalkar
Dynamically generate those tools even. I think the other thing which I couldn't have predicted when I first started building products in AI, it was that I was trying too heavily to take the concepts that I had before and the ideas that I had before and try to apply them in an AI native way.
Anton
Right.
Rahul Sonwalkar
So an example of this is, let's say, Cursor.
Anton
Right.
Rahul Sonwalkar
Right? So I use Cursor, the AI IDE, and what they're doing is they're kind of thinking from first principles: if you had an IDE that was just AI native, what would that look like?
Anton
Yeah, what would it do? Yeah.
Rahul Sonwalkar
Another example of this is I don't think anyone could have predicted a tool like Midjourney.
Anton
No, of course not. No.
Rahul Sonwalkar
Yeah. It's an interesting phenomenon where millions of people want to type a prompt into Discord and watch an image being generated. And-
Anton
I thought for a long time about the real power of that, like, "Why is Midjourney so successful?" There's plenty of ways for people to get pictures, right? There has been for a long time.
Rahul Sonwalkar
Yeah.
Anton
I think the most overlooked piece in UI for AI in general is how iterative it is in getting to the end of the problem. So for example, if you're writing software, it's very difficult to write it iteratively because you have to have a completed program before you even get the first output, before you can even start iterating on it.
Rahul Sonwalkar
Yeah.
Anton
With AI, you can have partially something that has the right vibe and then you can start narrowing it down and narrowing it down in a really iterative way. I think that's very unique about UI.
Rahul Sonwalkar
Absolutely.
Anton
AI, sorry.
Rahul Sonwalkar
Yeah. Yeah, I agree. And it's just there's so many interesting ideas out there to build that, I mean, who could have predicted these tools, right? Who could have predicted Julius? It's just the best way is you-
Anton
Experiment.
Rahul Sonwalkar
Experiment. You start-
Anton
Play with it.
Rahul Sonwalkar
Play with it. And these models are really, really, really fun to play with once you have sort of the intuition how to play with them. Julius wasn't the first idea that we worked on. It was probably the fifth or sixth idea. And the first five didn't work out, and it's okay.
Anton
But you learned something from them as well, right? You started to kind of onboard a sense, a feeling of how the models work and what they can and can't do and how you should work with them. I've definitely had that sense too. I think it's really important, even though Chroma isn't really a model company and we don't build AI applications, we do support the builders of AI applications and playing with the model pretty... I play with it every day. I try to make it do something unusual while I have some long compute running, and it's fun. One of my favorites is to try to gaslight GPT into believing things that aren't true.
Rahul Sonwalkar
A little bit.
Anton
I mean, look, for example, I have this chat somewhere where I try to convince it that there was a nuclear war in 1812 and it's being deliberately held away from its training data and I'm asking it, "Why would someone lie to you like that?" And you're watching it struggle, which is a little mean, but I want to see how it thinks. And the other thing is the Golden Gate Claude was very funny.
Rahul Sonwalkar
Yeah.
Anton
I don't know if you gave it a shot.
Rahul Sonwalkar
I loved it.
Anton
Yeah.
Rahul Sonwalkar
Yeah. I would-
Anton
Just fun. This is fun. It should be fun. That's got to be the core of all of this.
Rahul Sonwalkar
Yeah.
Anton
Let me ask you the last question for this, which is obviously we're describing Julius as this fairly, it's not mature, it's very much still being built, there's a lot to do. The core of it is very, very useful. People love it. But it didn't spring forth fully formed. Right? So what did you wish you knew or what's pain you ran into that you could have easily avoided if you knew about it in advance? Is there anything like that in the story of Julius?
Rahul Sonwalkar
That I should have avoided?
Anton
Yeah, if now you've got the experience, so you've learned the lesson, but if you had already known that, if you had given advice to your previous self when you started building Julius, what would that be?
Rahul Sonwalkar
There's two things. One of them is to remove the complexity in getting a new user to that awesome wow moment.
Anton
The wow moment? Yeah.
Rahul Sonwalkar
Yeah. So when a user starts using Julius, we used to put a lot of things in the way that we thought the users would want, but really... So one of the things was we thought the more things we add to the UI, the more it would educate the user on how to use the tool.
Anton
Yeah.
Rahul Sonwalkar
And we try all these ideas and the users would just sort of go to the input box. They just wanted to talk to the AI.
Anton
Nobody ever reads. That's definitely true.
Rahul Sonwalkar
Nobody ever reads. And soon we realized that all this stuff is getting into the way of the user. We have to make the first experience really, really simple, and then we can still add those things into the chat as generative UI features.
Anton
Right.
Rahul Sonwalkar
And sort of-
Anton
But that requires knowing about the model that you as the developer of Julius have the confidence that the model can perform this task.
Rahul Sonwalkar
Yeah.
Anton
Yeah.
Rahul Sonwalkar
Yeah. And then sort of selectively surface that to the user when you think it is appropriate for the circumstances [inaudible 00:50:03].
Anton
See, I think that that's actually also very, very underrated in AI development right now, is the degree to which you can capture user intent to also empower them and bring them through that ladder of confidence with the tool. Right? This is a theme that keeps coming up when I speak to people building an AI, is there's always this ladder of trust and complexity that you have to build with a user, because they're not experts in the model. They don't know what the model can do. And your job is to produce a product that can perform a useful task for that user, not to just dump everything on them all at once. And AI can be really, really good at that, because they know where the user's at, and the model kind of knows what can be done. And so connecting ability to intent at the right time is something really powerful and underexplored right now I think.
Rahul Sonwalkar
Yeah, absolutely. Totally, I agree. And the second thing, which I think we spent a lot of time on and which we could have simplified, is that we thought: we're taking a language model and we're giving it the shell of a computer, and we're letting it write all kinds of code, and that's awesome, and the possibilities of this are endless. It can do so many tasks for you, like go write scripts and execute them. An example of this is a user of Julius who runs an e-commerce business. He's not a programmer, but he's able to use Julius to find the reviews on his competitor's website, do some competitive intelligence, collect data, and get those insights using things like BeautifulSoup, et cetera. And we thought, "Whoa, this is awesome. This can do anything possible." That's not the case. We spent some time believing that, when we could have primarily focused on: what are the problems people are looking to solve? How can we build a tool to solve that problem? And how do we communicate that to the user?
Anton
Classic. I mean, that's classic software development. This is the thing. The principles translate despite the fact that this is a fundamentally new technology, right?
Rahul Sonwalkar
Yeah.
Anton
The principles under which we actually build useful things end up translating, get to users, iterate quickly, understand the problems, see what they're struggling with, and make that part better, right?
Rahul Sonwalkar
Yeah.
Anton
Yeah, ultimately, the message really is just build something that you think might be useful and try it out.
Rahul Sonwalkar
Yeah.
Anton
Or fun. That's the other thing. AI is super playful, and I think that we're coming out off a little bit of a SaaS hangover right now with software, because the late 2010s were these SaaS companies selling sales automation stuff and fine businesses, but very little play in building something like that. AI is just, it's so much more playful. It reminds me very much of the early web. People would put up random websites about whatever and immediately be able to show the world. You're able to just get the models to do whatever you want. Anyone has access to them in the same way, so it's just much more playful than that.
Rahul Sonwalkar
100%. It's just so much more playful. I think, given where the models are today, we have probably only built 1% of the applications that can be built with them.
Anton
I completely agree with you. I completely agree. What I actually think is that the speed limit right now probably isn't model capability. Even what you've shown us with Julius is already fairly science fictional, even from five years ago: you can have these general purpose generative UIs just popping up for serious data science. So I think the model capabilities are no longer the bottleneck. I think the bottleneck is actually human imagination and humans trying stuff.
Rahul Sonwalkar
Yeah.
Anton
Right? So we really want to encourage that. People should do it.
Rahul Sonwalkar
You want more ideas, you want more people playing with those ideas.
Anton
Yeah. People should build.
Rahul Sonwalkar
Yeah.
Anton
All right, man. Thanks very much. That was great. Cheers.
Rahul Sonwalkar
Thanks for having me, Anton. Yeah.
Anton
Of course.
Rahul Sonwalkar
Yeah.