Kalan Chan on building Sourcegraph's Cody
AI enables software engineers to build tools that can help them write code faster and more accurately. Kalan Chan, a software engineer at Sourcegraph, joins Anton to discuss how Sourcegraph is building Cody, a coding assistant that lives in your IDE.
Released Oct 30, 2024
Transcript

Kalan on LinkedIn: @kalanchan

Anton on X (Twitter): @atroyn


Anton

In comparison to traditional software where you're like, "Big bang, it's got to be right or wrong," AI, you can really get closer and closer to the thing that you're looking for.

Kalan

We decided, "Okay, we'll strip it down and give more powers back to the users," and we went through this whole chat redesign where we decided, "Okay, we're going to create a chat where users can go in and we're going to guide them on what they want to use as context."

Anton

I'm pretty excited about this one, because both of our companies are so developer-focused. Chroma is specifically focused on people building AI applications and Sourcegraph is just developer-focused in general.

Kalan

Yeah.

Anton

To talk about how developer-focused companies are building with AI is very exciting. I think there's a lot of really interesting lessons to learn here.

Kalan

Yeah.

Anton

Why don't you introduce yourself? Talk a little bit about some of the stuff we've built with AI in Sourcegraph and we'll take it from there.

Kalan

Yeah, sounds good. Hey, everyone. My name's Kalan, I'm currently a TPM at Sourcegraph. In my previous experiences, I've worked a lot in developer infrastructure and cloud infrastructure, so working with GCP, Kubernetes, CI/CD, deployment automation, stuff like that. I also have a good chunk of experience working on full-stack applications for large enterprises, so integrating with their management systems and a lot of their internal systems, stuff like that. Currently, I'm more focused on the program and product management space at Sourcegraph. I'm on the Cody core team, where we pretty much take care of all the client-facing experiences, so from the moment you open up chat to sending your first inquiry to getting a response back, we take care of all of that.

Anton

Great.

Kalan

Yeah. Most recently, our team at Sourcegraph, we've shipped Cody to a lot of editors, VS Code, JetBrains, Neovim, and we brought Cody to GA multiple times as well, and overall, it's been a lot of fun just building AI and making the Cody AI space a lot more developed for all our users. Yeah, it's been great.

Anton

Great. Let's take a step back and actually talk about what is Cody?

Kalan

Yeah. Cody is our coding assistant at Sourcegraph. You can think of it like a super-powered pair programmer for any developer in your IDE, so it lives right alongside your IDE of choice. We currently have it for VS Code and JetBrains, and it's also available on the web if you want to go to sourcegraph.com. Cody is able to do pretty much everything that you can do with LLMs, like ChatGPT or Anthropic, but we super-power it with our context engine behind the scenes, giving you better user interfaces to add context or the code snippets that you want, and letting you edit code right in your editor, instead of always copying and pasting from a dedicated chat browser like ChatGPT.

Anton

You said all of the things that Claude and ChatGPT can do for you. What are some of the things that you see people really use Cody for?

Kalan

Yeah. The power of Cody really lies in how it fetches context and knows your code base. Claude or ChatGPT is great for a single input, what you put in is what you get out, but Cody can help you go and fetch snippets that are relevant to your question. If you go into a code base and you ask it, "Hey, how is auth defined?" You would never be able to do that with a standalone chat browser like ChatGPT, but with Cody, it's able to go and find snippets of relevant information in your code base on what auth means and bring that into the chat, add it to the context window. We also make it a little easier for developers to add those kinds of interactions by... You can @-mention your files, you can add symbols, functions, all that stuff.

Anton

That's essentially all core Sourcegraph functionality, right? Sourcegraph is all about code search and navigating code. Why don't we start there? Cody has context on your code, which is what allows it to be powerful, because it's always working with the right information at the right time. Let's talk about integrating something that we can say is, let's say, a more traditional search product, like Sourcegraph, with an AI assistant like Cody. How does that integration look in practice? Is it like ChatGPT or Claude calling out to an API, or are you digesting the user's query directly? How does it work?

Kalan

Yeah, so there's multiple steps to it, and honestly, I owe it a lot to our CTO, Beyang. He was the one that pioneered this. What we try to do is you take your query, let's say... Let's go with the same one, "Where is auth defined?" What we would do is we would send that to an LLM and do keyword expansion, so we would ask the LLM, "Hey, out of this query that I just sent you, what are the keywords that I should pull out and can use for a nice search input?" We take that response from the LLM, let's say it's three words, like auth... Actually, let's say it's one word, just say "Auth."
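
As a rough illustration of that first step, here is a minimal keyword-expansion sketch using the Anthropic TypeScript SDK. The prompt wording and the `expandKeywords` helper are invented for this example; this is not Sourcegraph's actual pipeline.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Ask the LLM to turn a natural-language question into code-search keywords.
async function expandKeywords(query: string): Promise<string[]> {
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 128,
    messages: [
      {
        role: "user",
        content:
          "Extract the keywords from this question that would make a good " +
          "code-search query. Reply with a comma-separated list only.\n\n" +
          `Question: ${query}`,
      },
    ],
  });

  // The first content block holds the text reply, e.g. "auth, authentication".
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "";
  return text.split(",").map((k) => k.trim()).filter(Boolean);
}

// expandKeywords("Where is auth defined?") might return ["auth", "authentication"].
```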

Anton

Okay.

Kalan

Then what we do is we run that keyword through the search context engine. For enterprise, we use Sourcegraph search, and when you're local in your editor, we just do a symbol match and try to find auth in your code base.

Anton

You're directly integrated into, say, VS Code or whatever VS Code's actual symbol finding engine is in the language that you happen to be working in, right? Okay.

Kalan

Yeah.

Anton

That makes sense, but auth typically isn't a symbol associated with implementing auth, right? What happens next?

Kalan

Yeah, so that's where the expansion comes from. We try to find matching symbols. If it doesn't, then we can also use similarity search, like a vector search.

Anton

Like what we do.

Kalan

Exactly, like Chroma. Yeah. We don't do it exactly the same way as Chroma. We built it in-house, we don't have a vector database under the hood, we save everything locally and we run similarity search across flat files. Yeah, and then we try to pull as much context as we can from those keywords. You're right, auth is maybe a bad example in this case, because it's so broad and-

Anton

But it's a good example, because I want to know what happens.

Kalan

Yeah, exactly. We have those fallback mechanisms. If we can't find a right symbol for it, then we try to find other existing files with similarities close to it.
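
A rough sketch of that fallback order (exact symbol match first, then similarity over pre-computed chunks). The `findSymbol` and `embed` callbacks and the in-memory chunk list are stand-ins for whatever Cody actually uses locally; this is illustrative only.

```typescript
interface Chunk {
  file: string;
  text: string;
  embedding: number[]; // pre-computed when the repo was indexed
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Try the editor's symbol index first; fall back to similarity search
// over locally stored chunks when no symbol matches.
async function fetchContext(
  keyword: string,
  chunks: Chunk[],
  findSymbol: (name: string) => Promise<Chunk | null>,
  embed: (text: string) => Promise<number[]>,
  topK = 5
): Promise<Chunk[]> {
  const symbolHit = await findSymbol(keyword);
  if (symbolHit) return [symbolHit];

  const queryEmbedding = await embed(keyword);
  return [...chunks]
    .sort(
      (a, b) =>
        cosine(queryEmbedding, b.embedding) - cosine(queryEmbedding, a.embedding)
    )
    .slice(0, topK);
}
```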

Anton

Are you doing similarity search at the level of a file, at the level of a function? How is that broken up for... Similarity search for code, right? How do you actually break that up functionally?

Kalan

Yeah, so we have chunking mechanisms.

Anton

Yep.

Kalan

We take it at the code level, where everything's basically a string, like a text, and then we chunk it in sizes that try to match against the symbol that we want.
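
A simplified version of that chunking idea, treating the file as plain text and slicing it into overlapping line windows sized for embedding. The window and overlap sizes here are arbitrary, not Cody's real settings.

```typescript
interface FileChunk {
  path: string;
  startLine: number;
  text: string;
}

// Split a source file into overlapping line-based chunks for embedding.
function chunkFile(
  path: string,
  source: string,
  linesPerChunk = 40,
  overlap = 10
): FileChunk[] {
  const lines = source.split("\n");
  const chunks: FileChunk[] = [];
  for (let start = 0; start < lines.length; start += linesPerChunk - overlap) {
    const slice = lines.slice(start, start + linesPerChunk);
    chunks.push({ path, startLine: start + 1, text: slice.join("\n") });
  }
  return chunks;
}
```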

Anton

I just want to make sure that I'm understanding this correctly. Let's go back to the auth example.

Kalan

Yeah.

Anton

We do a symbol search from the keyword expansion, and then if we don't find the symbol, we do a similarity search. And the similarity search is over chunks that we've pre-computed over the repo, right?

Kalan

Yeah.

Anton

Two questions immediately arise. One is, when do you do that chunking? Is it when people install Cody? How does the chunk stay up to date?

Kalan

Yeah.

Anton

Yeah, how does that stuff work?

Kalan

Yeah, yeah. As soon as you log into Cody and it recognizes a repository, we'll start indexing, and there's going to be a little spinner at the bottom of VS Code that says, "Your code base is indexing, Cody is doing something," and periodically, we'll refresh those indexes to make sure that code's up to date.
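
In spirit that is an index-on-startup pass plus a timer-based refresh, rather than re-indexing on every file save. A sketch, with `indexRepository` standing in for the real indexer:

```typescript
// Index once when the workspace is recognized, then refresh periodically.
async function startIndexing(
  repoPath: string,
  indexRepository: (path: string) => Promise<void>,
  refreshIntervalMs = 30 * 60 * 1000 // hypothetical: every 30 minutes
): Promise<ReturnType<typeof setInterval>> {
  await indexRepository(repoPath); // the "your code base is indexing" pass
  return setInterval(() => {
    indexRepository(repoPath).catch((err) =>
      console.error("re-index failed", err)
    );
  }, refreshIntervalMs);
}
```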

Anton

It's periodic, it's not in response to how I'm editing? It's not like a callback on a file save?

Kalan

No, I believe not. I can double-check with my engineering team, but from what I understood was it takes a lot of work to do that fast re-indexing.

Anton

Not if you use Chroma.

Kalan

That's a good plug, yeah.

Anton

I'm serious, we should talk about that, because if that's what you guys are doing, we 100% support that, and of course, we deploy locally. You mentioned that you are also using the enterprise version of Sourcegraph Search. Is that in place of similarity?

Kalan

Yeah, yeah. What we found is that it's complicated to deploy a... I'm sure you know, it's complicated to deploy an embedding service for an enterprise. They have a lot of questions around security, and where are the files being sent to? Yeah, it's just really hard, so we decided like, "Okay, we have this powerful search API already, why don't we just try to modify it a little bit to understand keywords and natural language better and bring code snippets back from the code that we already have indexed that's super secure, that's on hardened machines, and then provide that as context for Cody?" Instead of doing a similarity search.

Anton

Gotcha. That was the original Sourcegraph product, right? It was essentially this enterprise-grade code search, right?

Kalan

Yeah. Exactly, yeah.

Anton

That must be a pretty sophisticated piece of software. Search is like that: you add more and more to it, and once it's really nailed down, it's just really accurate and precise. We've done query expansion, we've got some keywords, we've done some similarity. Do you do similarity on the keywords or similarity on the entire query? On the entire query, gotcha. My next question then is, okay, you get some results. Are they relevant?

Kalan

Yeah.

Anton

How do you decide if they're relevant or not for Cody?

Kalan

Yeah, so that's actually the hard piece that we've been battling for the last year.

Anton

Your whole hard piece.

Kalan

For the last year, maybe even two, we've always... It was actually really bad when we first launched Cody, because there was a lot of hype and expectations around what LLMs could do, and we didn't have an answer to when Cody could produce a bad result or when it hallucinated.

Anton

Yes.

Kalan

We decided that, "Okay..." We still do, we have guardrails in place to make sure that what information you're pulling in is not exact copyright from the open source community, but we don't have a super strong way to say, "Hey, this is completely false," because we don't... I feel like it's very hard to catch all those scenarios where Cody is making something up, because it hallucinated, or the LLM model behind the scenes is not strong enough to understand what's going on.

Anton

As we talked about, the flexibility of LLMs is a double-edged sword, because it allows us to process unstructured information, but at the same time, the information is unstructured, it's hard to tell if it's doing the right thing, but stepping back from the final output, how do you decide whether the results are relevant or not? Or we just take whatever results we get?

Kalan

We're just streaming the results back.

Anton

Okay, so whatever the results are, and we don't filter them afterwards.

Kalan

Yeah.

Anton

That's really interesting, and then Cody takes the in-context results and it takes the user's query and it tries to answer the query based on the in context results?

Kalan

Yeah.

Anton

Let's talk about some of the guardrails, because I know you guys must be experimenting with something. How do you try to gate some of Cody's output?

Kalan

Yeah, so this is actually a very relevant question for enterprise security.

Anton

Yes, of course.

Kalan

Yeah. A lot of enterprises, they come up to us and say, "Hey, we don't want our developers copy and pasting code that they found on Stack Overflow," and stuff like that, so what we've done as an enterprise solution is, because we have so many open source repos indexed, we try to match the code generated by the LLM against anything that we see in our open source index.

Anton

Gotcha.

Kalan

Yeah.

Anton

It's like a direct match? Like an exact character match, or is it a little bit fuzzy?

Kalan

It's fuzzy. I don't know exactly how the API works, but it reads it in chunks, so if there are 20 lines in there and 15 of those lines are completely copied and pasted, then we'll send it along.
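
A toy version of that kind of guardrail: compare the generated code against an index of open-source lines and flag the completion when too large a fraction matches. The 75% threshold and the Set-based index are purely illustrative.

```typescript
// Flag a completion if most of its lines appear verbatim in the
// open-source index (modeled here as a Set of normalized lines).
function looksLikeVerbatimOSS(
  generated: string,
  openSourceLines: Set<string>,
  threshold = 0.75
): boolean {
  const lines = generated
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
  if (lines.length === 0) return false;

  const matches = lines.filter((l) => openSourceLines.has(l)).length;
  return matches / lines.length >= threshold; // e.g. 15 of 20 lines matched
}
```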

Anton

Gotcha, gotcha. And, of course, you can use the enterprise Sourcegraph Search to look for that duplication as well.

Kalan

Yeah.

Anton

Yeah, it makes sense. I mean, it's really interesting. Software copyright and open source, on the one hand, it's like being litigated and settled so many times, and on the other hand, it's not really, because at the individual developer level, there's nothing really stopping you, right?

Kalan

Yeah.

Anton

But as an enterprise, you really need to be defensive with stuff like that. That raises another interesting point, which is, obviously, you've got hallucination as a problem, but you've also got regurgitation as a problem, where it directly reproduces some of its training data, which is actually, in many cases, not a problem, but in this case, it is.

Kalan

Yeah.

Anton

That's really interesting. What other guardrails? Let's talk about maybe Cody's prompt.

Kalan

Yeah.

Anton

And I want to get into also the backing LLM, but let's talk about Cody's prompt. Do you ever have Cody say, "I don't know"?

Kalan

Yeah, okay. I want to dive into a little story around it.

Anton

I want to hear the story, let's go.

Kalan

Yeah. You know how LLMs very... A lot of them can be very defensive, a lot of them can say, "I don't know," or, "I apologize," and stuff like that.

Anton

Yes, Claude apologizes whenever I ask it to do anything extra beyond what I originally asked.

Kalan

Yeah.

Anton

It's like, "You're making me feel like I'm abusing you, you don't need to apologize." Anyway.

Kalan

Okay, so it's funny you brought that up, because that was actually the exact complaint that we heard from our customers. When we first made our default model Claude 2.1, I believe Claude 2.1 was super apologetic, kept on apologizing for everything, and we heard so many customer complaints about, "Hey, Cody doesn't work. It keeps on apologizing, doesn't understand anything," and after we did a bunch of evaluations, we realized that Claude 2.1 was super hedgy. It likes to hedge its answers.

Anton

Yes.

Kalan

And we decided, "Okay, even though it's a more powerful model, let's just roll it back to Claude 2," because Claude 2 didn't do any of that, and at least it could give the user a starting ground of where to go and give it some explanation of what's happening. We actually saw that with Claude 3-5, too. We were evaluating whether or not we wanted to make it our default model for all edits and chats, and we-

Anton

Can the user select which model they want?

Kalan

Yes, we can.

Anton

Great, yeah.

Kalan

We can. I don't remember if you can set the default. I believe our default's always going to be 3.5, but the user has the choice to choose whatever.

Anton

Yeah, gotcha.

Kalan

Yeah, we were evaluating 3.5 on whether or not we should make it our default model, and we ran it against Claude 2 and 3.5. No, sorry. We ran against Claude 3 Opus and we realized, "Hey, 3.5, it's really smart, it's really good, but it apologizes a lot," and we even asked the Anthropic team like, "Hey, what should we do here?" Circling back to your original question, their guidance to us was, "You should just prompt engineer it and say, 'Hey, be strict and firm with your answers and don't apologize so much,'" and that's where we're at now.

Anton

Did it work?

Kalan

Yeah, it worked.

Anton

Yeah, it worked.

Kalan

But it worked in a certain regard: you had to stick it at the end of the prompt. It was weird.
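
That placement detail translates into something like the following when assembling the prompt: context first, then the question, with the anti-hedging instruction appended at the very end. The wording is made up for illustration and is not the actual Cody prompt.

```typescript
// Assemble the final prompt with the "be firm" instruction at the end,
// where it anecdotally had the most effect.
function buildPrompt(contextSnippets: string[], userQuestion: string): string {
  return [
    "You are a coding assistant. Use the provided code context to answer.",
    "",
    ...contextSnippets.map((s, i) => `Context snippet ${i + 1}:\n${s}`),
    "",
    `Question: ${userQuestion}`,
    "",
    // The instruction goes last:
    "Be strict and firm in your answers. Do not hedge and do not apologize.",
  ].join("\n");
}
```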

Anton

That's really interesting, and I think I've really only seen Anthropic provide guidelines on prompt structure. For example, retrieval. They suggest putting retrieval results closer to the start of the prompt and instructions later, just as you said, which is interesting, because that must be how they're training it.

Kalan

Yeah.

Anton

Claude is the default, people can choose their own model.

Kalan

Yeah.

Anton

You've talked a little bit about evaluating stuff. Evals are the big open question, I think, for everybody right now, aren't they?

Kalan

Yeah.

Anton

How do you do it? What's happening in Sourcegraph for these evals?

Kalan

Yeah. All right, so I just want to say there's nothing super scientifically groundbreaking.

Anton

Literally nobody has anything super scientifically groundbreaking, but everyone needs something practical.

Kalan

There is that user on X, though. I see him post a lot about his evals. I got to find it.

Anton

There's a few people who are getting more sophisticated, but I think the other thing about evals is they'll shift according to our use cases over time. But anyway, I want to hear about Sourcegraph's.

Kalan

Yeah. I completely owe all this to Julie, one of our engineers at Sourcegraph. She was really the one that pioneered and pushed the evaluations. She basically came up with this evaluation framework where you compare it against different models, but we have a couple of preset questions in there that are super relevant to Sourcegraph and our user base, so stuff like, "Generate certain code for me using this style in this repository," stuff like that. Yeah, we basically take the two models, we push the same prompts to them, and we evaluate based off of accuracy that we feel is good and a happy path that Julie set up.
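
Stripped down, an eval framework like that is a loop over preset prompts, run against two models, with a verdict attached afterwards by a human or an LLM judge. `runModel` and the cases are placeholders, not Sourcegraph's actual harness.

```typescript
interface EvalCase {
  prompt: string; // e.g. "Generate code in the style of this repository..."
}

interface EvalResult {
  prompt: string;
  outputs: Record<string, string>; // model name -> response
  verdicts: Record<string, "good" | "bad" | "unrated">;
}

// Run the same preset prompts against each model and collect outputs
// for rating afterwards.
async function runEvals(
  cases: EvalCase[],
  models: string[],
  runModel: (model: string, prompt: string) => Promise<string>
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const outputs: Record<string, string> = {};
    const verdicts: Record<string, "good" | "bad" | "unrated"> = {};
    for (const m of models) {
      outputs[m] = await runModel(m, c.prompt);
      verdicts[m] = "unrated"; // filled in later by a human reviewer
    }
    results.push({ prompt: c.prompt, outputs, verdicts });
  }
  return results;
}
```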

Anton

How do you evaluate accuracy that you feel is good? How does that happen? Is it like a bunch of people on the team just go over it and give a thumbs up, thumbs down? Is that literally it?

Kalan

Yeah, exactly. If you're a staff engineer, you can be very nitpicky on how you do things, but if you're an IC or a level-one engineer, you just want to get the job done. Different people have different styles to that, and that's why I say we don't do it very scientifically, because a lot of it is based off of vibes.

Anton

No, but that's the reality. That's really true. In fact, taking a, let's say, scientific perspective on this, most of our benchmarks for evaluating LLMs in general get saturated fairly quickly, in part because we know that we want them to perform well on that particular benchmark task, so the next iteration of the model is trained on things resembling that benchmark task, but vibes is the thing that really tells you, "Do I want to be using this model or another one?" You mentioned UI, and we will circle back to Cody's UI, do you have interesting UI set up for your in-house evals? If you're asking IC engineers to evaluate Cody's output, is there an easy way to give thumbs up, thumbs down? How are you guys doing that?

Kalan

Yeah. During a Cody hackathon maybe a month or two ago, Julie and... We have another staff engineer, his name's Olaf, they literally just built this React app, and we're just pushing data into it. Then you upload your data, you select the models that you want to evaluate against, it goes off and does the queries and brings back the information, and then we just say, "This is good, this is good."

Anton

That's perfect. Yeah, that's great. Before founding Chroma, I spent many years working in machine perception, and ultimately, the design of the human labeling pipeline is so important for getting this right, and I think what's going to happen is, previously, only if you were a company making self-driving cars did you need to think about your evals pipeline. Now, because everyone's building with AI, everyone needs their eval pipeline.

Kalan

Yeah.

Anton

All those lessons that we spent a long time learning in robotics seem like they're about to get relearned by the entire industry, and hopefully, we can get ahead of that, because we really want to help developers make that super easy. We want to build all kinds of evals into Chroma.

Kalan

Yeah. But how do you integrate that? Are you suggesting we should crowdsource evaluations?

Anton

The example that you gave is really good, where you have a tool where your team pushes data, it produces results, and then you evaluate those results. You, as a team. I'm not suggesting that we do crowdsourced evals, because again, it's proprietary data, but we can ship tools along with our AI frameworks or with our retriever, like Chroma, that make it easier to work with those tools. It's a new way of building software, which is what this whole thing is about.

Kalan

Yeah.

Anton

If you watch our previous episodes, everybody gives the same answer as you.

Kalan

Yeah.

Anton

It's like, "We're not really sure what we're doing, but we found that this works," which is why it's one of my favorite questions in this thing.

Kalan

Yeah, yeah.

Anton

Okay. Have you seen regressions? Has your eval suite caught regressions in either what you're doing or what the models are doing?

Kalan

Yeah, yeah. I think the hedging example is still a very good one for regressions.

Anton

How long ago was that, out of interest?

Kalan

That was a month or two ago-

Anton

Okay, pretty recent.

Kalan

Yeah.

Anton

Claude's hedging was a problem?

Kalan

Yeah, yeah.

Anton

Yeah.

Kalan

We realized that because we didn't like some of the answers it produced when it was apologizing. Yeah, catching regressions is quite hard, I would say. It's very hard. There are still times where we get bug reports saying, "Hey, this doesn't work," and it used to. We're like, "Well, really don't-"

Anton

Something changed. Yeah, we don't know what changed. Right.

Kalan

Yeah.

Anton

Yeah. I don't know, I feel like if we're going to build robust products on top of these APIs, we need something resembling a change log from the API provider, because they typically just bump the version and be like, "Here you go."

Kalan

Yeah, yeah. Or what they train their data on can be super different, too.

Anton

I think they have some important legal reasons to not tell us. Okay. Are you doing purely human in the loop evals? Have you tried using, for example, a more powerful model to evaluate the output, see how well it correlates with humans? Have you gone down that path or not yet?

Kalan

Yeah. Actually, that's a good point. On top of the human evals, we also have a prompt or LLM judge, I believe.

Anton

Yes.

Kalan

Yeah, we use an LLM judge, but we just use it for a secondary opinion.

Anton

Have you calibrated the judge to the human preferences?

Kalan

I don't know about that.

Anton

Okay.

Kalan

I should ask Julie.

Anton

Yeah. I mean, that's a pretty important step that sometimes gets missed, is you need to make sure that the LLM judge has the same opinion as you do. Now, in my head, I'm thinking you can guide the LLM judge in the same way that you can guide the actual product model itself, and you can give it more instructions about how you want things to really be, but that's a whole different quagmire. It's fun and exciting to be working this out for the first time, I don't think anyone has a great answer.
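
Calibration in that sense can be as simple as measuring how often the judge agrees with human labels on a shared set of outputs. A sketch, with `askJudge` as a hypothetical judge call:

```typescript
interface LabeledOutput {
  prompt: string;
  output: string;
  humanVerdict: "good" | "bad";
}

// Compare LLM-judge verdicts against human labels and report agreement.
async function judgeAgreement(
  labeled: LabeledOutput[],
  askJudge: (prompt: string, output: string) => Promise<"good" | "bad">
): Promise<number> {
  if (labeled.length === 0) return 0;
  let agree = 0;
  for (const item of labeled) {
    const verdict = await askJudge(item.prompt, item.output);
    if (verdict === item.humanVerdict) agree++;
  }
  return agree / labeled.length; // e.g. 0.87 means 87% agreement with humans
}
```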

Kalan

Yeah, yeah, and it's a hard problem to solve.

Anton

Yeah. Very hard.

Kalan

It's very hard.

Anton

It's very hard, and you have to also understand what good enough looks like, because you can always chase perfect forever, and at some point, you need to ship a product that people can actually use.

Kalan

Yeah.

Anton

You need to be really conscious of that decision a lot of the time. Let's go back to talking a little bit about Cody.

Kalan

Yeah.

Anton

You mentioned that Cody's got these unique UI elements. What's special about the UI in Cody that makes it work as an AI product?

Kalan

Yeah. The design philosophy changed about two, three months ago. We realized that Cody was getting too complicated and it was too much black-box magic, and we didn't want to go down that path anymore, because, one, it made troubleshooting really hard. We didn't have enough testing in place to test all these black-magic scenarios. And two, it gave users very little feedback on what's going on behind the scenes, like what went wrong and why it's a problem.

Anton

Yeah.

Kalan

We decided like, "Okay, let's just strip it all away. We're going to make it super simple, we're going to try to be a thin wrapper around the LLM," but-

Anton

Okay. I just want to pause here, I'd like to understand, what's that complexity? Was it just like the length of the prompt was really long? Was it that you were catching a lot of edge cases for users? What was in that complexity that you stripped away?

Kalan

Yeah, okay. Great question. Let's go with one example. The most heavily-used use case for LLMs in general is, "Explain this code."

Anton

Yeah.

Kalan

What we found was that when users clicked, "Explain this code," they didn't know the prompt behind it. We could have said, "Explain this code in the most sophisticated way and make it sound beautiful, like Shakespeare," or something, and then the user would have no idea what's going on there. They would just get a response, and whether or not the user liked it was completely dependent on us.

Anton

Yeah, sure.

Kalan

Well, we thought about that and we're like, "Okay, maybe we shouldn't hide all the prompting strategies in the-"

Anton

Interesting.

Kalan

What we decided to do was, when you click this command called "Explain code," we'll print out the entire prompt for you and you can review it. We'll still send it off, but you have a chance to review it one more time first.

Anton

Can you edit it?

Kalan

Yeah, you can.

Anton

Nice.

Kalan

Yeah. It's in the chat input and the user can review it if the-

Anton

Just like it gets saved?

Kalan

Yes, there's history.

Anton

Between calls?

Kalan

Yes.

Anton

If I ask for, "Explain this code again," it'll use my prompt?

Kalan

Sorry. No, that part, it's fixed on our side, but every time you click "Explain this code," you'll create a new chat input that you can edit.

Anton

Gotcha, gotcha, gotcha.

Kalan

Yeah.

Anton

Nice, makes sense. Okay, but before that, you were hiding the prompt, you were hiding... What other stuff besides the prompt were you hiding before?

Kalan

Yeah, so before, we had this magical button called "Enhance context," and that was a crazy one. Sorry. The whole flow of retrieving context and using it as context for the LLM, it's pretty well known across the developer community, but to a user, let's say, in a Fortune 500 enterprise, they have no idea what's going on.

Anton

I would even say even outside the Fortune 500. I mean, again, probably something like 90 to 95% of all software developers have never built anything with AI.

Kalan

Really? You think so? Okay.

Anton

That's been my experience when I go out and talk to people, and part of the point of this series is to get more people building with AI.

Kalan

Yeah.

Anton

What even "Enhance context" means is a hidden mystery.

Kalan

Yeah, exactly. When users were clicking this button, we got a lot of questions, even on GitHub, "What does this button do?" We try to explain it in the docs, but still, it's hard, because we're mixing and matching: symbol search, similarity search, all mixed into the context window.

Anton

Yeah, so that you didn't have to explain any of that, you wrote "Enhance context," right?

Kalan

Yeah.

Anton

Yeah.

Kalan

Exactly. We realized that's probably causing a lot of the confusion for our users, so we decided, "Okay, let's strip it away, let's not try to be so magical and this one-stop-shop tool that everyone's dreaming about. We'll strip it down and give more powers back to the users." We went through this whole chat redesign where we decided, "Okay, we're going to create a chat where users can go in and we're going to guide them on what they want to use as context," but they first have to understand what context is. That is the prerequisite of everything.

Anton

Yeah. You're like walking them up the ladder of complexity, which I think is really clever and the right way to build.

Kalan

Yeah, exactly. What we did was we took that away and then we thought, "Okay, first, if you're in a file, we're going to just attach that file as a little context chip in the chat, and then we're also going to attach the repo as well."

Anton

When you say "Attach the repo," like the name of the repository? What do you mean, attach the repo?

Kalan

We can also do search across the repo, and we created a little context chip like that as well. A little indicator in the chat that you're using your repo as an information source.

Anton

Gotcha.

Kalan

Yeah, and we made them all tokenized items. The user can delete as many as they want, they can add as many as they want, but the whole philosophy is, "Let the users decide and add more freedom to what they want to add as context." We'll give you a small little nudge, but we won't give you everything that you need.

Anton

Gotcha. Is that proactive from the user's perspective, in the sense that they say, "Hey, yeah, I want Sourcegraph to search and then pull in the context"? Or is it more reactive, where it's like, "Here's all this stuff we found, which ones do you want to include?"? Which way does that work?

Kalan

It's the former.

Anton

The former?

Kalan

Yeah, yeah. We'll still do the search for you.

Anton

But you have to tell us.

Kalan

Yeah, but that stuff is interesting, though. That'd be like multi-turn conversation.

Anton

This is something that I like to bring up and I think is under-explored. Well, there's a bunch of people actually working on this, and so we've seen that pattern emerging quite a bit, and it's interesting to us, because we're used to... Because of the RAG paradigm, which makes me sick to say, I hate RAG, I hate that phrase, we are starting to get used to the idea of providing LLMs with additional context from data that we have laying around, but I think the thing that's coming next is providing additional instructions in an iterative way, and then storing those instructions and then keep passing them as context as well, which is really exciting and we're starting to see it.

Kalan

Yeah.

Anton

Some people are already using it in production, too, that way. Yeah.

Kalan

Okay, you hit a really interesting point there. Steve Yegge, do you know who Steve Yegge is?

Anton

Yes.

Kalan

Yeah, so Steve has been coining this term CHOP for a while now. It's called Chat-Oriented Programming.

Anton

Yes.

Kalan

Exactly like what you said, he's been trying to push this idea that, within chat, you should be iterating every single time and you should feed in instructions and talk to it almost as if you're giving orders until you get the final response that you want, and we have a feature like that too. It's called Smart Apply, where the LLM will generate code and you can keep feeding in instructions until it gets to the point where you want it, and then you can just hit "Apply" and then it just funnels into the file that you want it to.
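
Conceptually, that loop keeps the whole instruction history in the message list, regenerates after each refinement, and only writes to the file when the user applies. A rough sketch, with a generic `chat` function standing in for the model call rather than the real Smart Apply implementation:

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Iterate on generated code until the user is happy, then apply it.
async function smartApplyLoop(
  initialRequest: string,
  chat: (history: Message[]) => Promise<string>,
  nextInstruction: () => Promise<string | "apply">,
  writeToFile: (code: string) => Promise<void>
): Promise<void> {
  const history: Message[] = [{ role: "user", content: initialRequest }];
  let latestCode = await chat(history);
  history.push({ role: "assistant", content: latestCode });

  while (true) {
    const instruction = await nextInstruction(); // next refinement, or "apply"
    if (instruction === "apply") break;
    history.push({ role: "user", content: instruction });
    latestCode = await chat(history);
    history.push({ role: "assistant", content: latestCode });
  }

  await writeToFile(latestCode); // funnel the accepted code into the target file
}
```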

Anton

Yeah, and I guess one step beyond that is just saving the results of that.

Kalan

Yeah.

Anton

One thing that we've seen people play with, besides retrieving relevant data, is retrieving an instruction chain from earlier that's related to the current task, and you can do that through similarity search or something a little more complex than that, but we're starting to see it, and that's really cool. I've actually heard that story now many times, where people start off with this complex shell and then they actually realize, "No, we should just strip everything out and make it as simple and interactive as possible," and obviously, that's a form of surprise, because you keep having to shore up the model. What else was surprising in building a product like Cody?

Kalan

I would say another really difficult thing that we've been tackling is how much customization enterprises want around where they want to pull their models from and which service provider they trust.

Anton

Yeah, I was going to ask. At the enterprise level, do you find people are running their own inference or running their own models yet?

Kalan

Yeah, we have a lot of demand for self-hosted models.

Anton

When you say demand for self-hosted, obviously, you don't mean that Sourcegraph is going to host them, you mean that the enterprise is running a self-hosted model and they want you to plug into it?

Kalan

Yeah, exactly.

Anton

Why do they do that? Is that for security? Is it for cost? Is it just because they want control, so that they can iterate? What is it for?

Kalan

Yeah, I would say it's a good mixture of it. Sometimes we see customers that, let's say, they have GCP credits and they're like, "Hey, we only want to put on GCP, because we have a multi-million-dollar deal with them," so that's one example. Then there's other examples like, "Hey, we got this super cool AI team that's experimenting with this model. Hey, could you plug Cody into it and see what it looks like?" All of those customizations, they're like death by a thousand cuts, because we can do something right for one customer, but then can't really translate it well for another customer.

Anton

What's an example of that?

Kalan

A completions endpoint.

Anton

Okay.

Kalan

If one customer decides, "I'm going to make my response slightly different than how Anthropic does it-"

Anton

I see what you're saying. Because they have slightly different interfaces, plugging Cody into all of them is a pain in the ass.

Kalan

Yeah.

Anton

I get it. I see now, because of course, look, frankly, the APIs of pretty much every LLM provider are still pure insanity to me. They're very AI researcher-brained, they're not very developer-focused, and so I can see why that would be a friction, right?

Kalan

Yeah, yeah. It's hard. Even OpenAI and Anthropic, I would say they're pretty similar now, but there are still slight differences.

Anton

See, this is a task where I would suggest using AI, actually, to write the integrations. No, because look, LLMs, at their core, are good at a few core tasks, which turn out to be very general.

Kalan

Yeah.

Anton

One of the tasks that they're good at is translation, and this is a translation problem.

Kalan

Yeah.

Anton

It's just taking one interface and turning it into another one, so if they have a specification for their API and you're like, "This is what Cody provides today, write a translation layer," I think that they would probably get 95% of the way there most of the time.
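
The translation-layer framing maps naturally onto a small adapter interface: one internal completion shape, and one adapter per provider or per customer endpoint. The field names and URL below are invented for illustration.

```typescript
// Internal completion shape (hypothetical).
interface CompletionRequest {
  prompt: string;
  maxTokens: number;
}

interface CompletionResponse {
  text: string;
}

// One adapter per provider or customer endpoint.
interface CompletionProvider {
  complete(req: CompletionRequest): Promise<CompletionResponse>;
}

// Example adapter for a customer endpoint whose request and response
// shapes differ slightly from the internal one.
class CustomEndpointProvider implements CompletionProvider {
  constructor(private baseUrl: string, private apiKey: string) {}

  async complete(req: CompletionRequest): Promise<CompletionResponse> {
    const res = await fetch(`${this.baseUrl}/v1/complete`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      // Translate the internal request into the customer's wire format.
      body: JSON.stringify({ input: req.prompt, max_output_tokens: req.maxTokens }),
    });
    const body = await res.json();
    // Translate the customer's response shape back into the internal one.
    return { text: body.output_text };
  }
}
```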

Kalan

Yeah, yeah. I think so, too. It's gotten a lot better at reading JSON, too.

Anton

Yeah, it has. I was going to ask about tool use. Right now, it sounds like you still have the search and retrieval stuff as parallel infrastructure, right? Claude doesn't directly call the retriever, but Claude does have tool use, and so does ChatGPT and everyone. Have you thought about integrating retrieval more directly with Claude? Rather than having to do query expansion, extracting that out, running the search, have you thought about Claude being like, "Okay, call the retriever using this tool," and then providing it with the results?

Kalan

Yeah.

Anton

Experimented with it?

Kalan

[inaudible 00:32:34] more agentic coding? Yeah.

Anton

I mean, see, this is the thing. Everyone's got a different definition for "agent," so I like to be specific. What I mean by that is using the tool use API, basically making Claude aware that you have this function and what its arguments are, and then Claude can call that function. You can give Claude the results.

Kalan

Gotcha. I actually don't know. I know our AI team, they're experimenting with autonomous coding and we have this one feature at Sourcegraph called "Batch Changes," where you give it a spec and it can run across a code base and do changes for you. I think that's probably one of the most realistic things that we can do with AI right now, LLMs calling functions, but there's nothing concrete on our product roadmap. Yeah.

Anton

Sure. I was just curious if people are starting to integrate retrieval and search in that tool using [inaudible 00:33:24], because you totally can, right? You could call chroma.query and then populate.
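
Concretely, that pattern with Anthropic's tool-use API looks roughly like the following: declare a retrieval tool, let the model request it, run the search, and hand the results back for the final answer. The tool name and the `searchCode` callback (which could be backed by something like a Chroma query) are simplified for illustration.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "search_code",
    description: "Search the user's code base and return relevant snippets.",
    input_schema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
];

// Ask a question, let the model decide to call the retrieval tool,
// run the retrieval, and hand the results back for a final answer.
async function answerWithRetrievalTool(
  question: string,
  searchCode: (query: string) => Promise<string[]>
) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: question }];
  const first = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    tools,
    messages,
  });

  const toolUse = first.content.find((b) => b.type === "tool_use");
  if (!toolUse || toolUse.type !== "tool_use") return first; // answered directly

  const snippets = await searchCode((toolUse.input as { query: string }).query);

  // Return the tool result and let the model write the final answer.
  return client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    tools,
    messages: [
      ...messages,
      { role: "assistant", content: first.content },
      {
        role: "user",
        content: [
          {
            type: "tool_result",
            tool_use_id: toolUse.id,
            content: snippets.join("\n---\n"),
          },
        ],
      },
    ],
  });
}
```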

Kalan

But is it deterministic enough? That's my biggest worry.

Anton

You want it to not be deterministic almost.

Kalan

Really?

Anton

Because what's the point of doing, for example, the keyword expansion that we described? The point of asking the model to do keyword expansion is that we want it to be fuzzy, we want it to use its general knowledge of things that are associated with this type of thing to generate more keywords, or to... One thing that's popular is full-blown query expansion, where, rather than taking the query as worded, it might first ask more detailed related questions or reword the query in several different ways, and then send that to the similarity search. Models are good at that, they work pretty well, so the non-determinism is part of what we want.
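
That kind of full-blown query expansion amounts to generating several rewordings and merging the retrieval results. A sketch, with hypothetical `rewriteQuery` (LLM-generated variants) and `similaritySearch` helpers:

```typescript
interface SearchHit {
  id: string;
  text: string;
}

// Generate a few rewordings of the question, search with each variant,
// and merge the results, de-duplicating by document id.
async function expandedSearch(
  query: string,
  rewriteQuery: (q: string, n: number) => Promise<string[]>,
  similaritySearch: (q: string, topK: number) => Promise<SearchHit[]>,
  variants = 3,
  topKPerQuery = 5
): Promise<SearchHit[]> {
  const queries = [query, ...(await rewriteQuery(query, variants))];
  const seen = new Map<string, SearchHit>();
  for (const q of queries) {
    for (const hit of await similaritySearch(q, topKPerQuery)) {
      if (!seen.has(hit.id)) seen.set(hit.id, hit);
    }
  }
  return [...seen.values()];
}
```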

Kalan

It's interesting.

Anton

In terms of determinism, actually, obviously, the models have temperature settings. Are you running Claude at T-0 for Cody?

Kalan

Yeah.

Anton

Yeah. Right, T-0, because they're T-1 by default in the chatbot.

Kalan

Are they?

Anton

Which I found very interesting. Yes.

Kalan

I didn't know that. Yeah.

Anton

In the chatbot, they're T-1 by default, which makes a lot of sense for the chatbot. You want it to be different, basically. Take it down different paths. Cool. A couple of follow-ups. One is you've been on this journey for a little while. What advice would you give someone who's just starting to build with AI? Let's say they haven't really built with the API before and they have some idea or a product in mind. What would you suggest that they do now that you've learned the hard lessons?

Kalan

Yeah, I would say really focus in on what you're trying to solve. I think, for us, we're in a space where we're trying to make it so everyone can code and have an AI assistant that understands all your code and answers questions, and that's very broad, it's very hard. A staff engineer that uses Cody is much different from how a junior developer would use Cody, and we were trying to solve a lot of the general cases just good enough. My biggest advice for someone new to this would be to really hone in on what you're targeting, because it makes the happy paths a lot easier.

Anton

I guess the other thing is where do you see a product like Cody going as the models continue to get smarter? How are you thinking about that part?

Kalan

Yeah. Originally, when we first built Cody and we were getting a lot of feedback that it wasn't working the way people expected, I personally thought, "It's probably contingent on how smart the model is behind the scenes," and that turned out to be true. If I compare Cody when it was using GPT-3.5 from OpenAI to now using Claude 3.5, it's a world of difference. It's just so much smarter, it can do so much more. In terms of what I think is coming, I hope that it can get better at understanding reasoning. Even now when we're talking, humans have a certain way to communicate through body language.

Anton

Yes.

Kalan

If we can somehow get LLM models to understand that part-

Anton

Like emotive content of language? Yeah.

Kalan

Yeah. I would say this whole paradigm of CHOP programming, or turn-based programming, would become even better, because what I've personally found using Cody and AI assistants nowadays is that you have to know what you're doing beforehand. You have to give it very clear instructions, and it'll take you very seriously. If you want the color red, it'll give you red. Even if it's super ugly, there's no gut check that it looks good.

Anton

Yeah. What you're saying is they should get better at capturing intent almost, right?

Kalan

Yeah.

Anton

Because, yeah, they do take you super literally. They're really eager to do the exact task. When I'm actually exploring how to do something and I'm not sure exactly what I want, their ability to be a bit more open-ended, I think, would be really useful. I agree with that.

Kalan

Yeah.

Anton

I agree with that.

Kalan

I miss the whole paradigm of pair programming when you're with someone or with a designer and you're hashing out ideas. I find the LLM models right now, they're not good at giving you ideas to build off of. Like we just talked about, they're very good at taking instructions and doing it.

Anton

Yeah.

Kalan

Yeah.

Anton

Cool. All right, well, thanks very much. It was really informative. I think that there's a lot to build in AI developer tooling. It's one of the verticals in this space that I'm really most excited about, because again, it's a huge productivity unlock, right? And we're looking into a lot of that here at Chroma, we're adopting AI processes as much as we can too, so it's always great to hear what's happening on the tooling front.

Kalan

Yeah. This is super exciting.

Anton

Great, thanks for coming in. Yeah.

Kalan

Yeah. Thanks so much, Anton.

Anton

Thank you.

© 2024 Chroma. All rights reserved.