Amjad Masad & Michele Catasta - Building AI For All
What is the future of Software Engineering in the age of AI? Amjad Masad, Founder & CEO of Replit, and Michele Catasta, head of AI at Replit, share their take in this opening keynote from the AI Engineer Summit 2023.
It feels like a moment. It feels like a historical moment here. My name is Amjad. I'm the co-founder of Replit, where we aspire to be the fastest way to get from an idea to deployed software that you can scale. So I'm going to take you back a little bit, not as far back as Swyx went, to 600 AD, but perhaps to the start of computing. If my clicker works... it does not work. So, next slide. We're going to get AGI before we get good presentation software. All right, here we go.

So, very early computers: the ENIAC was the first Turing-complete, programmable, general-purpose electronic computer. The way you programmed it is you literally punched cards, not physically, but you had a machine that punched these cards, which are binary code for the machine to interpret. It was really hard. There wasn't really a software industry, because this was really difficult. It automated some tasks that human computers did at the time, but it didn't create the software industry yet. But then we moved from punch cards to text, and we had first assembly, then compilers and higher-level languages such as C, and then someone invented JavaScript and it's all been downhill since then. Text-based programming was at minimum a 10x improvement, if not 100x, in programming. So we've had these orders-of-magnitude improvements in programming before. Then the IDE became a thing because we had large-scale software. This is a screenshot from around 2017 or 2018, when we added LSP to every programming environment on Replit, so anyone with an account can get IntelliSense, and we're really proud of that. At the time we were burning a lot of CPU doing that kind of inference, and if you've run the TypeScript language server, you know that's a lot of RAM. But we were really proud that we were giving everyone in the world the tools to create professional-grade software.

About three or four years ago we started thinking about how AI could change software. It actually started much sooner than that, but with GPT-2 you could give it some code and have it complete part of it. You go, okay, this thing is actually happening, and we'd better be part of it. So we started building, and we built this product called Ghostwriter, which does autocomplete, chat, and all sorts of things inside the IDE. And in just those two years, the pace of progress across the industry has been remarkable: AI was deployed and a lot of different engineers were using it. The AI-enhanced engineer, as Swyx called it. Everyone is using these tools. So we now have a world where a lot of people are gaining a huge amount of productivity. I don't think we're at an order-of-magnitude improvement yet; we're probably at a 50 to 80 percent, perhaps 100 percent, improvement for some people. But we're still at the start of this, and we think it's going to be 10x, 100x, perhaps 1,000x over the next decade. The problem, however: Replit's mission has always been about access. Our mission is to empower the next billion developers. And so we really didn't want to create a world where some people have access to Ghostwriter and other people don't. And we started thinking about, okay, what do we do about it?
If you really take to heart everything that the AI Engineer Conference is about, that we're at a moment where software is changing, where AI is going to be part of the software stack, then you have to step back a little bit and rethink how programming changes. Our view is that these programming add-ons, such as Copilot and Ghostwriter and all these things we give cute names, are not the way forward. We think AI needs to be infused in every programming interaction you have, and it needs to be part of the default experience of Replit and, I'm sure, other products in the future. That's why we're announcing today that we're giving AI to the millions of users coding on Replit. We think this is going to be the biggest deployment of AI-enhanced coding in the world. We're going to be burning as much GPU as we're burning CPU, so pray for us. We have people all over the world coding on all sorts of devices, people coding on Android phones, and they're all going to get AI now, so they're all going to be AI-enhanced engineers.

But as Swyx showed, it's not just about AI-enhanced engineering; there's also the product. AI being part of the software creation stack makes sense, but AI as part of the call stack is also where a lot of value is created. That's why we have this new product called ModelFarm. ModelFarm gives you access to models right inside your IDE, so all it takes is three lines of code to start doing inference. We launched with Google Cloud LLMs, but we're adding Llama pretty soon, and we're adding Stable Diffusion. If you're an LLM provider and want to work with us and provide this on our platform, we'd love to talk to you. There's a free tier: everyone gets free access to ModelFarm at least until the end of the year, so you can start doing inference and start building AI-based products. So next up, I'm going to bring up my colleague, Replit's head of AI, Michele Catasta, to talk about how we train our own AI models. And we have one more announcement for you coming up.

Thank you. All right. Hi, everyone. Today I'm going to be talking about how we're training LLMs for code at Replit, and I'll explain this weird title. If you've been on Twitter, I think it was more than a month ago, you must have read this piece from SemiAnalysis. Their point was that it's meaningless to work on small models trained on a limited amount of GPUs. That came as a shock to us, because we had a very good success story back in May, when we started to train our models from scratch. And then Amjad and I and the team started to think: are we really wasting our time here? I'm going to try to convince you that it's actually not the case. So, our code completion feature on Replit is powered by our own bespoke large language model. We train it on open source code, both published on GitHub and developed by the Replit user base. It's a very low latency feature, so it sits in a different spot compared to what you might be used to with other plugins. We try to keep our P95 latency below 250 milliseconds, so that the developer experience feels almost instantaneous: you don't even have to think about it, and the code is completed for you. At the model size we were using, we have been state of the art across the past few months. Let's do a show of hands: who has heard about our v1 model back in May? All right, that feels good.
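Before Michele's deep dive, here is roughly what the "three lines of code" ModelFarm call Amjad describes might look like. This is a hedged sketch: the import path, class name, and model name below are assumptions for illustration, not verbatim from Replit's documentation.

```python
# Hypothetical sketch of a ModelFarm completion call inside a Repl.
# Module path, class, and model name are assumptions, not the documented API.
from replit.ai.modelfarm import CompletionModel

model = CompletionModel("text-bison")  # one of the Google Cloud LLMs ModelFarm launched with
response = model.complete(["Write a one-line docstring for a function that reverses a string."])
print(response)
```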
For a second I felt like an AI star. Jokes aside: we released replit-code-v1-3b back in May. We got a lot of adoption, a lot of love, and also a lot of contributions, and that's one of the key reasons why we decided to give back. Replit's history has been built on the shoulders of giants, on all the people contributing to the open source space, so we thought we should do exactly the same here: we should give back our model. And today I'm announcing replit-code-v1.5-3b, the evolution of the model we released back in May. Let's go into detail. As Amjad was saying, for the next ten minutes we're going to do a technical deep dive, and I'm going to tell you how we built it and why it's so powerful.

First of all, we followed a slightly different recipe compared to last time. If you recall, back in May our v1 was a LLaMA-style code model, which means we followed a lot of the best recipes that Meta pioneered. Now we went one level up, and we are training at up to 300 tokens per parameter. If you've been following a bit of the history of LLMs, even two years ago most models were undertrained. Pardon the word, it's not exactly correct technically speaking, but the truth is that in mid-2022 the Chinchilla paper from DeepMind came out, and it was a big warning for the whole field. Basically, what that paper tells us is that we were undertraining our models: we should give them way more high quality data, and in exchange we can train smaller models. So in a sense we're amortizing training time against inference time, spending more compute to train a smaller, more powerful model, so that at inference time the latency is lower. And that's the key insight we're going to carry through this whole keynote.

Now, differently from v1, this time we also doubled the amount of high quality data, so we trained on up to 1 trillion tokens of code. The data mixture is roughly 200 billion tokens across five epochs, plus a linear cooldown at the end that really lets us squeeze the best possible performance out of the model. replit-code v1.5 supports 30 programming languages this time, and we also added a mixture coming from Stack Exchange posts oriented towards developers: questions about coding, questions about software engineering, and so forth. So this is the basis of our data.

Now let's take a look inside the dataset we used. We started from The Stack, an initiative led by BigCode, a group under the Hugging Face umbrella. We're very grateful for the work these people have been doing. They have built a big pipeline: getting data from GitHub, selecting top repositories, cleaning up part of the data, and especially keeping only code released under permissive licenses such as MIT, BSD, Apache 2.0, and so forth. Out of this mixture, we selected the top 30 languages. And then the key secret ingredient is how much time we spent working on the data. You must have heard this again and again: every time you go to an LLM talk, there's someone on stage telling you to pay attention to data quality. I'm here to tell you exactly the same thing once again. It's probably the most important thing you could be spending your time on, especially because the model I'm talking about today is trained from scratch. This is not a fine-tune; all the models we release have been trained from the very first token on data prepared by us.
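As a quick sanity check on the numbers Michele quotes above, the tokens-per-parameter figure follows directly from the data mixture. This is pure arithmetic on the figures from the talk; the roughly 20 tokens per parameter reference point is the Chinchilla-optimal ratio from the DeepMind paper.

```python
# Back-of-the-envelope for the v1.5 training run described in the talk.
params = 3.3e9          # 3.3B-parameter model
unique_tokens = 200e9   # ~200B-token high quality mixture
epochs = 5              # repeated five times, plus a linear cooldown

total_tokens = unique_tokens * epochs        # ~1 trillion tokens seen during training
tokens_per_param = total_tokens / params     # ~300 tokens per parameter

print(f"total training tokens: {total_tokens:.2e}")      # ~1.00e+12
print(f"tokens per parameter:  {tokens_per_param:.0f}")  # ~303
# For reference, the Chinchilla-optimal ratio is roughly 20 tokens per parameter,
# so this trades extra training compute for a smaller, faster model at inference time.
```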
So it's extremely important to have high data quality. We took inspiration from the initial quality pipelines described in the Codex and PaLM papers, and then we applied many more heuristics on top. We filter out code that is auto-generated, minified, or non-parsable: basically all the code that you wouldn't want the model to recommend back to you, because it's not something you would write yourself. We also remove toxic content, and all of this pipeline is built on Spark. I'm trying to encourage you to think about working on your own models too, because pretty much all of the base components are out there, available open source. You could really build the whole pipeline to train and serve an LLM with a lot of open source components. And as Swyx was saying, we've seen this crazy acceleration in the last nine months. If you wanted to do this in 2022, good luck with that. It feels like we're a decade ahead compared to last year, which is pretty amazing; even I didn't expect the field to move this fast.

The other insight that we kind of pioneered for our v1 model, and that turns out to be very powerful for this new one as well: a few weeks after we released v1, coincidentally, a very interesting paper was published called Scaling Data-Constrained Language Models. I highly recommend it; it's a great read and probably one of the most interesting results in LLMs, in my humble opinion. This intuition allowed us to train the model to completion. Rather than making trade-offs on data quality, it let us select a small, high quality subset of data and then repeat it several times. The key finding of the paper is basically in two plots (I'm going to share the slides so you can go and check the links): your loss curve after you repeat data four or five times is comparable to training on a novel dataset. Not only is this very useful because it allowed us to work only with high quality data, it also allowed us to work with data that is exclusively released under permissive licenses. Therefore, once again, the 1.5 model is being released open source, with a commercially permissive license, so you can use it. Just shoot us an email when you do, because I'm very curious whether you're having a good time with it.

Some details about the model training. We changed a few things here and there. It's a slightly larger model, 3.3B parameters, with a 4K context window; the old one was 2K. We trained a new domain-specific vocabulary of 32K tokens, so a small one, which helps us achieve even higher compression on the data. If you've been reading about LLMs, you know that from a simplistic point of view they are lossy data compressors. If your vocabulary lets you pack more data into fewer tokens, you're bringing more signal to the model while training. With this new vocabulary we're squeezing out a few extra percent, and it's a better vocabulary for code compared to what StarCoder or Code Llama are using today. We trained on 128 H100 80GB GPUs, which are as rare as gold at this point. We were on the MosaicML platform for about a week, and to our knowledge this is the first model officially announced to be trained on H100s and released open source. So we're very excited about it, and we follow a list of LLM best practices.
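Going back to the data pipeline for a moment: to make the filtering heuristics concrete, here is a minimal, self-contained sketch of the kind of checks described above. The specific markers and thresholds are illustrative guesses, the parse check is Python-only here, and Replit's actual pipeline runs filters like these at scale on Spark.

```python
# Illustrative filters for auto-generated, minified, or non-parsable code.
import ast

AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "do not edit", "generated by")

def looks_auto_generated(source: str) -> bool:
    """Check the file header for common code-generator banners."""
    head = source[:500].lower()
    return any(marker in head for marker in AUTOGEN_MARKERS)

def looks_minified(source: str, max_avg_line_len: int = 200) -> bool:
    """Very long average line length is a cheap proxy for minified code."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return True
    avg_len = sum(len(ln) for ln in lines) / len(lines)
    return avg_len > max_avg_line_len

def is_parsable_python(source: str) -> bool:
    """Drop files the language's own parser rejects (Python shown as an example)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def keep(source: str) -> bool:
    return (
        not looks_auto_generated(source)
        and not looks_minified(source)
        and is_parsable_python(source)
    )

print(keep("def add(a, b):\n    return a + b\n"))  # True
print(keep("x=1;" * 500))                          # False: minified-looking one-liner
```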
So of course we support Flash Attention; we have grouped-query attention, which allows us to achieve better inference performance; ALiBi positional embeddings; and the latest optimizers in the game. That is really the reason why, at the end, you will see very exciting numbers that I don't want to spoil right away.

Let's start from the base model, and then there is a surprise coming. This is the pass@1 evaluation on HumanEval. For those of you who have never heard of it, HumanEval is a benchmark released back in 2021 by OpenAI, if I recall correctly. The format is the following: you have a natural language description of a task in English, and you expect the model to generate a self-contained Python snippet, which is then run through a test harness. So you generate code, execute it, and check whether the outputs are exactly what you expect. An interesting evolution in the field over the last few months is that we are no longer content with benchmarking exclusively on Python, so we also evaluate across several different programming languages. This comes from the multilingual code evaluation harness, again built by BigCode, and they also maintain a very interesting leaderboard: they take models from several companies and open source contributors, run the evals themselves, and compile the leaderboard. So you will find us there, I guess, in a few days.

In the left column we have StarCoder-3B, which as of yesterday was the state-of-the-art model at the 3B parameter size across languages. And today our v1.5 is on top across every single language you see on the list. But what gets me excited is not so much that we are more powerful than StarCoder, which was released a few months ago. What got me hyped while we were training it is that we're very, very close to Code Llama 7B. As a reminder, Code Llama 7B is based on Meta's Llama 2 7B, which was trained on 2 trillion tokens of natural language and then given an additional pretraining phase of 500 billion tokens exclusively on code. So it's a model that is twice the size, with 2.5x more data and way more GPU compute. You see where I'm going: we're getting very close. How do we surpass Code Llama? Here is the trick. This is the other model we have been training in parallel, the Replit-tuned version. It means the following: we further pretrained the base model on 200 billion tokens of code, this time coming from our own developers. On Replit, when you create a public Repl, it's automatically published under the MIT license. So we use this code to further pretrain our model: we extract 30 billion tokens of code, same languages, same data filtering pipeline to retain only the top quality data. We do three epochs of that, then a linear cooldown, and we use the languages that are predominantly popular with Replit users, so not the same list as before. If you go on Replit, I would say 95% of people are mostly writing Python and JavaScript; these are the cool languages of today. Another key insight is that our cutoff for this model is literally a few weeks ago.
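A quick aside on the pass@1 numbers quoted throughout this comparison: this is the standard metric from the HumanEval (Codex) paper by Chen et al., 2021. A minimal implementation of the unbiased pass@k estimator, under the usual n-samples-per-problem setup, looks roughly like this.

```python
# Unbiased pass@k estimator from the HumanEval/Codex paper:
# given n generated samples per problem, of which c pass the unit tests,
# pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that pass the tests, k = sampling budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 72 of them pass -> pass@1 is just c/n = 0.36.
print(pass_at_k(200, 72, 1))   # 0.36
print(pass_at_k(200, 72, 10))  # much higher with a larger sampling budget
```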
That recent cutoff means that if there is a cool new library everyone has started writing software with in the last month, our model is capable of generating code that uses it, and we're going to keep these models up to date so we can follow the trends and make our developers happier.

Here is the table that I love. We're back to a back-to-back comparison. On the very left we have our base model; we didn't add StarCoder here for the sake of space, and the base model already tops it on every language, so it didn't make sense. We have Code Llama in between, and you can see that we are substantially better on pretty much every language. We get 36% on the OpenAI HumanEval benchmark. As a reminder, when I was working on PaLM-Coder, for example, that was the pass@1 result we published in early 2022. That model was more than 500 billion parameters, so almost 200x larger than this one, and it achieves the same HumanEval pass@1 performance. Same with code-davinci-001: if you go back to the paper, it gets exactly 36%. So we were pretty amazed when this happened.

Now, why do we go through all this struggle of training our own models? Not only because it's cool and we love doing this stuff; there is a rationale behind it. We really want to go as fast as possible with the most powerful small model we can train, because all of our models are optimized for inference rather than for being awesome at benchmarks. The fact that they also do well on benchmarks gives us a lot of pride, and it feels good when we do a vibe check with the model and it performs as we expect, or even better. But our key result is that on a single model, with no batching, we're generating above 200 tokens per second, and we tuned the architecture for speed in every possible way. We trained a smaller vocabulary, as I said before; we use Flash Attention with a Triton kernel; we use grouped-query attention. Every single aspect is there to make sure we can go as fast as we can. And we optimize for serving on the Triton Inference Server, with acceleration frameworks such as TensorRT-LLM that really squeeze the last drop out of the GPUs.

The other very interesting insight is that we worked very hard to make model deployment much faster. If you've ever had the bad luck to work with Kubernetes, you know how long it can take to get your pod scheduled, download all the dependencies, build everything, and so on. The very first time we brought this infrastructure up, it took 18 minutes to go from clicking until the model was deployed. If you want to adapt to the load the application is receiving, 18 minutes is an eternity; if there is a traffic spike, good luck with that. So one of our awesome engineers, Bradley (you'll find him at the booth later today), brought this number from 18 minutes down to just two minutes. There's a laundry list of tricks he used; I'm not going to go through them, just talk to Brad. The cool insight here is that whenever we get more load, we can now react very quickly, and that's how we serve a very large user base. The moment Amjad announced AI for all, literally ten minutes ago, we flipped the switch, and now code completion is in front of all our users. That's how we made this happen.

Now, I've been asked several times: guys, why are you releasing your model open source? You put so much effort in; maybe that's not an advantage for a company?
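For readers who want to get a rough feel for single-stream generation speed themselves, here is a hedged sketch using Hugging Face transformers. Note this is not the Triton Inference Server / TensorRT-LLM stack Michele describes, so the numbers will differ, and the model id below is my assumption of the published repo name.

```python
# Rough single-stream tokens/sec check with Hugging Face transformers (requires torch,
# transformers, and accelerate; a GPU is strongly recommended).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "replit/replit-code-v1_5-3b"  # assumed repo name; check Hugging Face for the exact id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec (single stream, no batching)")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```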
It turns out that the moment we released it open source, we got a lot of adoption, and apart from a lot of love, which always feels good (and it feels good to chat with other people in AI who are using what we build), we also started to get fine-tuned and instruct-tuned versions of it. We have seen a lot of people take our small model, deploy it locally, say with GGML, which goes super fast on Apple silicon, and build their own custom, privacy-aware GitHub Copilot alternative with replit-code v1. So we expect the same to happen with v1.5 in the next few days. As we speak, the model is available on Hugging Face, and we're working on the README. Come to the booth: the engineer who is the mastermind behind it will tell you every single detail on how to make it run in production, and we're going to be here until tonight, so we're more than happy to play with the model together.

Now, in the last minute I have left, I want to give you a teaser of what we're going to be doing in the next weeks. We've lined up a few very exciting collaborations. The first one is with Glaive AI, a company that builds synthetic datasets. We're working on an instruction-fine-tuned version of our models, trained over 210,000 coding instructions. We're already seeing very exciting results; we want to triple-check them, so follow us on Twitter, and the moment we're sure it performs as we expect, it will be out there and you'll be able to play with it. Second announcement: we're also collaborating with Morph Labs. I think Jesse is here today and is going to run a session later explaining exactly what this new format does; I'll give you a teaser, then go to Jesse's talk and he'll explain all the details. We are design partners on a format called fill-in-the-syntax-tree. You might have heard of fill-in-the-middle, the idea that you can take your file, split it in half, and, if you're writing code in between, tell the LLM that the top of the file is your prefix and the bottom is your suffix, giving the model the context it needs to know which part it should fill. We found that this new format, which is aware of the abstract syntax tree underlying the source code, is even more powerful. We're seeing very promising results already, and again, this will be out in just a matter of days or weeks. Last thing: we have a collaboration with the Perplexity AI folks. You might have used their Labs, a place where they host models incredibly fast. replit-code v1.5 will appear there, so you can start to play with it and get a vibe check by tonight. Thanks, everyone.

Ladies and gentlemen, please welcome to the stage the inventor of AutoGPT, Toran Bruce Richards, and his team.









