Philip Kiely & Pankaj Gupta - From model weights to API endpoint with TensorRT-LLM
Video Available!
GPUs & Inference: TensorRT-LLM is the highest-performance model serving framework, but it can have a steep learning curve when you’re just getting started. We run TensorRT and TensorRT-LLM in production and have seen both the incredible performance gains they offer and the hurdles to overcome in getting them up and running. In this workshop, participants will learn how to start using TensorRT-LLM: selecting a model to optimize, building an engine for it with TensorRT-LLM, setting batch sizes and sequence lengths, and running the engine on a cloud GPU.
https://github.com/basetenlabs/Workshop-TRT-LLM
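To give a feel for the workflow the workshop covers, here is a minimal sketch using TensorRT-LLM’s high-level LLM API. Exact class and parameter names vary across TensorRT-LLM releases, and the model ID and settings below are illustrative choices, not the workshop’s exact code; see the repo above for the real materials.

```python
# Minimal sketch of the build-and-run flow with TensorRT-LLM's LLM API.
# Assumes a recent TensorRT-LLM release and a GPU-equipped machine;
# names and defaults differ between versions.
from tensorrt_llm import LLM, BuildConfig, SamplingParams

# Engine build settings: batch size and sequence lengths are fixed at
# build time, so size them for the traffic you expect to serve.
build_config = BuildConfig(
    max_batch_size=8,     # maximum concurrent requests per forward pass
    max_input_len=1024,   # longest prompt the engine will accept
    max_seq_len=2048,     # prompt plus generated tokens combined
)

# Passing a Hugging Face model ID (illustrative choice) or a local
# checkpoint path triggers an engine build for this configuration.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    build_config=build_config,
)

# Run inference against the built engine.
outputs = llm.generate(
    ["What is TensorRT-LLM?"],
    SamplingParams(max_tokens=64, temperature=0.8),
)
for output in outputs:
    print(output.outputs[0].text)
```

Because the batch size and sequence lengths are baked into the engine, changing them later means rebuilding; choosing them well up front is one of the main tuning steps the workshop walks through.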
Early Bird and General Admission tickets have both sold out.
Please join us online for the free livestream.