
AI Engineer World's Fair • San Francisco

Personal Favorite

Optimizing inference for voice models in production

How do you get time to first byte (TTFB) below 150 milliseconds for voice models -- and scale it in production? As it turns out, open-source TTS models like Orpheus have an LLM backbone that lets us use familiar tools and optimizations like TensorRT-LLM and FP8 quantization to serve the models with low latency. But client code, network infrastructure, and other outside-the-GPU factors can introduce latency in the production stack. In this talk, we'll cover the basic mechanics of TTS inference, common pitfalls to avoid when integrating these models into production systems, and how to extend this high-performance system to serve customized models with voice cloning and fine-tuning.
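One of those outside-the-GPU factors is how the client consumes the audio stream. As a minimal sketch of measuring TTFB from the client side (the endpoint URL and payload are hypothetical placeholders, not the specific API discussed in the talk):

```python
import time
import httpx

# Hypothetical streaming TTS endpoint and payload -- substitute your own service.
TTS_URL = "https://example.com/v1/tts/stream"
PAYLOAD = {"text": "Hello from a streaming TTS model.", "voice": "default"}

def measure_ttfb_ms(url: str, payload: dict) -> float:
    """Milliseconds from sending the request to receiving the first audio bytes."""
    start = time.perf_counter()
    with httpx.stream("POST", url, json=payload, timeout=30.0) as response:
        response.raise_for_status()
        for chunk in response.iter_bytes():
            if chunk:  # the first non-empty chunk marks time to first byte
                return (time.perf_counter() - start) * 1000
    raise RuntimeError("Stream ended without returning any audio")

if __name__ == "__main__":
    print(f"TTFB: {measure_ttfb_ms(TTS_URL, PAYLOAD):.1f} ms")
```

Measuring at the client rather than the server surfaces the network and client-code latency that GPU-side profiling misses.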


NVIDIA GTC • San Jose

Personal Favorite

Advanced Techniques for Inference Optimization With TensorRT-LLM

Learn new strategies for large language model (LLM) inference optimization that apply cutting-edge research in battle-tested production implementations to massively improve latency, throughput, and cost. We’ll go beyond the basics of engine building and quantization with in-depth explorations of why and how to use new techniques like speculative decoding, tensor parallelism, and LoRA swapping with TensorRT-LLM.
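As one hedged illustration of the kind of configuration the session goes into, here is a sketch using TensorRT-LLM's high-level LLM API to shard a model across two GPUs with tensor parallelism (the model name and sampling values are placeholders, and parameter names can vary across TensorRT-LLM releases; speculative decoding and LoRA swapping have their own configuration, omitted here):

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model; tensor_parallel_size=2 shards the weights across two GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)
```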


AWS re:Invent • Las Vegas

High-performance inference for frontier AI models

How do you go from state-of-the-art foundation model to a globally available usage-based API? This session provides an overview of the roadmap and architecture from model performance optimization to multi-region GPU-based infrastructure setup to the precise mechanics of request prioritization, token accounting, and rate limiting. Building on an inference stack that includes technologies like NVIDIA Dynamo, TensorRT-LLM, and Amazon EKS, we'll trace the path from model weights to a robust production service.
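As a toy illustration of the token accounting and rate limiting mechanics mentioned above (a minimal sketch, not the production implementation described in the session), a token-bucket limiter charged per generated token might look like this:

```python
import threading
import time

class TokenBucket:
    """Minimal token-bucket limiter: `capacity` tokens, refilled at `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def try_consume(self, cost: float) -> bool:
        """Charge `cost` tokens (e.g. tokens used by a request); return False if over the limit."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.last_refill) * self.refill_rate,
            )
            self.last_refill = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

# One bucket per API key: 100k tokens of burst, refilled at 1k tokens per second.
bucket = TokenBucket(capacity=100_000, refill_rate=1_000)
print(bucket.try_consume(4_096))  # True while the caller is within quota
```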


Open Data Science Conference • San Francisco

Milliseconds Matter: 2x Faster Embedding-Based Search, Retrieval, and RecSys in Production

Applications from search and retrieval to recommendation systems and classifiers increasingly depend on embedding models built with LLM architectures, with new models like Qwen 3 Embed setting state-of-the-art benchmarks for accuracy. But applications like search and RecSys have extremely tight p99 latency SLAs, must support massive numbers of concurrent users, and need to operate cost-effectively. This session covers the end-to-end journey of optimizing embedding model inference in production. We start with the runtime layer, evaluating options from TEI and vLLM to a custom TensorRT-LLM-based runtime. Then, we look at the serving layer and its batching mechanisms. Finally, we turn our attention to the infrastructure and client code components, where small adjustments can yield massive returns in observed latency under real traffic. You’ll walk away from this session with a complete understanding of the path from promising prototype to production deployment.
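As a minimal sketch of the kind of client-side measurement this session builds on (the embeddings endpoint, model name, and payload are hypothetical placeholders), here is one way to record p50/p99 latency over repeated requests:

```python
import statistics
import time
import httpx

# Hypothetical OpenAI-compatible embeddings endpoint -- substitute your deployment.
EMBED_URL = "https://example.com/v1/embeddings"
PAYLOAD = {"model": "qwen3-embed", "input": ["milliseconds matter in production search"]}

def time_request(client: httpx.Client) -> float:
    """Wall-clock milliseconds for one embedding request."""
    start = time.perf_counter()
    response = client.post(EMBED_URL, json=PAYLOAD, timeout=10.0)
    response.raise_for_status()
    return (time.perf_counter() - start) * 1000

with httpx.Client() as client:
    latencies = [time_request(client) for _ in range(200)]

p50 = statistics.median(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th-percentile cut point
print(f"p50: {p50:.1f} ms, p99: {p99:.1f} ms")
```

Tracking the tail percentiles rather than the mean is what reveals the effect of batching, queueing, and client-side overhead under real traffic.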


PyTorch Conference • San Francisco

Low-Precision Inference without Quality Loss: Selective Quantization and Microscaling

Everyone wants faster inference, but no one wants to compromise the quality of their model outputs. FP8 quantization offers 30-50% lower latencies for inference on large models, but must be applied carefully to maintain quality. Recently, NVIDIA Blackwell GPUs introduced new microscaling number formats (MXFP8, MXFP4, NVFP4) and new kernel options for low-precision inference. In this talk, Baseten inference engineers will cover practical applications of quantization to quality-sensitive inference tasks with a focus on selecting which parts of the inference system to quantize (weights, activations, KV cache, attention) and how microscaling number formats help preserve dynamic range.
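To make the dynamic-range argument concrete, here is a simplified simulation in PyTorch (an illustration, not the talk's method; it requires a PyTorch version with float8 dtypes, and real MX formats additionally constrain the shared block scales to powers of two) comparing per-tensor FP8 scaling with per-block microscaling:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def per_tensor_fp8(x: torch.Tensor) -> torch.Tensor:
    """One scale for the whole tensor: simple, but a few outliers shrink everything else."""
    scale = x.abs().max() / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)
    return q.to(torch.float32) * scale  # dequantized reconstruction

def per_block_fp8(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Microscaling-style: one scale per block of 32 values preserves local dynamic range."""
    xb = x.reshape(-1, block)
    scales = xb.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX
    q = (xb / scales).to(torch.float8_e4m3fn)
    return (q.to(torch.float32) * scales).reshape(x.shape)

x = torch.randn(1024)
x[::100] *= 100.0  # inject a few outliers to stress dynamic range

print("per-tensor error:", (x - per_tensor_fp8(x)).abs().mean().item())
print("per-block error: ", (x - per_block_fp8(x)).abs().mean().item())
```

The per-block reconstruction error is typically much lower for outlier-heavy tensors, which is the intuition behind applying microscaling formats to quality-sensitive parts of the inference stack.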


Significance Summit • San Francisco

Inference Engineering for Hypergrowth

Philip Kiely (Head of Dev Relations, Baseten) breaks down how inference engineering powers AI at scale—and why it’s key to driving hypergrowth.


The AI Conference • San Francisco

Real-Time, Real Problems: Scaling AI In The Wild

Real-time generative AI applications promise transformative user experiences—but scaling inference in production is a constant battle. Demand spikes, infrastructure meltdowns, and unpredictable model behaviors lurk behind every release.


AI Engineer World's Fair • San Francisco

Introduction to LLM serving with SGLang

Do you want to learn how to serve models like DeepSeek and Qwen at SOTA speeds on launch day? SGLang is an open-source, fast serving framework for LLMs and VLMs that generates trillions of tokens per day at companies like xAI, AMD, and Meituan. This workshop guides AI engineers who are already familiar with frameworks like vLLM, Ollama, and TensorRT-LLM through deploying and optimizing their first model with SGLang, and provides guidance on when SGLang is the appropriate tool for an LLM workload.
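As a minimal sketch of the workflow the workshop walks through (the model name and port are placeholders, and the served model identifier can differ between SGLang versions), launch an SGLang server and query its OpenAI-compatible endpoint:

```python
# First, launch the server in a separate terminal (placeholder model and port):
#   python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API, so the standard client works unchanged.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    # "default" refers to the single launched model in many SGLang versions;
    # otherwise use the model path reported by the server.
    model="default",
    messages=[{"role": "user", "content": "In one sentence, what is RadixAttention?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```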


Optimized AI Conference • Atlanta

The Golden Triangle of Inference Optimization: Balancing Latency, Throughput, and Quality

Philip Kiely, Head of Developer Relations at Baseten, presents the “Golden Triangle” of inference optimization—balancing latency, throughput, and quality in large-scale AI systems. Just as software engineering has its classic trade-off of “good, fast, cheap—pick two,” inference optimization requires carefully navigating competing priorities to deliver production-ready AI applications.


AI Engineer World's Fair • San Francisco

From model weights to API endpoint with TensorRT-LLM

TensorRT-LLM is one of the highest-performance model serving frameworks available, but it can have a steep learning curve when you’re just getting started. We run TensorRT and TensorRT-LLM in production and have seen both the incredible performance gains they offer and the hurdles to overcome in getting them up and running. In this workshop, participants will learn how to start using TensorRT-LLM: selecting a model to optimize, building an engine for it, setting batch sizes and sequence lengths, and running it on a cloud GPU.
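As a hedged sketch of the engine-building step using TensorRT-LLM's high-level LLM API (the model name is a placeholder and parameter names can vary between releases), batch size and sequence lengths are set through a build configuration:

```python
from tensorrt_llm import LLM, BuildConfig, SamplingParams

# Engine-level limits: maximum batch size and sequence lengths the engine supports.
build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=2048,
    max_seq_len=4096,  # prompt plus generated tokens
)

# Placeholder model; the engine is built with the limits above when the LLM is created.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", build_config=build_config)

outputs = llm.generate(
    ["Summarize what an inference engine does."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Choosing these limits up front matters because they fix the memory the engine reserves for activations and KV cache on the target GPU.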

Speaker bio

Philip Kiely is the author of Inference Engineering and is employee number ten at Baseten. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted SF sports teams.

Speaker kit (PDF) • Speaker rider (PDF) • Headshot