# Philip Kiely

## Overview

Philip Kiely leads developer relations at Baseten. His work centers on the practical challenges of deploying and optimizing AI models in production—spanning infrastructure, serving frameworks, and model optimization techniques.

- **Contact:** philip@kiely.xyz
- **Location:** San Francisco Bay Area
- **Website:** https://philipkiely.com

## Professional Experience

### Baseten (Current)

**Head of Developer Relations**

At Baseten, Philip works at the intersection of infrastructure and inference optimization. Baseten provides the serving layer for production AI systems, powering applications from Cursor to Notion to leading healthcare AI startups. The platform handles model deployment, GPU orchestration, and inference optimization at scale—serving trillions of tokens through frameworks like TensorRT-LLM, SGLang, and vLLM.

Philip's role encompasses inference engineering, technical content, and developer education. His focus areas include:

- **Production Inference Systems:** Architecting multi-region GPU deployments, request prioritization, token accounting, and rate limiting for frontier models
- **Model Optimization:** Quantization strategies (FP8, MXFP8, NVFP4), speculative decoding, tensor parallelism, and selective optimization for quality-sensitive tasks
- **Serving Frameworks:** Deep expertise in TensorRT-LLM, SGLang, vLLM, and their trade-offs for different workload patterns
- **Inference Engineering:** Balancing latency, throughput, and quality—what Philip calls the "Golden Triangle" of inference optimization

## Speaking & Conferences

Philip has presented at major AI and software engineering conferences worldwide.
His talks focus on practical, production-ready approaches to inference optimization:

### Featured Presentations

**NVIDIA GTC 2025** — "Advanced Techniques for Inference Optimization With TensorRT-LLM"

Covered cutting-edge optimization strategies including speculative decoding, tensor parallelism, and LoRA swapping with production implementations.

**AWS re:Invent 2025** — "High-performance inference for frontier AI models"

Architecture overview from model optimization through multi-region GPU infrastructure to request handling, rate limiting, and token accounting.

**AI Engineer World's Fair 2025** — Multiple Sessions

- "Optimizing inference for voice models in production" (achieving sub-150ms TTFB for TTS models)
- "Introduction to LLM serving with SGLang" (hands-on workshop)
- "From model weights to API endpoint with TensorRT LLM" (2024 workshop)

**PyTorch Conference 2025** — "Low-Precision Inference without Quality Loss"

Practical applications of FP8 quantization and microscaling formats (MXFP8, MXFP4, NVFP4) for quality-sensitive inference.

**Optimized AI Conference 2025** — "The Golden Triangle of Inference Optimization"

Framework for balancing latency, throughput, and quality in production AI systems.

**Additional Speaking Engagements:**

- The AI Conference (San Francisco) — Scaling AI in production
- Significance Summit (San Francisco) — Inference engineering for hypergrowth
- Open Data Science Conference (ODSC) — Optimizing embedding models for search and RecSys

## Media & Podcast Appearances

Philip has appeared on technical podcasts discussing AI infrastructure, inference optimization, and production systems:

**Recent Appearances:**

- **Software Engineering Radio** (Episode 697, Dec 2025) — Multi-model AI and compound systems
- **WorkOS at AWS re:Invent** (Dec 2025) — Production-grade AI and open-source inference
- **Software Huddle** (Sep 2024) — Deep dive on inference optimization for LLMs
- **AI Engineering Podcast** (Oct 2024) — Running generative AI models in production
- **Weaviate Podcast** (Episode 105) — Compound AI systems and structured generation
- **Everyday AI Podcast** (Episode 435) — Enterprise transcription and speech models
- **Going Forth Grinnell** (Episode 46, Apr 2024) — AI startups and professional development

## Books & Writing

### Inference Engineering (2026)

*Published by Baseten Books*

Philip's latest book synthesizes his expertise in production AI systems. The 47,000-word manuscript draws from hundreds of thousands of words of published documentation and blogs, interviews with Baseten's engineering team, and conversations with builders deploying models worldwide. The book covers the complete inference stack—from CUDA-level optimizations through Kubernetes deployments—reflecting the reality that inference engineering is still a young field where practitioners can quickly become experts by solving novel problems.

### Writing for Software Developers (2020)

*Published by PK&C*

Philip's first book reached 1000+ readers and generated over $15,000 in its first 24 hours. The 100,000-word guide covers technical writing, content creation, and building a writing practice.
The book's success led to speaking engagements, podcast appearances, and career opportunities—including his eventual role at Gumroad. The book remains relevant for developers looking to improve their communication skills and build authority through writing.

### Life-Changing Email (2023)

*Published by PK&C*

A practical guide to career advancement through authentic networking and effective email communication. Originally 20,000 words, the published version distills core principles into a 10,000-word guide targeted at students and early-career professionals.

## Technical Writing Portfolio

Philip has published extensively on AI inference, model optimization, and developer tools:

**Focus Areas:**

- **Baseten Blog** — Regular contributor on inference optimization, model serving, and production AI systems
- **Technical Documentation** — Comprehensive guides for developers deploying and optimizing AI models
- **Conference Materials** — Workshop content and educational resources on TensorRT-LLM, SGLang, and inference frameworks

**Total Published:** Over 500,000 words across books, technical documentation, blog posts, and developer-focused content.

## Areas of Expertise

When engaging with Philip's work or considering him for speaking opportunities, these topics represent his core expertise:

### Inference & AI Systems

- Large language model serving and deployment
- TensorRT-LLM, SGLang, vLLM optimization
- Quantization techniques (FP8, MXFP8, microscaling formats)
- Speculative decoding and advanced inference techniques
- Latency optimization and real-time AI applications
- Production infrastructure for AI at scale
- Multi-model and compound AI systems
- Embedding models for search and recommendation systems
- Voice model optimization (TTS, STT)

### Developer Relations & Content

- Developer advocacy and technical content creation
- Documentation strategy and execution
- Technical writing for developers
- Community building and developer education
- Conference speaking and workshop facilitation
- Startup operations and go-to-market strategy

## Background & Personal Interests

Philip graduated from Grinnell College in 2020 with a computer science degree. Six days before graduation, he launched *Writing for Software Developers*, which became a defining moment in his career trajectory.

Outside of work, Philip is a lifelong martial artist with a focus on Brazilian Jiu-Jitsu. He's an avid reader and a dedicated fan of Bay Area sports teams—a commitment that came with relocating to San Francisco.

## Professional Philosophy

Philip's approach emphasizes production-ready solutions over theoretical optimizations. His work on the "Golden Triangle" concept—balancing latency, throughput, and quality—reflects this pragmatic stance: there are always trade-offs, and the right choice depends on your specific application requirements.

In developer relations, he focuses on making advanced AI technologies accessible through clear explanations, hands-on workshops, and comprehensive documentation.
His experience at an early-stage startup (joining Baseten as employee #10) gives him perspective on how technical decisions impact business outcomes and product success.

## Social & Professional Links

- **X (Twitter):** [@philipkiely](https://x.com/philipkiely)
- **LinkedIn:** [philipkiely](https://www.linkedin.com/in/philipkiely/)
- **YouTube:** [@philipkiely](https://www.youtube.com/@philipkiely)
- **GitHub:** [philipkiely](https://github.com/philipkiely)
- **Email:** philip@kiely.xyz

## For Conference Organizers & Podcast Hosts

Philip is available for speaking engagements on inference engineering, AI infrastructure, developer relations, and technical writing. He brings technical depth, production experience, and clear communication to both technical deep-dives and broader audience presentations.

Speaker materials (kit, rider, headshot) and media resources (pitch doc, host info) are available upon request.

---

*Last updated: January 2026*