
Staff ML Platform Engineer

TrueFoundry

Bengaluru · Hybrid · 4mo ago
Seniority
Staff

About the role

<p><strong>About TrueFoundry<br></strong><br>Every production AI system, whether it's powering customer support, writing code, analyzing financial data, or diagnosing medical conditions, needs the same foundational infrastructure. A way to route between models. A way to manage tools and integrate them securely. A way to orchestrate agents and enforce governance. A unified compute layer to run it all.</p> <p><strong>That infrastructure layer is being built right now.</strong></p> <p>We're TrueFoundry, and we're building it. We're looking for a Staff ML Platform Engineer to join the team.</p> <h2><strong>The Problem We're Solving</strong></h2> <p>Companies are moving beyond simple chatbots to production agentic systems. These systems route between OpenAI, Anthropic, Google, and self-hosted models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents.</p> <p>The infrastructure to support this doesn't exist yet. You can't just duct-tape together a few API calls and call it production-ready.</p> <p>You need a control plane that handles:</p> <ul> <li>Intelligent routing with observability, cost policies, and fallback logic</li> <li>Centralized tool and MCP server management with security and lifecycle controls</li> <li>Agent orchestration with governance and guardrails</li> <li>A unified compute layer to run self-hosted models, custom tools, and agents</li> </ul> <p>We've built two products to solve this:</p> <p><strong>AI Gateway</strong> is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.</p> <p><strong>AI Deploy</strong> is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.</p> <p>We're Series A, backed by Intel Capital and Sequoia. 
Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.</p> <p>We're looking for <strong>ML Engineers</strong> who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.</p> <h3><strong>What You’ll Work On</strong></h3> <ul> <li><strong>Write clean, modular, and scalable Python code</strong>, with a strong emphasis on reliability and performance.</li> <li><strong>Build a platform for training and fine-tuning large-scale ML models</strong> across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.</li> <li><strong>Own the infrastructure and code</strong> that enables high-throughput, low-latency inference pipelines for state-of-the-art models.</li> <li><strong>Build a platform for developing, deploying, and evaluating</strong> agentic applications for our end customers.</li> <li>Help shape internal standards and best practices across the engineering team for high-scale ML workloads.</li> </ul> <h3><strong>What We’re Looking For</strong></h3> <ul> <li><strong>5+ years of hands-on experience</strong> building and deploying ML systems at scale.</li> <li>5+ years of writing production-quality, high-performance code.</li> <li>Deep experience with <strong>multi-GPU/multi-node training</strong>, ideally with PyTorch as your primary framework.</li> <li>Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).</li> <li>Experience with <strong>Kubernetes</strong> is highly preferred; exposure to Kubernetes-native tools is a huge plus.</li> <li>A pragmatic mindset: you know when to optimize and when to ship.</li> <li>Bonus: Familiarity with open-source LLM training/fine-tuning.</li> </ul> <h3><strong>Why Join TrueFoundry?</strong></h3> <ul> <li>Work directly with 
<strong>ex-Facebook engineers</strong> and <strong>founders who are alumni of IIT Kharagpur, UC Berkeley, and Y Combinator</strong>.</li> <li>First-hand exposure to building and scaling a <strong>deep-tech startup</strong>, with insights you’ll carry if you want to start your own one day.</li> <li>Be part of a <strong>fearlessly experimental culture</strong> focused on customer success and long-term impact.</li> <li>Flexible hours, learning credits, and the opportunity to work <strong>shoulder-to-shoulder with the co-founders</strong>.</li> </ul>
