Senior Site Reliability Engineer

Lumalabs Ai

United States · $170k–290k · Remote · 6mo ago
Employment: Full-time
Seniority: Senior

About the role

  • Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
  • Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
  • Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.
  • 5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
  • Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
  • Expert in Technologies: You have working experience with Terraform, Airflow, and Ray.
  • Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
  • Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
  • Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
  • Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
  • Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.
  • Deep expertise with GPU tooling for NVIDIA and AMD GPUs, such as DCGM or ROCm.
  • Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
  • Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
  • Deep expertise in data pipelines and infrastructure.
