Software Engineer, Infrastructure

fal

San Francisco · 4w ago

About the role

<p>You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can’t fix, you drive resolution with partners.</p> <h3><strong>Key responsibilities</strong></h3> <ul> <li>Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, and RMAs</li> <li>Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting</li> <li>Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)</li> <li>Make heavy use of AI to build tools and automate alerting and recovery</li> <li>Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation</li> <li>Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage</li> <li>Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)</li> <li>Develop a suite of automated error detection and recovery processes</li> <li>Work with partners to solve technical issues</li> </ul> <h3><strong>Requirements</strong></h3> <ul> <li>3+ years of experience managing bare-metal and cloud-based server fleets at scale (100+ nodes)</li> <li>Strong software engineering skills in Python; you write production tooling, not scripts</li> <li>Deep Linux systems knowledge: boot process, kernel tuning, networking, storage,
systemd, cgroups, namespaces, performance profiling</li> <li>Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init</li> <li>Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning</li> <li>Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)</li> <li>Experience building internal tools or dashboards for infrastructure visibility</li> <li>Excellent communication and ability to drive technical decisions across teams</li> <li>Self-starter who executes quickly, takes ownership, and constantly seeks improvement</li> </ul> <h3><strong>Nice to have</strong></h3> <ul> <li>Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)</li> <li>Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2</li> <li>Experience with AMD GPUs</li> <li>Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)</li> <li>Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)</li> </ul> <h3><strong>Compensation</strong></h3> <ul> <li>$180,000-$250,000 plus equity and benefits</li> </ul> <h3><strong>Location</strong></h3> <ul> <li> <p>San Francisco, CA (we are open to remote in the US for Senior and Staff levels)</p> </li> </ul> <h3><strong>What we offer at fal</strong></h3> <ul> <li>Interesting and challenging work</li> <li>A lot of learning and growth opportunities</li> <li>Relocation assistance to San Francisco</li> <li>Health, dental, and vision insurance (US)</li> <li>Regular team events and offsites</li> </ul>

Perks & benefits

  • Vision Insurance
  • Equity Compensation
