Software Engineer, Site Reliability

fal

Turkey · 4w ago

About the role

<p>You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems, from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.</p> <h3><strong>Key Responsibilities</strong></h3> <ul> <li>Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads</li> <li>Build and maintain CI/CD pipelines and deployment infrastructure</li> <li>Use AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability</li> <li>Build dashboards, alerting, and anomaly detection across our systems</li> <li>Define and enforce SLOs and build out incident response processes</li> <li>Manage and improve our networking, load balancing, and service mesh configurations</li> <li>Drive reliability improvements across the stack through automation, runbooks, and chaos engineering</li> </ul> <h3><strong>Requirements</strong></h3> <ul> <li>5+ years of experience managing critical production systems and software development workflows</li> <li>Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)</li> <li>Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS</li> <li>Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)</li> <li>Proficiency in Python and either Go or Bash for tooling and automation</li> <li>Strong experience with logging, monitoring, and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)</li> <li>Excellent communication skills and the ability to drive technical decisions across teams</li> <li>Self-starter who executes quickly, takes ownership, and constantly seeks improvement</li> </ul> <h3>Nice to have</h3> <ul> <li>Experience managing GPU and AI/ML workloads</li> <li>Experience with kernel-based monitoring and routing (eBPF, XDP)</li> <li>Experience with security tooling (Falco, Coroot, SIEM)</li> <li>Experience with bare-metal Kubernetes networking (Calico, Cilium, MetalLB)</li> <li>Experience with distributed storage systems (Ceph, Longhorn, etc.)</li> </ul> <h3><strong>Location</strong></h3> <ul> <li> <p>Turkey</p> </li> </ul> <h3><strong>What we offer at fal</strong></h3> <ul> <li>Interesting and challenging work</li> <li>A lot of learning and growth opportunities</li> <li>Regular team events and offsites</li> </ul>
