Software Engineer, Site Reliability
fal
Turkey · 4w ago
About the role
<p>You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems — from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.</p>
<h3><strong>Key Responsibilities</strong></h3>
<ul>
<li>Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads</li>
<li>Build and maintain CI/CD pipelines and deployment infrastructure</li>
<li>Use AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability</li>
<li>Build dashboards, alerting, and anomaly detection across our systems</li>
<li>Define and enforce SLOs and build out incident response processes</li>
<li>Manage and improve our networking, load balancing, and service mesh configurations</li>
<li>Drive reliability improvements across the stack through automation, runbooks, and chaos engineering</li>
</ul>
<h3><strong>Requirements</strong></h3>
<ul>
<li>5+ years of experience managing critical production systems and software development workflows</li>
<li>Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)</li>
<li>Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS</li>
<li>Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)</li>
<li>Proficiency in Python and either Go or Bash for tooling and automation</li>
<li>Strong experience with logging, monitoring, and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)</li>
<li>Excellent communication and ability to drive technical decisions across teams</li>
<li>Self-starter who executes quickly, takes ownership, and constantly seeks improvement</li>
</ul>
<h3><strong>Nice to have</strong></h3>
<ul>
<li>Experience managing GPU and AI/ML workloads</li>
<li>Experience with kernel-based monitoring and routing (eBPF, XDP)</li>
<li>Experience with security tooling (Falco, Coroot, SIEM)</li>
<li>Experience with bare metal Kubernetes networking (Calico, Cilium, MetalLB)</li>
<li>Experience with distributed storage systems (Ceph, Longhorn, etc.)</li>
</ul>
<h3><strong>Location</strong></h3>
<ul>
<li>Turkey</li>
</ul>
<h3><strong>What we offer at fal</strong></h3>
<ul>
<li>Interesting and challenging work</li>
<li>A lot of learning and growth opportunities</li>
<li>Regular team events and offsites</li>
</ul>