
- Employment: Full-time
- Seniority: Staff
About the role
What you will be doing:
- Partner with Engineering teams to design resilient services, architectures, and deployment patterns.
- Define and promote SRE practices including SLIs, SLOs, error budgets, capacity planning, incident response, and post-incident learning.
- Identify systemic reliability risks and work with teams to address root causes.
- Help reduce operational toil through automation, tooling, and better engineering practices.
- Work actively with Engineering teams during design, development, and production-readiness reviews.
- Advise and challenge teams on service architecture, fault tolerance, scalability, observability, deployment safety, and operational readiness, helping them to make pragmatic trade-offs.
- Support teams in diagnosing complex performance, latency, throughput, and resource-utilisation issues.
- Help establish engineering standards and reusable patterns for reliable, maintainable services.
- Lead investigations into performance bottlenecks across applications, infrastructure, databases, queues, networks, and third-party dependencies.
- Improve observability through metrics, logs, traces, dashboards, alerting, and service-level indicators.
- Help teams design meaningful alerts that identify user-impacting issues while reducing noise.
- Drive capacity planning and load-testing practices for critical systems.
- Build and improve automation, deployment tooling, infrastructure-as-code, monitoring, and reliability platforms.
- Contribute to CI/CD improvements, release safety, rollback strategies, and progressive delivery practices.
- Develop tools that help Engineering teams self-serve reliability, diagnostics, and operational insights.
- Improve cloud, container, and orchestration environments with a focus on security, reliability, and scalability.
- Participate in incident response for high-priority production issues.
- Lead or contribute to blameless post-incident reviews.
- Ensure actions from incidents result in improvements to architecture, tooling, monitoring, or process.
- Mentor engineers on production ownership and operational best practices.
What you will bring to the role:
- Experience in Site Reliability Engineering or senior backend/software engineering roles.
- Software engineering background, with the ability to write clean, maintainable production code.
- Experience working with Engineering teams to influence architecture and improve production readiness.
- Understanding of distributed systems, scalability, resiliency patterns, failure modes, and performance engineering.
- Experience diagnosing complex production issues across application and infrastructure layers.
- Hands-on experience with cloud platforms such as AWS, Azure, or GCP.
- Hands-on experience with on-premises environments and virtualization.
- Experience with containers and orchestration technologies; Kubernetes is a must.
- Knowledge of observability tooling, including metrics, logging, tracing, dashboards, and alerting.
- Experience with infrastructure-as-code tools such as Terraform.
- Experience with CI/CD pipelines and safe deployment practices.
- Strong scripting or programming skills in languages such as Python, Go, Java, C#, JavaScript/TypeScript, or similar.
- Clear, structured communication skills, with the ability to explain complex technical issues to engineering and leadership audiences.
Diversity, Inclusion, and Equal Opportunity
Posted by Intermedia Intelligent Communications.