- Employment: Full-time
About the role
What you will be doing:
- Ensure the availability, performance, and reliability of critical applications and services by designing and implementing robust monitoring, alerting, and optimization strategies.
- Define, measure, and maintain SLIs, SLOs, and error budgets to support service reliability goals.
- Partner with development teams to improve performance, reduce latency, and increase the resilience of applications in production.
- Work closely with platform and DevOps teams to ensure smooth alignment between infrastructure and application reliability.
- Define reliability standards and operational guardrails for platform capabilities and golden paths.
- Partner with platform engineering teams to design resilient self-service capabilities.
- Automate operational tasks such as deployments, rollbacks, scaling, failover, and recovery processes.
- Continuously improve CI/CD pipelines to reduce manual intervention and support safe, progressive delivery practices.
- Integrate automated validation, reliability checks, and operational guardrails into development and deployment workflows.
- Implement and maintain observability capabilities across production systems, including metrics, logs, traces, and dashboards.
- Develop dashboards, alerts, and operational views that provide real-time visibility into system health and application behavior.
- Act as a key responder during incidents, collaborating across teams to troubleshoot, mitigate, and resolve production issues.
- Conduct root cause analysis for incidents and drive long-term corrective actions to prevent recurrence.
- Run fire drills, game days, and chaos engineering exercises to validate system resilience under failure conditions.
- Monitor resource usage, capacity trends, and scaling behavior to support future growth and performance needs.
- Partner with security teams to ensure services align with security best practices, including secure communication, access controls, and data protection.
- Lead or contribute to regular production readiness and operational review meetings to assess system health, review incidents, and prepare for releases.
- Promote reliability engineering best practices across teams and help strengthen the overall operational maturity of the organization.
What you will bring to the role:
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Proven experience in Site Reliability Engineering, Platform Engineering, or Infrastructure/DevOps roles with strong operational ownership.
- Strong expertise in application monitoring, observability platforms, incident response, and troubleshooting in production environments.
- Strong understanding of reliability engineering concepts such as SLIs, SLOs, error budgets, alerting quality, and incident management.
- Proficiency in scripting and automation using tools and languages such as Python, Bash, Terraform, Ansible, or similar.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud.
- Strong knowledge of CI/CD pipelines, deployment automation, and progressive delivery practices.
- Strong knowledge of infrastructure as code and configuration management approaches.
- Experience with containerization and orchestration, such as Docker and Kubernetes.
- Strong problem-solving skills, operational judgment, and attention to detail.
- Excellent communication and collaboration skills, with the ability to work effectively across engineering, platform, and security teams.
- Experience with chaos engineering practices and tools.
- Experience supporting internal platforms or platform engineering teams.
- Familiarity with developer portals, golden paths, service catalogs, or self-service platform patterns.
- Understanding of developer experience metrics and operational maturity for internal platforms.
- Familiarity with microservices architectures and multi-tenant environments.
- Experience with modern observability stacks and telemetry standards.
- Understanding of UCaaS and CCaaS platforms, especially voice and communication service flows.
- Experience leading reliability initiatives, incident reviews, or production improvement programs.
- Familiarity with capacity planning, resilience testing, and operational readiness practices.
Diversity, Inclusion, and Equal Opportunity
Posted by Intermedia Intelligent Communications.