Site Reliability Engineer - IDP

Intermedia Intelligent Communications
Worldwide · Remote · 1mo ago
Employment
Full-time

About the role

What you will be doing:

  • Ensure the availability, performance, and reliability of critical applications and services by designing and implementing robust monitoring, alerting, and optimization strategies. 
  • Define, measure, and maintain SLIs, SLOs, and error budgets to support service reliability goals. 
  • Partner with development teams to improve performance, reduce latency, and increase the resilience of applications in production. 
  • Work closely with platform and DevOps teams to ensure smooth alignment between infrastructure and application reliability. 
  • Define reliability standards and operational guardrails for platform capabilities and golden paths. 
  • Partner with platform engineering teams to design resilient self-service capabilities. 
  • Automate operational tasks such as deployments, rollbacks, scaling, failover, and recovery processes. 
  • Continuously improve CI/CD pipelines to reduce manual intervention and support safe, progressive delivery practices. 
  • Integrate automated validation, reliability checks, and operational guardrails into development and deployment workflows. 
  • Implement and maintain observability capabilities across production systems, including metrics, logs, traces, and dashboards. 
  • Develop dashboards, alerts, and operational views that provide real-time visibility into system health and application behavior. 
  • Act as a key responder during incidents, collaborating across teams to troubleshoot, mitigate, and resolve production issues. 
  • Conduct root cause analysis for incidents and drive long-term corrective actions to prevent recurrence. 
  • Run fire drills, game days, and chaos engineering exercises to validate system resilience under failure conditions. 
  • Monitor resource usage, capacity trends, and scaling behavior to support future growth and performance needs. 
  • Partner with security teams to ensure services align with security best practices, including secure communication, access controls, and data protection. 
  • Lead or contribute to regular production readiness and operational review meetings to assess system health, review incidents, and prepare for releases. 
  • Promote reliability engineering best practices across teams and help strengthen the overall operational maturity of the organization. 

What you will bring to the role:

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience. 
  • Proven experience in Site Reliability Engineering, Platform Engineering, or Infrastructure/DevOps roles with strong operational ownership.
  • Strong expertise in application monitoring, observability platforms, incident response, and troubleshooting in production environments. 
  • Strong understanding of reliability engineering concepts such as SLIs, SLOs, error budgets, alerting quality, and incident management. 
  • Proficiency in scripting and automation using tools and languages such as Python, Bash, Terraform, Ansible, or similar. 
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud. 
  • Strong knowledge of CI/CD pipelines, deployment automation, and progressive delivery practices. 
  • Strong knowledge of infrastructure as code and configuration management approaches. 
  • Experience with containerization and orchestration, such as Docker and Kubernetes. 
  • Strong problem-solving skills, operational judgment, and attention to detail. 
  • Excellent communication and collaboration skills, with the ability to work effectively across engineering, platform, and security teams. 
  • Experience with chaos engineering practices and tools. 
  • Experience supporting internal platforms or platform engineering teams. 
  • Familiarity with developer portals, golden paths, service catalogs, or self-service platform patterns. 
  • Understanding of developer experience metrics and operational maturity for internal platforms. 
  • Familiarity with microservices architectures and multi-tenant environments. 
  • Experience with modern observability stacks and telemetry standards. 
  • Understanding of UCaaS and CCaaS platforms, especially voice and communication service flows. 
  • Experience leading reliability initiatives, incident reviews, or production improvement programs. 
  • Familiarity with capacity planning, resilience testing, and operational readiness practices. 

Diversity, Inclusion, and Equal Opportunity
