Site Reliability Engineer - IDP

Intermedia Intelligent Communications

Location: Worldwide, Remote

Posted: 1mo ago

Employment: Full-time

About the role

What you will be doing:

Ensure the availability, performance, and reliability of critical applications and services by designing and implementing robust monitoring, alerting, and optimization strategies.
Define, measure, and maintain SLIs, SLOs, and error budgets to support service reliability goals.
Partner with development teams to improve performance, reduce latency, and increase the resilience of applications in production.
Work closely with platform and DevOps teams to ensure smooth alignment between infrastructure and application reliability.
Define reliability standards and operational guardrails for platform capabilities and golden paths.
Partner with platform engineering teams to design resilient self-service capabilities.
Automate operational tasks such as deployments, rollbacks, scaling, failover, and recovery processes.
Continuously improve CI/CD pipelines to reduce manual intervention and support safe, progressive delivery practices.
Integrate automated validation, reliability checks, and operational guardrails into development and deployment workflows.
Implement and maintain observability capabilities across production systems, including metrics, logs, traces, and dashboards.
Develop dashboards, alerts, and operational views that provide real-time visibility into system health and application behavior.
Act as a key responder during incidents, collaborating across teams to troubleshoot, mitigate, and resolve production issues.
Conduct root cause analysis for incidents and drive long-term corrective actions to prevent recurrence.
Run fire drills, game days, and chaos engineering exercises to validate system resilience under failure conditions.
Monitor resource usage, capacity trends, and scaling behavior to support future growth and performance needs.
Partner with security teams to ensure services align with security best practices, including secure communication, access controls, and data protection.
Lead or contribute to regular production readiness and operational review meetings to assess system health, review incidents, and prepare for releases.
Promote reliability engineering best practices across teams and help strengthen the overall operational maturity of the organization.

What you will bring to the role:

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Proven experience in Site Reliability Engineering, Platform Engineering, or Infrastructure/DevOps roles with strong operational ownership.
Strong expertise in application monitoring, observability platforms, incident response, and troubleshooting in production environments.
Strong understanding of reliability engineering concepts such as SLIs, SLOs, error budgets, alerting quality, and incident management.
Proficiency in scripting and automation using tools and languages such as Python, Bash, Terraform, Ansible, or similar.
Experience with cloud platforms such as AWS, Azure, or Google Cloud.
Strong knowledge of CI/CD pipelines, deployment automation, and progressive delivery practices.
Strong knowledge of infrastructure as code and configuration management approaches.
Experience with containerization and orchestration, such as Docker and Kubernetes.
Strong problem-solving skills, operational judgment, and attention to detail.
Excellent communication and collaboration skills, with the ability to work effectively across engineering, platform, and security teams.

Preferred qualifications:

Experience with chaos engineering practices and tools.
Experience supporting internal platforms or platform engineering teams.
Familiarity with developer portals, golden paths, service catalogs, or self-service platform patterns.
Understanding of developer experience metrics and operational maturity for internal platforms.
Familiarity with microservices architectures and multi-tenant environments.
Experience with modern observability stacks and telemetry standards.
Understanding of UCaaS and CCaaS platforms, especially voice and communication service flows.
Experience leading reliability initiatives, incident reviews, or production improvement programs.
Familiarity with capacity planning, resilience testing, and operational readiness practices.

Diversity, Inclusion, and Equal Opportunity
