Site Reliability Engineer

Intermedia Intelligent Communications

Georgia | Remote | 3mo ago

Employment: Full-time

About the role

Key Responsibilities

Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
Create and maintain dashboards and operational visibility (Grafana or equivalent)
Develop and maintain runbooks, operational playbooks, and incident response procedures
Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
Perform root-cause analysis, postmortems, and implement corrective/preventive actions
Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

Skills, Knowledge and Expertise

Bachelor's degree in Computer Science or a related field
Experience in SRE / Operations / DevOps with production incident ownership
Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
Experience with Git-based workflows for monitoring-as-code and configuration management

Grafana administration and dashboard design standards
Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx
Experience with Windows clustering or HA environments
Experience defining SLOs/SLIs and operational KPIs
Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
Experience with load balancing components (F5 LTM, F5 GTM)
Experience with virtualization platforms such as VMware or Hyper-V
Experience administering AWS or Azure tenants

Participation in a rotating on-call schedule (including nights/weekends as needed)
Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR

Diversity, Inclusion, and Equal Opportunity
