Site Reliability Engineer

Intermedia Intelligent Communications
Georgia · Remote · 3mo ago
Employment
Full-time

About the role

Key Responsibilities

  • Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
  • Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
  • Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
  • Create and maintain dashboards and operational visibility (Grafana or equivalent)
  • Develop and maintain runbooks, operational playbooks, and incident response procedures
  • Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
  • Perform root-cause analysis, postmortems, and implement corrective/preventive actions
  • Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
  • Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
  • Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

Skills, Knowledge and Expertise

  • Bachelor's degree in Computer Science or a related field
  • Experience in SRE / Operations / DevOps with production incident ownership
  • Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
  • Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
  • Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
  • Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
  • Experience with Git-based workflows for monitoring-as-code and configuration management
  • Grafana administration and dashboard design standards
  • Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
  • Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
  • Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx
  • Experience with Windows clustering or HA environments
  • Experience defining SLOs/SLIs and operational KPIs
  • Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
  • Experience with load balancing components (F5 LTM, F5 GTM)
  • Experience with virtualization platforms such as VMware or Hyper-V
  • Experience administering AWS or Azure tenants

  • Participation in a rotating on-call schedule (including nights/weekends as needed)
  • Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
  • Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR

Diversity, Inclusion, and Equal Opportunity
