- Employment: Full-time
About the role
Key Responsibilities
- Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root-cause analysis, postmortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)
Skills, Knowledge and Expertise
- Bachelor's degree in Computer Science or a related field
- Experience in SRE / Operations / DevOps with production incident ownership
- Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
- Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
- Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
- Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
- Experience with Git-based workflows for monitoring-as-code and configuration management
- Grafana administration and dashboard design standards
- Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
- Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
- Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx
- Experience with Windows clustering or HA environments
- Experience defining SLOs/SLIs and operational KPIs
- Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
- Experience with load balancing components (F5 LTM, F5 GTM)
- Experience with virtualization platforms such as VMware or Hyper-V
- Experience administering AWS or Azure tenants
- Participation in a rotating on-call schedule (including nights/weekends as needed)
- Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
- Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR
Diversity, Inclusion, and Equal Opportunity