- Employment: Full-time
- Seniority: Lead
About the role
- Integration APIs and microservices
- Multi-region Kubernetes clusters (cloud & on‑prem)
- CI/CD and GitOps tooling (GitHub-centric)
- Internal Developer Platform (IDP) components
- Critical shared infrastructure (RTE, Redis, Harbor, Consul, Chart Museum, API Gateway, etc.)
What you will be doing:
- Lead and grow a small DevOps engineering team supporting Unite Services and Unite Integrations.
- Drive team planning and execution across operational efficiency, security, reliability, and platform maturity epics (e.g., quarterly OKRs/initiatives).
- Provide technical mentorship on Kubernetes, GitHub, CI/CD, and cloud infrastructure.
- Collaborate closely with Engineering, SRE, Security, and Ops on roadmap, incident resolution, and cross-team initiatives.
- Own the operational health of Unite Services and Unite Integrations, leading production deployments and release processes for backend services (e.g., Unite Notifications, integrations platform, Salesforce-related services).
- Oversee multi-environment support (DEV/QA/PROD and special environments like RTE).
- Operate and evolve Kubernetes clusters (cloud and on‑prem), including:
  - Cluster upgrades and migration playbooks
  - Ingress/Nginx and service mesh / networking changes
  - Resource/capacity optimization (CPU throttling, scaling, etc.)
- Manage supporting infrastructure for Unite platforms:
  - Redis, Consul, Harbor, Chart Museum, API gateways, message queues (e.g., RabbitMQ)
- Contribute to and adopt Internal Developer Platform (IDP) capabilities (e.g., API Gateway, Apache Flink management, Crossplane-based infrastructure).
- Lead migration and consolidation of repos and pipelines to GitHub (from TFS/Azure DevOps and others).
- Design, implement, and maintain GitHub-based CI/CD:
  - GitHub Actions workflows
  - Self-hosted GitHub runners (including K8s-based and Windows runners)
  - Versioning and branching strategies
- Introduce and maintain GitOps practices for infrastructure and application deployments.
- Automate repetitive tasks and operational workflows via scripting and infrastructure-as-code.
- Drive security initiative epics (e.g., ransomware preparedness, vulnerability remediation such as the Redis Lua RCE, and tagging in Qualys/CrowdStrike).
- Work with Security to design and implement:
  - Rate limiting, WAF rules, and API protection (e.g., Cloudflare for SCIM and public APIs)
  - Secrets management and secure configuration
- Define and execute disaster recovery and ransomware recovery strategies for UST / Unite Services / Integrations.
- Improve reliability and quality of platforms via SLIs/SLOs, operational runbooks, and post-incident improvements.
- Own observability for Unite Services and Unite Integrations:
  - Metrics, logs, traces (e.g., Prometheus, Grafana, ELK, etc.)
  - Dashboards for capacity (Kubernetes CPU usage, throttling, etc.) and error budgets
- Lead incident response for platform issues:
  - Triage, mitigation, communication, and follow-up actions
  - Coordination with product/engineering teams during outages
- Create and maintain detailed runbooks and recovery procedures, especially for:
  - RTE environment, K8s clusters, Harbor, Consul, Chart Museum
  - Disaster recovery and ransomware scenarios
- Produce and maintain clear, actionable documentation:
  - Infrastructure design and configuration
  - CI/CD and GitOps processes
  - Recovery and maintenance procedures
  - RFCs for new platform capabilities (e.g., API Gateway, Apache Flink)
- Contribute to process improvements in CI/CD, deployment strategies, and developer experience.
- Help define and maintain product maturity models for Unite Integrations and related platforms.
What you will bring to the role:
- Strong background in platform / DevOps engineering for complex distributed systems.
- Deep understanding of infrastructure components: compute, networking, storage, DNS, certificates, load balancing, and how they interact.
- Kubernetes & Containers
  - Deep knowledge of Kubernetes (cluster operations, upgrades, scaling, security, ingress, RBAC).
  - Experience with Docker/containerization in production.
- Cloud & On‑Prem
  - Solid experience with at least one major cloud provider (Azure preferred; AWS or GCP a plus).
  - Comfort working in hybrid environments (cloud + on‑prem).
- CI/CD & GitHub
  - Strong experience designing and running CI/CD pipelines.
  - Hands-on with GitHub Actions and self-hosted runners (including scaling and security).
  - Experience with other CI/CD tools (Azure DevOps, GitLab, ArgoCD) is a plus.
- Infrastructure as Code & Automation
  - Proficiency with IaC tools (Terraform, Crossplane, Helm, etc.).
  - Strong scripting/coding skills (e.g., Python, Go, Bash, PowerShell) to automate complex operational workflows.
- Config Management
  - Experience with tools like Ansible, Chef, or similar for configuration management and automation.
- Observability & Monitoring
  - Experience with Prometheus, Grafana and centralized logging stacks (e.g., ELK, Splunk or similar).
  - Ability to build meaningful dashboards and alerts, and interpret telemetry to drive improvements.
- Security & Compliance
  - Understanding of security best practices across network, infrastructure, and application layers.
  - Experience with WAFs, rate limiting, and vulnerability management tools (e.g., Qualys, CrowdStrike) is a plus.
- Proven experience leading DevOps / platform teams or being a strong technical lead.
- Ability to mentor engineers and raise the technical bar for the team.
- Experience leading small/medium-sized projects using Agile/Scrum or similar methodologies.
- Strong communication skills, able to translate technical constraints into clear options and decisions for stakeholders.
- Demonstrated ability to own outcomes: from design through implementation, rollout, and ongoing operations.
- Strong troubleshooting skills and the ability to stay calm under pressure during incidents.
- Experience defining or implementing Internal Developer Platforms.
- Experience with API Gateway and stream processing platforms (e.g., Apache Flink).
- Background in ransomware preparedness, disaster recovery planning, or business continuity for critical systems.
- Experience operating systems at multi-region or multi-tenant scale.
Diversity, Inclusion, and Equal Opportunity