Site Reliability Engineer (SRE)
finnomena
- Employment: Full-time
About the Role
Are you a passionate Site Reliability Engineer who thrives at the intersection of software engineering and operations? Do you have a proven track record of building resilient systems, driving down toil, and turning complex incidents into lasting improvements? If so, we want to hear from you!
We're looking for a skilled and experienced SRE to join our team and champion a culture of reliability excellence. You'll play a pivotal role in designing scalable infrastructure, partnering with developers to bake reliability into every stage of the software lifecycle, and ensuring our systems are observable, fault-tolerant, and always ready for whatever comes next.
Responsibilities
- Design, implement, and maintain scalable, highly available infrastructure and platform services.
- Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to drive reliability decisions.
- Lead incident response, root cause analysis, and post-mortems, converting learnings into actionable reliability improvements.
- Collaborate with development teams to embed reliability practices — including capacity planning, load testing, and fault injection — into the development lifecycle.
- Build and maintain robust observability pipelines covering metrics, logging, distributed tracing, and alerting.
- Identify and systematically eliminate toil through automation, tooling, and process improvements.
- Participate in on-call rotations, maintaining and continuously improving runbooks and escalation procedures.
- Foster a culture of reliability and operational excellence across engineering teams through documentation, training, and knowledge sharing.
Qualifications
- 3+ years of experience as a Site Reliability Engineer or Platform Engineer, or in a related role.
- Strong software engineering fundamentals with proficiency in at least one scripting or programming language (e.g., Python, Go, Bash).
- Hands-on experience with Kubernetes (GKE, kubectl, Helm) and containerization (Docker).
- Experience with CI/CD pipelines and infrastructure-as-code tools (e.g., Terraform, Ansible, GitLab CI, Jenkins).
- Solid understanding of SRE principles including SLOs, error budgets, and the balance between reliability and velocity.
- Experience with observability tooling (e.g., Prometheus, Grafana, Datadog, ELK stack, OpenTelemetry).
- Strong incident management skills, including structured on-call practices and blameless post-mortem culture.
- Good communication, collaboration, and problem-solving skills with the ability to work across cross-functional teams.
Bonus Points
- Cloud or Kubernetes certification (e.g., GCP Professional Cloud DevOps Engineer, AWS Certified DevOps Engineer, CKA/CKAD).
- Experience with Google Cloud Platform (GCP) services and architecture.
- Familiarity with chaos engineering practices and tooling (e.g., Chaos Monkey, LitmusChaos).
What We Offer
- The opportunity to work on cutting-edge infrastructure and make a real impact on our systems' reliability and scale.
- A collaborative and supportive work environment with a strong focus on learning and development.
- Hybrid working environment.
- Competitive compensation and benefits package.
- The chance to be part of a team that is passionate about engineering excellence and innovation.
If you're passionate about building reliable, scalable systems and want your work to directly impact how people grow their wealth, we encourage you to apply.