Site Reliability Engineer
Snapp
- Employment: Full-time, permanent
About the role
In this role, you will strengthen the SRE Platform team’s mission by advancing the foundational platforms that automate manual workflows and elevate system reliability. Your work will ensure our staging environments remain stable and production-like, empowering QA and development teams to test, validate, and deploy their applications with confidence. You will also contribute to operational excellence through active participation in the weekly on-call rotation, supporting consistent and dependable infrastructure performance.
Responsibilities
Automate and optimize operational processes
Enhance and maintain the observability stack
Manage test and staging environments
Develop and support critical production components
Handle and resolve production incidents
Participate in the on-call rotation
Requirements
Strong teamwork and collaboration skills
Solid understanding of SRE concepts, including SLIs, SLOs, SLAs, and error budgets
Proficiency in Python or another scripting language
Strong grasp of software engineering principles
Hands-on experience with observability and monitoring tools such as Prometheus and Grafana
Familiarity with logging stacks (e.g., ELK, Loki) and tracing systems (e.g., Jaeger, Tempo)
Understanding of RDBMS and Redis
Experience working with Kubernetes and related tooling (e.g., Helm)