About the role
Responsibilities:
- Expand and enhance our Grafana/Prometheus monitoring solution.
- Consolidate logs, metrics, and system health data for actionable insights and streamlined troubleshooting.
- Configure automated alerts based on predefined thresholds and anomaly detection to ensure rapid incident response.
- Diagnose and resolve infrastructure incidents between 9 AM and 9 PM EDT, leveraging monitoring tools and system logs.
- Implement corrective actions and preventive measures to avoid recurrence.
- Analyze and optimize database queries, indexing, and partitioning strategies for enhanced performance and scalability.
- Regularly inspect database tables, identifying areas for improvement and recommending necessary maintenance activities.
- Monitor database usage trends to predict and proactively address scaling needs, preventing performance issues.
- Improve platform-wide security monitoring with real-time analytics and automated anomaly detection to quickly identify and respond to threats.
- Use security tools to simulate realistic attack scenarios and uncover vulnerabilities.
- Conduct ongoing vulnerability assessments and automated penetration testing.
- Strengthen and document incident response procedures, ensuring clear cross-team communication and swift incident remediation.
- Develop and maintain robust CI/CD pipelines for efficient code integration, testing, and deployment.
- Implement and integrate comprehensive testing frameworks, including unit, integration, and end-to-end tests, ensuring high-quality code delivery.
- Collaborate with teams to enforce industry-standard security checks and continuous monitoring across the software delivery lifecycle.
Qualifications:
- Extensive experience in backend infrastructure operations, including monitoring, incident management, database optimization, and security.
- Strong proficiency with Grafana, Prometheus, PostgreSQL (Aurora), and CI/CD pipeline tools.
- Proven ability to implement proactive security measures and conduct continuous assessments.
- Excellent problem-solving and incident management skills.
- Strong collaboration and communication skills, including cross-team coordination and documentation.
- Availability during core operational hours (9 AM to 9 PM EDT).
About the company
A16Z Speedrun