Back to all jobs

- Employment
- Contract
About the role
#HPC #AI #GPU #CLUSTERS YOUR DAILY ROUTINE - Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents- Troubleshoot high-impact production issues in collaboration with other engineering teams - Participate in an on-call rotation to handle incidents and ensure service continuity - Implement and maintain observability solutions to monitor AI infrastructure and application health - Contribute to AI infrastructure lifecycle management across different e
Read the full description on the original posting.
731,000+ hidden jobs like this
MARGO and thousands of companies post here first — often days before LinkedIn or Indeed. Your first 5 applications are free; go Pro to apply without limits.
Everything Pro unlocks:
- Unlimited applications — free stops at 5
- Track every application in one place
- Apply straight to the source, one click
- Save & organize roles you love
- Roles pulled from company boards before the big sites