Infra Support Engineer

Fuku

Location: Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia (On-site)
Posted: 2mo ago
Employment: Full-time

About the role

Infra Support Engineer – GMI Global Infrastructure Team

Preferred Locations:
- Taiwan
- Malaysia

Responsibilities:
- Provide first- and second-line technical support to customers for AI infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, email, Slack, or other messaging platforms.
- Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
- Monitor system health and service-level indicators using alerts and dashboards; respond to alerts as part of scheduled 24x7 coverage.
- Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
- Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.
- Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.
- Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.
- Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.
- Help maintain and improve standard operating procedures (SOPs), run periodic runbook validation, and document new procedures.
- Work collaboratively with developers and SRE teams to improve system reliability.

Qualifications:
- Bachelor’s degree in Computer Science or a related field.
- More than 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.
- Hands-on Linux experience, including shell, kernel, and log management.
- Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.
- Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and Alertmanager.
- Experience with NVIDIA GPU infrastructure and Kubernetes.
- Comfortable collecting diagnostics, reading logs, and interpreting traces.
- Strong troubleshooting mindset and ability to follow runbooks under pressure.
- Excellent written and verbal communication skills for customer-facing incident handling.
- Willingness to work shifts and participate in on-call rotations.
- Bilingual in English and Chinese.
