GPU Systems Engineer 4
Base-2 Solutions, LLC
Bethesda (posted 1w ago)
About the role
Position Summary
Support enterprise AI mission systems by designing, developing, and optimizing GPU clusters, with deep focus on operating systems, hardware, GPU platforms, and high-speed networking in a secure customer environment.
Essential Duties and Responsibilities
- Design, configure, and maintain GPU clusters.
- Collaborate with a multidisciplinary team to define and optimize architectures for performance, power efficiency, and required features.
- Work closely with AI/ML engineers to integrate GPUs with Linux-based systems.
- Optimize GPU drivers for compatibility, reliability, and performance.
- Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers.
- Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments.
- Leverage Bash, Python, Ansible, Puppet, and Salt for tooling and automation.
- Maintain technical documentation, architectural specifications, and Linux best practices.
- Support Authority to Operate (ATO) activities and ensure compliance with federal security standards.
Required Qualifications
- Active TS/SCI with ability to obtain a CI Polygraph.
- Bachelor's degree with a minimum of ten years of experience in the category field.
- Experience managing NVIDIA GPU data center platforms, including DGX, HGX, H200, H100, and L4s.
- Knowledge of enterprise server components, including storage/network controllers, HBAs, and SSDs.
- Strong expertise with Linux distributions, including RHEL, Ubuntu, Oracle Linux, and Rocky Linux.
- Excellent problem-solving skills and the ability to collaborate within a team.
- Meet DoD 8570.01-M IAT Level II certification requirements at a minimum; IAT Level III is also acceptable.
- U.S. citizenship is required due to the nature of the government contracts supported.
Preferred Qualifications
- Experience with Kubernetes cluster management and AI/ML workflow orchestration, including Argo, Airflow, and Kubeflow.
- Familiarity with GPU virtualization and cloud computing.
- Experience with Prometheus and Grafana for monitoring.
- Knowledge of distributed resource scheduling systems such as Slurm, LSF, or similar tools.
Required Education and Experience Equivalency
Required Certifications
- DoD 8570.01-M IAT Level II certification: Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP.
Required Security Clearance
- Active TS/SCI with ability to obtain a CI Polygraph.