DevOps & SysOps Architect

Integrant

Egypt | Remote | 1mo ago

About the role

We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while co-creating opportunities with our commercial teams.

 

This role combines SysOps and HPC depth with DevOps breadth. You are expected to spend at least 60% of your time on implementation and technical execution.

What You Will Do

Presales & Business Development

•       Partner with sales and solution teams to identify and qualify new opportunities

•       Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations

•       Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients

•       Prepare high-quality technical materials

•       Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals

 

In-Account Delivery — SysOps & DevOps Execution

•       Operate directly within client accounts as a senior SysOps/DevOps engineer

•       Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on

•       Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling

•       Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems

•       Serve as the senior escalation point for complex operational incidents within accounts

 

Architecture & Solution Design

•       Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments

•       Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms

•       Recommend and validate technology choices aligned to client scale, budget, and team maturity

•       Produce architecture decision records (ADRs), solution blueprints, and technical runbooks

Technical Competencies & Requirements

1. Architecture & System Design

•       Design production-grade multi-cluster Kubernetes platforms:

◦       RKE2, EKS (AWS), AKS (Azure) at enterprise scale

◦       GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools

◦       Hybrid cloud + on-premises HPC infrastructure

•       Define and document:

◦       Workload isolation: namespaces, MIG partitioning, multi-tenancy models

◦       Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)

◦       Storage: Longhorn, Ceph, distributed and high-throughput file systems

 

2. Platform Engineering & GitOps Strategy

•       Define and enforce platform standards across the delivery lifecycle

•       GitOps tooling: ArgoCD, Fleet — declarative cluster management

•       CI/CD pipelines: Azure DevOps, Jenkins — build, test, promote

•       Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible

•       Standardize cluster bootstrapping, app deployment lifecycle, environment promotion (Dev → QA → Prod)

 

3. AI / GPU Infrastructure Architecture (Priority Competency)

•       Design and operate GPU compute platforms at scale:

◦       GPU Operator deployment and lifecycle management

◦       MIG (Multi-Instance GPU) partitioning for multi-tenant workloads

◦       Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)

•       Understand AI workload classes and their infrastructure implications:

◦       Distributed training workloads (data/model/pipeline parallelism)

◦       Inference pipelines — NVIDIA Triton Inference Server, TensorRT optimization

•       Align infrastructure to the full AI stack:

◦       CUDA stack, cuDNN, NCCL collective communication libraries

◦       High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA

◦       GPUDirect RDMA / GPUDirect Storage for low-latency data paths

 

4. Observability & Reliability Engineering

•       Define and implement full-stack observability:

◦       Metrics: Prometheus, Thanos (long-term retention, multi-cluster)

◦       Logs: Loki, Fluent Bit

◦       GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems

•       Build operational frameworks:

◦       SLO / SLA definitions and error budget tracking

◦       Alerting strategy — noise reduction, severity routing

◦       Incident response playbooks and on-call runbooks

 

5. Security & Multi-Tenancy Architecture

•       Design zero-trust security postures for multi-tenant platforms

•       Secret management: HashiCorp Vault, External Secrets Operator

•       Identity and access: IAM, RBAC, SSO/OIDC integration

•       Network isolation: NetworkPolicy, micro-segmentation, mTLS

•       Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement

 

6. HPC, Data & Storage Architecture (Priority Competency)

•       Understand high-performance storage for AI/HPC workloads:

◦       GPUDirect Storage — bypassing CPU for GPU-native I/O

◦       Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)

◦       Storage tiering, caching strategies, and data lifecycle management

•       Size and validate storage architectures against workload I/O profiles

 

7. Operational Leadership & Linux Systems

•       Lead incident response and root cause analysis (RCA) for critical production issues

•       Define upgrade strategies, change management procedures, and disaster recovery plans

•       Write and maintain runbooks, operational playbooks, and knowledge base content

•       Integrate organizational processes, compliance requirements, and security policies into operational frameworks

•       Deep Linux expertise:

◦       Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)

◦       Storage I/O scheduling, NVMe optimization

◦       Network stack tuning for RDMA / InfiniBand

◦       System performance profiling and bottleneck analysis

 

Candidate Profile — Who You Are

•       You are comfortable running production systems

•       You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity

•       You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment

•       You communicate technical complexity clearly — to engineers and to C-level stakeholders

•       You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations

•       You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions

•       You thrive in ambiguity and can scope both short POCs and long-horizon platform programs

Requirements

Required

•       10+ years in platform/infrastructure engineering, with at least 2 years in an architect-level role

•       Proven hands-on experience operating Kubernetes at scale in production (multi-cluster, multi-tenant)

•       Significant Linux systems administration experience — kernel, networking, storage at a low level

•       HPC and/or GPU infrastructure experience — physical GPU servers, NCCL, InfiniBand, or high-speed fabrics

•       Demonstrable presales or client-facing experience

•       IaC experience: Terraform and/or Ansible in production environments

•       Strong understanding of GitOps and CI/CD pipelines in enterprise settings

 

Strongly Preferred

•       Experience with NVIDIA GPU Operator, MIG partitioning, Run:AI, or equivalent GPU scheduling tooling

•       Knowledge of distributed AI training infrastructure (PyTorch DDP, Horovod, DeepSpeed) from an infrastructure perspective

•       Familiarity with NVIDIA Triton Inference Server or TensorRT deployment pipelines

•       Experience with Weka, Ceph, or GPUDirect Storage in HPC/AI environments

•       Hands-on experience with Vault, External Secrets, and zero-trust network architectures

•       Exposure to bare-metal provisioning and HPC cluster management (Slurm, PBS, or equivalent)

 

Certifications (Advantageous)

•       CKA / CKS (Certified Kubernetes Administrator / Security Specialist)

•       RHCE / RHCA (Red Hat Certified Engineer / Architect)

•       AWS Solutions Architect / Azure Solutions Architect Expert

•       HashiCorp Terraform Associate or Vault Associate

•       NVIDIA DLI certifications (GPU computing, AI infrastructure)

Benefits

Why Integrant?

  • Competitive compensation package
  • PTO, full medical and dental coverage, etc.
  • Opportunity to travel and work onsite with U.S. customers
  • In-house Technical and English training programs
  • Dedicated learning time (check out our 4Plus1 Program)
  • Interest-free loans
  • Flexible work schedules
  • Perks: events, sponsored lunch, game area, rooftop hangout + more!

