Software Engineer, Runtime

FuriosaAI
Seoul
Employment: Full-time

About the Job

Designs and implements the low-level runtime stack that drives FuriosaAI's NPU hardware to its theoretical limits — from device driver interfaces and DMA-based I/O to kernel execution scheduling, multi-node inference, and embedded firmware.

Responsibilities

  • Develops the low-level runtime responsible for DMA-based I/O operations and kernel execution scheduling, maximizing inference throughput while minimizing end-to-end latency.

  • Builds and optimizes asynchronous execution pipelines that orchestrate data movement and compute across the NPU hardware.

  • Enables multi-node inference by implementing foundational communication primitives, including RDMA-based data transfer for low-latency, high-bandwidth inter-node operations.

  • Develops embedded firmware (PERT) that runs on the NPU's integrated ARM core, managing on-device scheduling, synchronization, and hardware resource control.

  • Profiles and tunes system-level performance across the full runtime stack — from firmware to user-space — to eliminate bottlenecks in real-world inference workloads.

Minimum Qualifications

  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent work experience.

  • Strong communication skills for cross-team requirement gathering and technical alignment.

  • 3+ years of systems programming experience in Rust, C, or C++.

  • Solid understanding of computer architecture fundamentals: memory hierarchy, cache coherency, OS, DMA, interrupts, and MMIO.

Preferred Qualifications

  • Deep expertise in low-latency runtime systems, embedded firmware development, or high-performance I/O — especially in the context of accelerator hardware.

  • Experience designing and implementing low-latency asynchronous execution models and scheduling systems.

  • Experience with DMA engines, scatter-gather I/O, or other zero-copy data transfer mechanisms.

  • Experience developing embedded firmware for ARM-based processors (bare-metal or lightweight RTOS environments).

  • Familiarity with RDMA technologies and high-performance networking for distributed or multi-node systems.

  • Experience with CUDA low-level runtime internals such as CUDA Graphs, stream-based execution, and asynchronous kernel launch optimization.

  • Experience with kernel-level performance optimizations (e.g., Linux kernel modules, eBPF, perf, ftrace).

  • Understanding of deep learning inference workloads and their hardware execution characteristics.

  • Experience with profiling and performance tuning of system software on accelerator or SoC platforms.

Contact

  • recruit@furiosa.ai
