Software Engineer, Runtime

FuriosaAI

Seoul

Employment: Full-time

About the role

Designs and implements the low-level runtime stack that drives FuriosaAI's NPU hardware to its theoretical limits — from device driver interfaces and DMA-based I/O to kernel execution scheduling, multi-node inference, and embedded firmware.

Responsibilities

Develops the low-level runtime responsible for DMA-based I/O operations and kernel execution scheduling, maximizing inference throughput while minimizing end-to-end latency.
Builds and optimizes asynchronous execution pipelines that orchestrate data movement and compute across the NPU hardware.
Enables multi-node inference by implementing foundational communication primitives, including RDMA-based data transfer for low-latency, high-bandwidth inter-node operations.
Develops embedded firmware (PERT) that runs on the NPU's integrated ARM core, managing on-device scheduling, synchronization, and hardware resource control.
Profiles and tunes system-level performance across the full runtime stack — from firmware to user-space — to eliminate bottlenecks in real-world inference workloads.

Minimum Qualifications

Bachelor's degree in Computer Science, Electrical Engineering, or equivalent work experience.
Strong communication skills for cross-team requirement gathering and technical alignment.
3+ years of systems programming experience in Rust, C, or C++.
Solid understanding of computer architecture fundamentals: memory hierarchy, cache coherency, OS, DMA, interrupts, and MMIO.

Preferred Qualifications

Deep expertise in low-latency runtime systems, embedded firmware development, or high-performance I/O — especially in the context of accelerator hardware.
Experience designing and implementing low-latency asynchronous execution models and scheduling systems.
Experience with DMA engines, scatter-gather I/O, or other zero-copy data transfer mechanisms.
Experience developing embedded firmware for ARM-based processors (bare-metal or lightweight RTOS environments).
Familiarity with RDMA technologies and high-performance networking for distributed or multi-node systems.
Experience with CUDA low-level runtime internals such as CUDA Graphs, stream-based execution, and asynchronous kernel launch optimization.
Experience with kernel-level performance optimizations (e.g., Linux kernel modules, eBPF, perf, ftrace).
Understanding of deep learning inference workloads and their hardware execution characteristics.
Experience with profiling and performance tuning of system software on accelerator or SoC platforms.

Contact

recruit@furiosa.ai
