Research Fellowship - Mechanistic Interpretability

vmax

San Francisco · 1mo ago

About the role

<h2><strong>About <em>V<sub>max</sub></em></strong></h2> <p><em>V<sub>max</sub></em> is an applied research lab developing AI capable of open-ended learning. We are building systems to exceed humans in all capacities by optimizing beyond the local maxima of learning from human expertise.</p> <h2>About the role</h2> <p>LLMs are fantastically powerful, and there is a rapidly growing corpus of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.</p> <p>This 3-to-6-month fellowship is for PhD students or equivalent early-career researchers who want to work at the intersection of mechanistic interpretability and reinforcement learning. You will own a focused research project, work closely with <em>V<sub>max</sub></em> technical staff, and contribute to research publications.</p> <h2 data-section-id="r8dte7" data-start="1137" data-end="1156">Responsibilities</h2> <ul data-start="1158" data-end="2415"> <li data-section-id="15ssf1m" data-start="1158" data-end="1316">Develop mechanistic interpretability methods for understanding internal representations, features, circuits, and computations in language models and agents.</li> <li data-section-id="161scmh" data-start="1317" data-end="1476">Investigate how model internals can be used to generate intrinsic rewards, auxiliary objectives, diagnostics, or training signals for reinforcement learning.</li> <li data-section-id="19nvjxr" data-start="1477" data-end="1637">Design and run experiments that test whether interpretability-derived signals improve learning, exploration, generalization, robustness, or sample efficiency.</li> <li data-section-id="i8nt98" data-start="1638" data-end="1813">Compare internally derived rewards against baselines such as human-generated verifiers, reward models, task-level outcome rewards, and standard intrinsic motivation methods.</li> <li data-section-id="1rfl071" data-start="1814" data-end="1984">Use techniques such as probing, activation analysis, sparse autoencoders, causal interventions, feature attribution, or representation analysis to study model behavior.</li> <li data-section-id="9lnhda" data-start="1985" data-end="2159">Analyze failure modes, including reward hacking, spurious features, non-causal correlations, objective misspecification, and overfitting to narrow evaluation distributions.</li> <li data-section-id="timnai" data-start="2160" data-end="2299">Build research code, evaluation harnesses, and experimental infrastructure that make results reproducible and useful to the broader team.</li> <li data-section-id="1jl7l9d" data-start="2300" data-end="2415">Communicate research progress clearly through written updates, internal presentations, and final project outputs.</li> </ul> <h2 data-section-id="1f4wdkt" data-start="2417" data-end="2437">Role Requirements</h2> <ul data-start="2439" data-end="3761"> <li data-section-id="5m18h3" data-start="2439" data-end="2732">Currently enrolled in a PhD program in machine learning, computer science, artificial intelligence, computational neuroscience, mathematics, or a related technical field.
Exceptional candidates with equivalent research experience may also be considered.</li> <li data-section-id="qpw4ak" data-start="2733" data-end="2929">Track record of research excellence or strong research promise, demonstrated through publications, preprints, open-source work, technical projects, competitions, or publicly available artifacts.</li> <li data-section-id="18f5yz9" data-start="3035" data-end="3188">Working understanding of reinforcement learning.</li> <li data-section-id="1chv4e5" data-start="3189" data-end="3318">Familiarity with mechanistic interpretability, representation analysis, or empirical methods for understanding neural networks.</li> <li data-section-id="u0z0yw" data-start="3319" data-end="3433">Strong programming ability in Python and experience with at least one major ML framework such as PyTorch or JAX.</li> <li data-section-id="19t71t6" data-start="3670" data-end="3761">Clear written and verbal communication of technical ideas.</li> </ul> <h2 data-section-id="17hey2t" data-start="3763" data-end="3778">Nice to have</h2> <ul data-start="3780" data-end="4977"> <li data-section-id="1tpaa6i" data-start="3780" data-end="3951">Experience with LLM post-training methods.</li> <li data-section-id="tcxrzw" data-start="4299" data-end="4461">Familiarity with intrinsic motivation, unsupervised RL, auxiliary objectives, representation learning for RL, or curiosity-driven learning.</li> <li data-section-id="1d5jevc" data-start="4590" data-end="4720">Experience with scalable ML experimentation, distributed training, experiment tracking, or reproducible research infrastructure.</li> <li data-section-id="128d1mn" data-start="4847" data-end="4977">Interest in turning mechanistic understanding into practical training methods, rather than only analyzing models after training.</li> </ul> <h2><strong>Role-specific location policy</strong></h2> <ul> <li>This role is based in our San Francisco office; for exceptional candidates, we are willing to consider a hybrid arrangement.</li> </ul>
