Research Fellowship - Mechanistic Interpretability
vmax
San Francisco · 1mo ago
About the role
<h2><strong>About <em>V<sub>max</sub></em></strong></h2>
<p><em>V<sub>max</sub></em> is an applied research lab developing AI capable of open-ended learning. We are building systems to exceed humans in all capacities by optimizing beyond the local maxima of learning from human expertise.</p>
<h2>About the role</h2>
<p>LLMs are fantastically powerful, and a rapidly growing body of work is devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.</p>
<p>This 3- to 6-month fellowship is for PhD students or equivalent early-career researchers who want to work at the intersection of mechanistic interpretability and reinforcement learning. You will own a focused research project, work closely with <em>V<sub>max</sub></em> technical staff, and contribute to research publications.</p>
<h2 data-section-id="r8dte7" data-start="1137" data-end="1156">Responsibilities</h2>
<ul data-start="1158" data-end="2415">
<li data-section-id="15ssf1m" data-start="1158" data-end="1316">Develop mechanistic interpretability methods for understanding internal representations, features, circuits, and computations in language models and agents.</li>
<li data-section-id="161scmh" data-start="1317" data-end="1476">Investigate how model internals can be used to generate intrinsic rewards, auxiliary objectives, diagnostics, or training signals for reinforcement learning.</li>
<li data-section-id="19nvjxr" data-start="1477" data-end="1637">Design and run experiments that test whether interpretability-derived signals improve learning, exploration, generalization, robustness, or sample efficiency.</li>
<li data-section-id="i8nt98" data-start="1638" data-end="1813">Compare internally derived rewards against baselines such as human-generated verifiers, reward models, task-level outcome rewards, and standard intrinsic motivation methods.</li>
<li data-section-id="1rfl071" data-start="1814" data-end="1984">Use techniques such as probing, activation analysis, sparse autoencoders, causal interventions, feature attribution, or representation analysis to study model behavior.</li>
<li data-section-id="9lnhda" data-start="1985" data-end="2159">Analyze failure modes, including reward hacking, spurious features, non-causal correlations, objective misspecification, and overfitting to narrow evaluation distributions.</li>
<li data-section-id="timnai" data-start="2160" data-end="2299">Build research code, evaluation harnesses, and experimental infrastructure that make results reproducible and useful to the broader team.</li>
<li data-section-id="1jl7l9d" data-start="2300" data-end="2415">Communicate research progress clearly through written updates, internal presentations, and final project outputs.</li>
</ul>
<h2 data-section-id="1f4wdkt" data-start="2417" data-end="2437">Role Requirements</h2>
<ul data-start="2439" data-end="3761">
<li data-section-id="5m18h3" data-start="2439" data-end="2732">Currently enrolled in a PhD program in machine learning, computer science, artificial intelligence, computational neuroscience, mathematics, or a related technical field. Exceptional candidates with equivalent research experience may also be considered.</li>
<li data-section-id="qpw4ak" data-start="2733" data-end="2929">Track record of research excellence or strong research promise, demonstrated through publications, preprints, open-source work, technical projects, competitions, or publicly available artifacts.</li>
<li data-section-id="18f5yz9" data-start="3035" data-end="3188">Working understanding of reinforcement learning.</li>
<li data-section-id="1chv4e5" data-start="3189" data-end="3318">Familiarity with mechanistic interpretability, representation analysis, or empirical methods for understanding neural networks.</li>
<li data-section-id="u0z0yw" data-start="3319" data-end="3433">Strong programming ability in Python and experience with at least one major ML framework such as PyTorch or JAX.</li>
<li data-section-id="19t71t6" data-start="3670" data-end="3761">Clear written and verbal communication of technical ideas.</li>
</ul>
<h2 data-section-id="17hey2t" data-start="3763" data-end="3778">Nice to have</h2>
<ul data-start="3780" data-end="4977">
<li data-section-id="1tpaa6i" data-start="3780" data-end="3951">Experience with LLM post-training methods.</li>
<li data-section-id="tcxrzw" data-start="4299" data-end="4461">Familiarity with intrinsic motivation, unsupervised RL, auxiliary objectives, representation learning for RL, or curiosity-driven learning.</li>
<li data-section-id="1d5jevc" data-start="4590" data-end="4720">Experience with scalable ML experimentation, distributed training, experiment tracking, or reproducible research infrastructure.</li>
<li data-section-id="128d1mn" data-start="4847" data-end="4977">Interest in turning mechanistic understanding into practical training methods, rather than only analyzing models after training.</li>
</ul>
<h2><strong>Role-specific location policy</strong></h2>
<ul>
<li>This role is based in our San Francisco office; for exceptional candidates, we are willing to consider a hybrid arrangement.</li>
</ul>