Member of Technical Staff, Model Evaluation

Inception

Bay Area | $200k–350k | On-site | 3mo ago
Employment: Full-time
Seniority: Staff

About the role

  • Design, develop, and maintain robust evaluation frameworks and benchmarks for measuring LLM performance across diverse tasks and domains.
  • Define and implement quantitative metrics that capture model quality, safety, reliability, and regression detection.
  • Build scalable, automated evaluation pipelines that integrate into model training and deployment workflows.
  • Conduct rigorous statistical analysis of model outputs to identify failure modes, biases, and performance gaps.
  • Partner with product and customer-facing teams to translate real-world use cases into meaningful evaluation criteria.

Qualifications

  • BS/MS/PhD in Computer Science, Machine Learning, Statistics, or a related field (or equivalent experience).
  • At least 2 years of experience in ML evaluation, applied ML research, or a related engineering role.
  • Strong understanding of LLM fundamentals (autoregressive generation, instruction tuning, RLHF, in-context learning, decoding strategies).
  • Proficiency in Python and ML frameworks such as PyTorch.
  • Experience designing and implementing evaluation metrics and benchmarks for generative models.
  • Solid foundation in statistics, experimental design, and hypothesis testing.
  • Experience with version control (Git) and containerization (Docker).
  • Excellent communication skills with the ability to distill complex evaluation results into actionable insights.
  • Experience with human-in-the-loop evaluation systems (Likert-scale annotation, pairwise preference ranking, red-teaming).
  • Familiarity with LLM safety and alignment evaluation (toxicity, hallucination detection, factual grounding).
  • Knowledge of existing benchmark suites (MMLU, HumanEval, HELM, BIG-Bench) and their limitations.
  • Experience building evaluation infrastructure at scale using cloud platforms (AWS, GCP, Azure).
  • Familiarity with MLOps practices and CI/CD pipelines for model validation.
  • Experience with data engineering, large-scale data labeling, or synthetic data generation for evaluation purposes.

Compensation

  • Work with World-Class Talent: Collaborate with the inventors of diffusion models and leading AI researchers
  • Shape Foundational Technology: Your decisions will influence how the next generation of AI products is built and used
  • Immediate Impact: Join at the ground floor where your contributions directly shape product direction and company trajectory
  • Competitive salary and equity in a rapidly growing startup
  • Flexible vacation and paid time off (PTO)
  • Health, dental, and vision insurance
  • 401k match
  • Catered meals (breakfast, lunch, & dinner)
  • Commuter subsidies
  • A collaborative and inclusive culture

Perks & benefits

  • 401k
  • Vision Insurance
  • Unlimited Vacation
  • Paid Time Off
  • Pension Matching
  • Equity Compensation
