Member of Technical Staff, Model Evaluation
Inception
Bay Area | $200k–350k | On-site | 3mo ago
- Employment: Full-time
- Seniority: Staff
About the role
- Design, develop, and maintain robust evaluation frameworks and benchmarks for measuring LLM performance across diverse tasks and domains.
- Define and implement quantitative metrics that capture model quality, safety, reliability, and regression detection.
- Build scalable, automated evaluation pipelines that integrate into model training and deployment workflows.
- Conduct rigorous statistical analysis of model outputs to identify failure modes, biases, and performance gaps.
- Partner with product and customer-facing teams to translate real-world use cases into meaningful evaluation criteria.
Qualifications
- BS/MS/PhD in Computer Science, Machine Learning, Statistics, or a related field (or equivalent experience).
- At least 2 years of experience in ML evaluation, applied ML research, or a related engineering role.
- Strong understanding of LLM fundamentals (autoregressive generation, instruction tuning, RLHF, in-context learning, decoding strategies).
- Proficiency in Python and ML frameworks such as PyTorch.
- Experience designing and implementing evaluation metrics and benchmarks for generative models.
- Solid foundation in statistics, experimental design, and hypothesis testing.
- Experience with version control (Git) and containerization (Docker).
- Excellent communication skills with the ability to distill complex evaluation results into actionable insights.
- Experience with human-in-the-loop evaluation systems (Likert-scale annotation, pairwise preference ranking, red-teaming).
- Familiarity with LLM safety and alignment evaluation (toxicity, hallucination detection, factual grounding).
- Knowledge of existing benchmark suites (MMLU, HumanEval, HELM, BIG-Bench) and their limitations.
- Experience building evaluation infrastructure at scale using cloud platforms (AWS, GCP, Azure).
- Familiarity with MLOps practices and CI/CD pipelines for model validation.
- Experience with data engineering, large-scale data labeling, or synthetic data generation for evaluation purposes.
Compensation
- Work with World-Class Talent: Collaborate with the inventors of diffusion models and leading AI researchers
- Shape Foundational Technology: Your decisions will influence how the next generation of AI products is built and used
- Immediate Impact: Join at the ground floor where your contributions directly shape product direction and company trajectory
- Competitive salary and equity in a rapidly growing startup
- Flexible vacation and paid time off (PTO)
- Health, dental, and vision insurance
- 401k match
- Catered meals (breakfast, lunch, & dinner)
- Commuter subsidies
- A collaborative and inclusive culture
Perks & benefits
- 401k
- Vision Insurance
- Unlimited Vacation
- Paid Time Off
- Pension Matching
- Equity Compensation