Member of Technical Staff (MTS) - Multimodal Foundation Models
Deeproute.ai
Seniority: Staff
About the role
Focus
Multimodal Foundation Models · Representation Learning · Method Innovation
We are looking for strong technical builders and researchers who understand foundation models and representation learning in depth, beyond simply applying existing frameworks.
Ideal candidates should have:
- Strong experimental rigor
- Solid systems and modeling intuition
- Hands-on engineering ability
- Interest in scalable multimodal AI systems for real-world autonomy
We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.
Responsibilities
1. Large-Scale Foundation Model Pretraining
- Develop scalable pretraining pipelines for large-scale multimodal driving data
- Design and optimize training strategies for:
  - Vision-language-action models
  - Video foundation models
  - Long-context temporal modeling
  - Multimodal representation alignment
- Improve:
  - Training stability
  - Data efficiency
  - Scaling efficiency
  - Representation robustness
- Work on distributed training systems and large-scale model optimization using frameworks such as:
  - PyTorch Distributed
  - DeepSpeed
  - Megatron-LM
2. Representation Learning & Method Innovation
- Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems
- Conduct architecture-level research on:
  - Vision Transformers (ViT)
  - Video / temporal architectures
  - Multimodal fusion and alignment
  - Embedding and retrieval systems
  - Long-context and memory-efficient architectures
- Explore and improve:
  - Pretraining objectives
  - Loss functions
  - Training paradigms
  - Generalization and robustness
- Analyze model behavior through:
  - Rigorous ablation studies
  - Failure case analysis
  - Representation probing and evaluation
3. Efficient Foundation Models & Scalable Deployment
- Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems
- Work on areas such as:
  - Model quantization
  - Knowledge distillation
  - Efficient attention mechanisms
  - Sparse architectures and Mixture-of-Experts (MoE)
  - Long-context and memory-efficient modeling
  - Inference acceleration and serving optimization
  - Training and inference system efficiency
- Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments
Requirements
- MS or PhD in:
  - Computer Vision
  - Machine Learning
  - Robotics
  - Computer Science
  - Related fields
- Strong understanding of:
  - Foundation models
  - Self-supervised learning
  - Representation learning
  - Multimodal learning
  - Large-scale pretraining
- Hands-on experience with methods such as:
  - CLIP
  - DINO / DINOv2
  - MAE
  - Contrastive learning
  - Masked modeling
  - MoE or scalable transformer architectures
- Experience with one or more of the following is highly valued:
  - Video foundation models
  - Long-context modeling
  - Retrieval systems
  - Efficient inference
  - Distributed training
  - Model compression and deployment optimization
- A strong publication record in top-tier venues is preferred, such as:
  - CVPR
  - ICCV
  - ECCV
  - NeurIPS
  - ICLR
  - ICML