This Is An Opportunity To
- Design, build, and maintain scalable, production-grade machine learning systems for real-time recommendations and generative services.
- Develop and own the MLOps pipelines for continuous integration and continuous delivery (CI/CD), training, validation, and monitoring of all recommendation models.
- Engineer robust data pipelines using big data technologies to process vast datasets for model training and feature engineering.
- Implement and optimize both traditional ML models and state-of-the-art Generative AI models (including LLMs) for low-latency serving and high-throughput environments.
- Collaborate closely with Applied Researchers to translate novel algorithms and research prototypes into hardened, production-ready code.
- Develop and manage the APIs and infrastructure necessary for serving recommendations and integrating with other systems.
- Champion software engineering best practices, including code reviews, testing, and documentation, within the machine learning team.
- Monitor system performance, identify and resolve production issues, and continuously improve the reliability and efficiency of our ML services.
- Mentor other team members through code reviews, technical guidance, architecture design, and pair programming.
Qualifications
- MS in Computer Science or a related area with 6 years of relevant work experience (or BS/BA with 8 years) in ML / AI / Data Engineering.
- Expert in production engineering practices and software development in an object-oriented language (Scala, Java, etc.).
- Extensive experience with big data distributed processing frameworks, e.g. Apache Hadoop, Spark, Flink.
- Experience with ML frameworks like TensorFlow and PyTorch from a production perspective. Experience with serving frameworks (TensorFlow Serving, TorchServe, NVIDIA Triton) and libraries for LLM operations (LangChain, Hugging Face Transformers) preferred.
- Proven ability to build and manage CI/CD pipelines for ML models, including proficiency with containerization and orchestration (Docker, Kubernetes).
- Experience with cloud services (e.g. AWS, GCP, Azure), big data pipelines, and databases.
- Proven ability to design and build scalable, distributed systems and expose their functionality through well-designed RESTful or gRPC APIs.
- A deep understanding of the challenges and requirements of running machine learning in a live, 24/7 production environment, including monitoring, alerting, and incident response.