Senior MLOps Engineer

Quantiphi

9 months ago

5 - 7 years

Hybrid

Mumbai, Maharashtra, India

  • Design, deploy, and maintain distributed systems using Kubernetes and Slurm; lead the configuration and optimization of Multi-GPU, Multi-Node Deep Learning job scheduling.
  • Develop and maintain complex shell scripts for various system automation tasks, enhancing efficiency and reducing manual intervention.
  • Monitor system performance, identify bottlenecks, and troubleshoot and resolve technical issues.
  • MLOps

    Azure ML

    MLflow

    Kubeflow

    PyTorch

    TensorFlow

    Python

    Linux and Shell Scripting

    Distributed Systems

    AI modelling

    Deep Learning

    Job description & requirements

    Roles and Responsibilities:


    Role: Senior Platform Engineer (MLOps)

    Experience Level: 3 to 6 Years 

    Location: Mumbai/Bangalore (Hybrid)


    • Design, deploy, and maintain distributed systems using Kubernetes and Slurm for optimal resource utilization and workload management.
    • Lead the configuration and optimization of Multi-GPU, Multi-Node Deep Learning job scheduling, ensuring efficient computation and data processing.
    • Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions.
    • Develop and maintain complex shell scripts for various system automation tasks, enhancing efficiency and reducing manual intervention.
    • Monitor system performance, identify bottlenecks, and implement necessary adjustments to ensure high availability and reliability.
    • Troubleshoot and resolve technical issues related to the distributed system, job scheduling, and deep learning processes.
    • Stay updated with industry trends and emerging technologies in distributed systems, deep learning, and automation.


    Skill Set Needed: 

    • Hands-on experience with MLOps tooling: Azure ML (preferred), MLflow, Kubeflow, AutoML, etc.
    • Good to have: understanding of at least one ML framework (PyTorch or TensorFlow).
    • Proficiency in Python and experience with shell/Linux scripting.
    • Good understanding of logical networks.
    • Proven experience in designing, deploying, and managing distributed systems, with a focus on Kubernetes and Slurm.
    • Sufficient understanding of AI model training and deployment, and a strong background in Multi-GPU, Multi-Node Deep Learning job scheduling and resource management.
    • Proficiency in Linux systems, particularly Ubuntu, and the ability to navigate and troubleshoot related issues.
    • Extensive experience creating complex shell scripts for automation and system orchestration.
    • Familiarity with continuous integration and deployment (CI/CD) processes.
    • Excellent problem-solving skills and the ability to diagnose and resolve technical issues promptly.
    • Strong communication and collaboration skills to work effectively within a cross-functional team.


    Good to Have:

    • Prior experience with, or good awareness of, the NVIDIA ecosystem (Triton Inference Server, CUDA), and experience working with on-prem NVIDIA GPU servers.


    Experience :

    5 - 7 years

    Job Domain/Function :

    Machine Learning Engineer

    Job Type :

    Hybrid

    Employment Type :

    Full Time

    Number Of Position(s) :

    1

    Educational Qualifications :

    Bachelor's Degree

    Location :

    Mumbai, Maharashtra, India
