
Senior MLOps Engineer

Quantiphi

9 months ago

5 - 7 years

Hybrid

Bengaluru, Karnataka, India

  Skills: MLOps, Azure ML, PyTorch, TensorFlow, Python, shell scripting, CI/CD pipeline

    Job description & requirements

    Roles and Responsibilities:


    • Design, deploy, and maintain distributed systems using Kubernetes and Slurm for optimal resource utilization and workload management.
    • Lead the configuration and optimization of Multi-GPU, Multi-Node Deep Learning job scheduling, ensuring efficient computation and data processing.
    • Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions.
    • Develop and maintain complex shell scripts for various system automation tasks, enhancing efficiency and reducing manual intervention.
    • Monitor system performance, identify bottlenecks, and implement necessary adjustments to ensure high availability and reliability.
    • Troubleshoot and resolve technical issues related to the distributed system, job scheduling, and deep learning processes.
    • Stay updated with industry trends and emerging technologies in distributed systems, deep learning, and automation.


    Skill Set Needed: 


    • Hands-on experience with MLOps tooling: Azure ML (preferred), MLflow, Kubeflow, AutoML, etc.
    • Understanding of at least one ML framework (PyTorch or TensorFlow) is good to have.
    • Proficiency in Python and experience with shell/Linux scripting.
    • Good understanding of logical networks.
    • Proven experience in designing, deploying, and managing distributed systems, with a focus on Kubernetes and Slurm.
    • Sufficient understanding of AI model training and deployment, and a strong background in multi-GPU, multi-node deep learning job scheduling and resource management.
    • Proficiency in Linux systems, particularly Ubuntu, and the ability to navigate and troubleshoot related issues.
    • Extensive experience creating complex shell scripts for automation and system orchestration.
    • Familiarity with continuous integration and deployment (CI/CD) processes.
    • Excellent problem-solving skills and the ability to diagnose and resolve technical issues promptly.
    • Strong communication and collaboration skills to work effectively within a cross-functional team.


    Good to Have:

    • Prior experience with, or good awareness of, the NVIDIA ecosystem (Triton Inference Server, CUDA). Experience working with on-prem NVIDIA GPU servers.

    Experience :

    5 - 7 years

    Job Domain/Function :

    Machine Learning

    Job Type :

    Hybrid

    Employment Type :

    Full Time

    Number Of Position(s) :

    1

    Educational Qualifications :

    Bachelor's Degree

    Location 1 :

    Bengaluru, Karnataka, India

    Location 2 :

    Mumbai, Maharashtra, India

    Location 3 :

    Trivandrum, Kerala, India
