DevJobs

MLOps Engineer

Skills
  • Python
  • PyTorch
  • CI/CD
  • Git
  • Docker
  • GPUDirect
  • h5
  • NFS over RDMA
  • Parquet
  • RoCE
  • Slurm
  • testing frameworks


Q is searching for an MLOps engineer who will build and maintain stable infrastructure and pipelines for data curation, model training, and ongoing ML research. Our company continuously collects large amounts of multimodal data, which must be packaged in formats suited to high-speed reading by large GPU clusters. We also run multiple streams of research and development on shared resources, which requires efficient resource allocation, experiment tracking, and evaluation frameworks. You will work directly with the ML team to gather their requirements and observe their workflow in order to improve and accelerate model training and inference.

Responsibilities:

  • Develop robust and reliable data pipelines that process our continuously collected proprietary data for training and validation.
  • Develop and optimize orchestration processes for testing and managing AI models.
  • Create model-quality and performance dashboards to continuously track model improvement.
  • Collaborate with cross-functional teams to identify bottlenecks and implement solutions to improve workflow efficiency.
  • Create easy-to-access evaluation tools that teams across the company can use to test and inspect our models.
  • Develop shared utilities for setting up systems, running tests, and recording results.
  • Design, deploy, and maintain performant and scalable processes to acquire and manipulate data and make datasets more easily accessible to the team.

Requirements:

  • Strong experience with PyTorch, including optimizing and deploying machine learning models for production environments.
  • Experience with workload scheduling at scale (Slurm).
  • Expertise in loading data onto GPUs, using technologies such as RoCE, GPUDirect, and NFS over RDMA.
  • Deep understanding of different file formats and structures (h5, Parquet, etc.).
  • Proficiency in Python and common software development tools: Git, testing frameworks, CI/CD, and Docker.
  • Ability to design, implement, and manage CI/CD pipelines that connect to real hardware, ensuring automated and reliable model deployment and updates.
  • Passionate about and well-versed in DevOps/MLOps practices.
  • Demonstrated expertise in troubleshooting and optimizing both software and hardware systems to support efficient ML model deployment.
  • Strong teamwork and communication skills to collaborate with cross-functional teams, including data scientists, software engineers, and hardware specialists.
Q.ai