AI-OPS Team Member | IT Infrastructure
We are seeking a skilled and motivated AI-OPS Team Member to join our IT Infrastructure – Tools & Collaboration department.
This is an exciting opportunity to be part of a newly established AI-OPS team, responsible for centralizing and optimizing the management of our organization’s AI-related infrastructure across all environments.
About the Role
As an AI-OPS Team Member, you will play a key role in managing and maintaining our AI infrastructure, both on-premises and in the cloud. You will ensure the availability, performance, scalability, and security of the AI platforms, tools, and hardware resources that support our enterprise AI initiatives.
Key Responsibilities
- Operate and maintain AI infrastructure with a focus on stability, performance, and scalability.
- Deploy, configure, and troubleshoot AI platforms, containerized environments, and GPU-based workloads.
- Monitor and optimize resource utilization (CPU, GPU, memory, storage, network).
- Contribute to CI/CD processes and implement automation using Infrastructure as Code (IaC).
- Manage security, access controls, and compliance across AI systems.
- Collaborate with AI Platform Engineers and AI Security Engineers to resolve incidents and improve reliability.
- Maintain documentation, runbooks, and operational best practices.
- Continuously identify opportunities to improve efficiency and cost-effectiveness in AI infrastructure management.
Required Qualifications
- Proven experience in IT infrastructure operations and management.
- Hands-on experience with AI/ML platforms (e.g., MLflow, Kubeflow).
- Proficiency with CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
- Strong knowledge of cloud infrastructure (Azure preferred; AWS or GCP also accepted).
- Expertise in Docker and Kubernetes, especially for AI workloads.
- Experience with GPU management and optimization.
- Proficiency in Infrastructure as Code (e.g., Terraform, Ansible).
- Solid Linux administration skills.
- Familiarity with Prometheus, Grafana, and similar monitoring tools.
- Scripting ability in Python and Bash.
- Strong understanding of networking, distributed systems, and infrastructure security.
- Experience managing or deploying vector databases.
- Familiarity with AI infrastructure vendor solutions.
- Experience in infrastructure centralization projects.
- Relevant cloud or container certifications.