Must have:
Cloud Expertise – Strong understanding of cloud platforms (Azure/AWS) and of AI/ML services and tooling such as Databricks, Azure Cognitive Services, and MLflow.
Infrastructure as Code (IaC) – Hands-on experience with Terraform and IaC orchestration tools such as Terragrunt.
Scripting & Automation – Strong command-line and scripting proficiency in Bash, Python, or equivalent languages.
Containerisation & Orchestration – Expertise in Docker and Kubernetes, and an understanding of how they optimise ML development workflows.
Monitoring & Observability – Experience implementing monitoring and observability for ML-specific use cases.
Collaboration & Communication – Excellent written and verbal communication skills, with the ability to work in collaborative, multi-cultural teams.
Nice to have:
ML Workflow Automation – Experience orchestrating ML pipelines using tools such as Jenkins, GitHub Actions, or dedicated compute environments.
Model & Data Management – Familiarity with model registries, AI agents, Retrieval-Augmented Generation (RAG) techniques, and frameworks such as LangChain and LlamaIndex.
Hands-on experience with Databricks, Azure ML, or SageMaker.
Understanding of security best practices for MLOps, including data privacy & compliance in cloud platforms.
Knowledge of ML frameworks like TensorFlow, PyTorch, or Scikit-learn.
Experience working in complex enterprise environments with strict security and compliance requirements.
Strong networking fundamentals, including configuring and maintaining secure mTLS-based communication between services.
Excellent problem-solving skills and attention to detail.
Exposure to Java or R, beneficial in enterprise AI environments.
Hands-on experience with observability stacks such as Prometheus, Grafana, Splunk, or ELK, and with tuning observability for ML-specific use cases.
Role Responsibilities:
Automate & Optimise AI/ML Infrastructure – Enable scalable, repeatable, and secure AI/ML services for research and development (R&D).
Collaborate Across Teams – Work with ML Engineers, DevOps, and Software teams to design robust ML infrastructure and deployment strategies.
Evaluate & Integrate Emerging Technologies – Continuously assess and integrate MLOps best practices to enhance automation, efficiency, and security.
Monitor & Improve ML Operations – Implement proactive monitoring & alerting solutions to improve system performance, reliability, and operational insights.
Perform Research & Proof-of-Concepts (PoCs) – Conduct research and evaluate new technologies to drive innovation and improve AI/ML development and integration cycles.
Contribute to Internal & External Knowledge Sharing – Document findings, best practices, and PoCs to support broader engineering teams.