Lead the development and operations of AI and machine learning systems by building scalable, secure, and automated infrastructure across multi-cloud environments.
Drive the production lifecycle of LLM applications, RAG pipelines, and ML models while ensuring performance, observability, and compliance.
Responsibilities:
Build and maintain CI/CD and continuous training pipelines across AWS and Azure platforms.
Design and implement LLMOps frameworks including RAG pipelines and vector database management.
Develop data pipelines to integrate legacy systems with cloud-based ML workflows.
Implement automated model evaluation frameworks for LLMs and traditional ML models.
Deploy monitoring solutions for model performance, drift, latency, and cost management.
Manage infrastructure using Infrastructure as Code tools.
Collaborate with analytics platform teams to ensure seamless data integration.
Work with IT and security teams to manage networking and access controls.
Optimize model serving for scalability and performance using containerization and serverless solutions.
Establish version control for prompts, models, and datasets.
Support data science workflows by automating feature engineering and deployment processes.
Implement security controls to prevent vulnerabilities and ensure compliance.
Skills:
AWS services including SageMaker and Bedrock.
Azure AI services and the wider Azure cloud ecosystem.
Python, SQL, and PySpark.
Containerization using Docker and Kubernetes.
Orchestration tools such as Airflow, Kubeflow, or Step Functions.
Vector databases such as OpenSearch, Pinecone, or Azure AI Search.
Model evaluation frameworks and observability tools.
Infrastructure as Code tools such as Terraform or CloudFormation.
Strong analytical and problem-solving skills.
Excellent communication and collaboration skills.
Qualifications and Education:
Bachelor’s degree in Computer Science or related field required.
Master’s degree in a quantitative discipline preferred.
Experience:
6+ years of engineering experience.
Minimum 3 years of experience in MLOps or LLMOps in production environments.
Experience working in multi-cloud environments preferred.
Experience collaborating with data science and enterprise IT teams preferred.