We are building enterprise-grade AI platforms for observability, evaluation, command and control, and safeguards across generative and agentic AI systems deployed in complex, regulated environments.
In this role, you will design, integrate, and operate core AI platform capabilities, ensuring that intelligent systems run safely, reliably, and in full alignment with enterprise expectations for security, auditability, resiliency, and operational excellence. Your work will span agent tracing, evaluation pipelines, guardrails and intervention services, registry and governance tooling, and operational control experiences.
In This Role, You Will
Design and build production multi-agent systems, coordinating specialized agents through orchestrator patterns with clearly defined tool-use protocols and inter-agent communication
Implement Model Context Protocol (MCP) servers to connect AI agents with external tools, data sources, APIs, and enterprise services
Build and operate RAG pipelines and knowledge graph-backed retrieval systems, leveraging vector and graph databases to ground agent reasoning in accurate, contextual data
Develop LLM-powered analysis capabilities (semantic understanding of logs, code, and configurations) to drive intelligent automation within multi-step agent workflows
Design and operate distributed observability systems using OpenTelemetry, building self-healing automation that detects anomalies, performs root-cause analysis, and triggers autonomous remediation
Build event-driven agent pipelines using Kafka and message queue systems, ensuring reliable and ordered processing across distributed agent components
Implement agent governance controls including safety guardrails, approval workflows, blast-radius limits, and audit logging to ensure agents operate within defined boundaries
Integrate AI-powered quality and compliance gates into CI/CD pipelines, enabling automated validation at each stage of the delivery lifecycle
Translate enterprise requirements into modular, maintainable agent architectures and contribute to large-scale agentic AI strategy
Collaborate with peers, client engineering leads, and program managers to resolve technical challenges, meet delivery targets, and communicate progress clearly
Lead projects and act as an escalation point, providing mentorship and technical guidance to less experienced engineers
Maintain strong operational rigor: runbooks, incident response procedures, performance regression gating, and documentation for audit and governance
Required Qualifications
4+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, or education
2+ years of hands-on experience building and deploying generative AI or agentic AI systems in production environments
Strong proficiency in Python; experience designing and consuming REST and/or gRPC APIs
Demonstrated experience with LLM integration, prompt engineering, and tool-use patterns in multi-step AI workflows
Experience with at least one agentic AI framework (e.g., LangGraph, AutoGen, CrewAI, OpenAI Swarm, Google ADK, or Claude Agent SDK)
Solid understanding of distributed systems, event-driven architecture, and microservices design
Experience with cloud-native infrastructure and containerized deployments (Docker, Kubernetes)
Strong written and verbal communication skills with the ability to document technical designs, present to stakeholders, and produce clear operational artifacts
Desired Qualifications
Hands-on experience building production agentic AI systems using one or more frameworks such as OpenAI Swarm, AutoGen, CrewAI, LangGraph, Google ADK, or Claude Agent SDK
Experience implementing Model Context Protocol (MCP) integrations for tool use, context management, and agent-to-agent communication
Experience designing and implementing RAG (Retrieval-Augmented Generation) pipelines for knowledge-grounded AI applications
Proficiency with vector databases such as Pinecone, Weaviate, Qdrant, pgvector, Redis Vector DB, or FAISS
Experience with graph databases, particularly Neo4j, for relationship-aware data modeling and querying
Hands-on experience with distributed observability and self-healing systems: instrumenting services, detecting anomalies, and triggering automated remediation
Experience with OpenTelemetry for distributed tracing, metrics, and logging across multi-service architectures
Experience with event streaming platforms such as Kafka and message queue systems such as RabbitMQ, ZeroMQ, or Redis MQ
Proficiency in Python (5+ years), with experience in REST and/or gRPC APIs and event-driven design patterns
Experience with identity and access management (OAuth scopes, RBAC/ABAC) and sensitive data handling for enterprise applications
Good to Have
Experience with containerization and orchestration using Docker and Kubernetes
Familiarity with sandboxed code execution environments such as Docker or Firecracker