LLM Evaluation
Understands how to evaluate LLM outputs and the inherent challenges (non-determinism, quality measurement, regression detection)
Hands-On Engineer
Not just an architect - writes code, debugs production issues, and deploys their own work
Preferred / Differentiators
Built multi-step agentic workflows with tool use and function calling
Experience with agent orchestration frameworks (LangGraph, CrewAI, Claude Agent SDK, Google ADK, OpenAI Agents SDK)
Built guardrails, fallbacks, or graceful degradation for AI systems
Streaming inference and async agent orchestration
Cost/latency optimization: caching, batching, prompt compression
ML observability tools: Langfuse, Arize, Braintrust, W&B
Retrieval systems (vector search, hybrid search) - as a tool, not the focus
Screening Questions for Candidates
"Describe a production AI agent or skill system you built. What broke and how did you fix it? "
"Have you built MCP servers/integrations or custom tool-use systems for LLMs? "
"How do you evaluate whether an LLM-based feature is working well? What makes this hard? "
"Walk me through how you'd deploy and scale an AI service on Kubernetes. "
Not a Fit If
Primarily a model trainer/fine-tuner (we're not training models)
AI experience is mainly academic, research, or tutorial-based
No production systems experience (only notebooks/demos)
Looking for entry-level role with heavy mentorship
Background is primarily data science/analytics rather than engineering
"Architects " who don't write or deploy code themselves