About FirstIgnite
FirstIgnite is an AI‑powered business development platform that helps university technology transfer offices turn research breakthroughs into partnerships, licenses, and companies. We combine LLM‑driven workflows with human relationships to deliver expert discovery, grants search, and AI‑driven outreach.

Role Overview
We are hiring an AI Evaluation Engineer to own the quality bar for every LLM‑powered feature we ship. You will design, build, and scale the infrastructure that provides evidence on whether changes to prompts, models, or agents improve or degrade performance. This is a high‑leverage role that interacts closely with engineering, product, and applied AI teams.

What You’ll Do
Build evaluation infrastructure: design and maintain evaluation suites using Promptfoo, LLM‑as‑judge methods, and custom harnesses for features such as expert search, grants search, and AI SDR agents.
Define what "good" means: translate fuzzy customer outcomes into precise, measurable rubrics in partnership with product and domain experts.
Own the feedback loop: instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.
Ship quickly under uncertainty: run 48‑hour evaluation sprints for greenfield features and bootstrap quality signals from scratch.
Model and prompt evaluation: conduct rigorous A/B comparisons across models, prompt strategies, and agent architectures, quantifying trade‑offs between cost, latency, and quality.
Agent evaluation: measure multi‑step agent behavior, tool‑use correctness, trajectory quality, and end‑to‑end task completion for systems built on OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud.
Raise the floor for the team: create templates, documentation, and tooling so every engineer can write and run evaluations as part of normal development. Evaluations should feel as natural as unit tests.

Requirements
3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.
Hands‑on experience with LLM evaluation frameworks such as Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent tooling.
Strong grasp of LLM‑as‑judge methodology, its failure modes, and mitigation strategies.
Statistical literacy: distinguish real regressions from noise and design experiments that answer the intended questions.
Product instincts: identify critical failure modes from customer success transcripts and ship an evaluation for each within a week.
Strong written communication skills to ensure engineers trust and read evaluation results.

Preferred Qualifications
Experience evaluating retrieval systems (RAG, hybrid search, re‑ranking) over structured or semi‑structured domains like research, grants, or patents.
Exposure to agent orchestration frameworks such as Temporal, LangGraph, or OpenAI Agents SDK and the challenges of evaluating multi‑step, tool‑using systems.
Background in information retrieval, search relevance, or a research‑adjacent field.
Experience building internal tooling or dashboards that non‑engineers (PMs, domain experts) use to label and review model outputs.

Why This Role
You will be the first dedicated evaluator at FirstIgnite, shaping the scope, standards, and tooling. AI quality is central to our product success and revenue, and this role sits on the critical path to delivering accurate and trustworthy LLM solutions. You will work with a small, senior, distributed team of roughly ten engineers, with direct access to high‑stakes LLM use cases in research discovery, grants, and outbound outreach.

AI Evaluation Engineer
FIRSTIGNITE
Veracruz, Veracruz
Posted 7 days ago