Overview

About FirstIgnite: FirstIgnite is the AI-powered business development platform for university technology transfer offices (TTOs). We help research institutions turn breakthroughs into partnerships, licenses, and companies by combining deep LLM-driven workflows with the relationships that actually move deals forward. Our product suite spans expert discovery, grants search, and AI-driven outreach, all built on a modern, agentic stack. We ship fast, we measure everything, and we believe evaluations are the difference between AI features that demo well and AI features that work in production.

Responsibilities

Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.

Define what "good" means: Partner with product and domain experts to translate fuzzy customer outcomes (for example, "does this surface the right principal investigator?") into precise, measurable rubrics.

Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.

Ship quickly under uncertainty: Run 48-hour eval sprints for greenfield features with no production traffic; bootstrap quality signal from scratch.

Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures; quantify tradeoffs between cost, latency, and quality.

Agent evaluation: Help measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud, including tool-use correctness, trajectory quality, and end-to-end task completion.

Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel like unit tests.
Qualifications

3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.

Hands-on experience with LLM evaluation frameworks: Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.

Strong grasp of LLM-as-judge methodology, including failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.

Statistical literacy: ability to distinguish real regression from noise and design experiments that answer the intended question.

Product instincts: ability to analyze customer success transcripts, identify failure modes, and ship an eval for each by end of week.

Strong written communication: Evals are useful only if engineers trust and read the results.

Preferred Qualifications

Experience evaluating retrieval systems (RAG, hybrid search, reranking) over structured or semi-structured domains like research, grants, or patents.

Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the challenges of evaluating multi-step, tool-using systems.

Background in information retrieval, search relevance, or a research-adjacent domain.

Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) use to label and review model outputs.

Why This Role

You'll be the first dedicated evals hire; scope, standards, and tooling are yours to define.

AI quality is existential for our product; this role is on the critical path to revenue rather than a compliance task.

Small, senior team: ~10 engineers, distributed globally, with a focus on shipping and measuring.

Direct access to real-world, high-stakes LLM use cases across research discovery, grants, and outbound, in a customer base that values accuracy.
AI Evaluation Engineer
FIRSTIGNITE
Valle de Chalco Solidaridad
Posted 7 days ago