About the team

The Origination Decisions team builds and operates the machine-learning-powered system that decides whether to approve loan applications and under which conditions. The team is small (4 people) and every member owns a vertical slice of the product end-to-end, from data pipelines through model training to production deployment, for a subset of lending products. You will therefore not only lead improvements in your area of expertise, but also regularly use the full stack as an end-user, giving you first-hand insight into what works and what doesn’t.

The role

You will own the production lifecycle of our ML-based decision services: deploying them reliably, monitoring them continuously, and making them easy to evolve. This is not a traditional DevOps or SRE role. You need to understand how machine-learning systems fail (silently degrading predictions, distribution shifts, broken upstream schemas that subtly bias features) and design safeguards that catch these issues before they reach customers.

Key responsibilities

Deployment & release management

Design and maintain the promotion pipeline from pull request to dev, staging, and production, including the criteria and automated checks at each gate.
Manage containerized services on Kubernetes: image optimization, resource scaling, granular per-decider deployments.
Coordinate schema and API changes with the teams that maintain the upstream and downstream .NET / TypeScript services.

Testing & quality gates

Strengthen automated PR checks: decision-impact visualizations, anomaly detection on training data and back-populated predictions, and integration of upstream/downstream service code into automated LLM-assisted reviews.
Improve the Bruno API test suites that run against the dev environment after every merge, balancing coverage with cost.
Extend the staging validation system that replays production traffic: detect divergences in computed features, approval statistics, and schema conformance between staging and production models.

Monitoring & observability

Design and maintain production monitoring: dashboards, alerts, and cross-service distributed tracing of the full onboarding flow.
Define and track ML-specific health metrics (approval rates, score distributions, feature drift) alongside standard service metrics (latency, error rates, resource usage).
Build tooling that transforms the internal decision trace into human-readable explanations for operations and compliance stakeholders.

Reliability & graceful degradation

Coordinate with upstream data providers to define fallback strategies when external data is unavailable (secondary providers, default values, deferred decisions).
Extend the input-validation framework so that non-critical schema violations fall back to safe defaults (with alerts) while critical violations block the decision, and simulate the impact of those fallbacks on decision quality.

API design & integration

Design and implement new endpoints as the product evolves (e.g., counter-offers, intermediary onboarding steps, modified loan conditions).
Integrate new data sources into the online decision path, including features from video-call analysis and a low-latency feature store for returning customers, in coordination with the pipeline engineer.

Performance optimization

Profile and optimize inference time: replace heavy dependencies (e.g., LightGBM → ONNX), evaluate faster data-processing libraries (e.g., Polars over pandas), and offload hot paths with compiled code where justified.
Keep base Docker images lean and startup times low.

Cross-team code review

Review pull requests in adjacent repositories (primarily C# / .NET and TypeScript / React) that affect the services immediately upstream or downstream of the decision system, to catch integration issues early.

Benefits

Attractive compensation package, including stock options.
Fast-paced environment with significant growth opportunities.
15 annual vacation days + 7 annual personal days.
Option to work remotely 3-4 days per week, or fully remote (as long as you can come to CDMX about twice a year).
Flexible work schedule.

ML Production Engineer - Origination Decisions
AVIVACREDITO
Ciudad de México, Ciudad de México
Posted 22 days ago