About the role

Our Origination Decisions team builds the systems that decide, in real time, whether to grant a loan to an applicant and under which conditions (amount, term, interest rate). The team is small (4 people) and each member owns a slice of the stack end-to-end: a Decider architect, who designs how decisions are composed and how we model profit; a Data Scientist, who trains the most accurate ML models feeding those decisions; a Deployment engineer, who ships the deciders to production and owns the quality gates around them; and the Data Pipeline engineer we are hiring with this role.

On top of your specialty, you will own a subset of our loan products end-to-end: you will build their datasets, train their models, configure their decider and follow them to production. This "dogfooding" keeps you close to the pain points your pipeline creates and is the main feedback loop that drives the roadmap of your area.

What you will own

As the Data Pipeline engineer you are the main guarantor that the rest of the team always has fresh, trustworthy, and easy-to-use datasets to train models, analyze behavior and make decisions on. Concretely, you will:

Build and maintain the data pipeline

Own the team's Dagster pipeline.
Keep assets fresh, observable, and cheap to recompute.
Make it obvious to other team members which datasets exist, what they contain, and how to consume them.

Bring in new data sources

Partner with external data providers on proofs-of-concept: organize backpopulation runs, send them the required samples, store and version the returned data, and evaluate whether the signal is worth productizing.
Similarly, explore internal data that is under-used for origination decisions. For instance, transcripts of collections calls could become features (for loan renewals) or ground truth (to get a better picture of the customer than the payments they made alone can give).
When a source is promising, integrate it end-to-end: reconcile backpopulation dumps with the live API feed, extract features consistently from both, and expose them to downstream consumers.

Design feature computations, not just move data around

Some of the pipeline work is pure data engineering (joins, aggregations, cleaning), but a lot of it is closer to applied math and ML:

Designing Long Term Value (LTV) formulas that chain per-loan profit estimates with time-discounting and population averages for unseen future states (e.g. "average profit of a 3rd loan for customers similar to this one"), so the team can compare counterfactual policies such as "what would the LTV be if we only used base decider X?".
Building offline feature stores for known customers and serving them through a low-latency store so the online decider can use information that doesn't fit in the request payload.
Running reject inference as a recurring process: periodically sampling past rejected applications, pulling fresh credit reports, turning them into pseudo ground truths, and merging them into training datasets.
Using LLMs and other models inside the pipeline when they are the right tool (e.g. extracting features from video-call transcripts, nowcasting loan profits and defaults).
Implementing feature pre-selection inside the pipeline (ranking by predictive power, de-correlating, keeping the top ~N) so that the datasets we ship are an order of magnitude smaller than today without losing signal.

Own data quality

Add tests to most Dagster assets and make a deliberate choice for each one: does a failure block downstream assets, trigger an alert, or simply get logged?
Guarantee that refactors and migrations do not silently change the value of existing features.
When something breaks, investigate quickly, fix at the root, and leave behind a new test closer to the source of the problem so the class of bug cannot come back unnoticed.
Be a user of your own platform

You will also be responsible for a subset of our products: building their datasets, training ML models on them, configuring a decider and following it into production. The feedback you get as a user directly feeds the priorities of your pipeline work, and gives you concrete grounds to coordinate with the data scientist, the decider architect and the deployment engineer on cross-team improvements.

What success looks like after 12 months

The team trusts the datasets by default: if a model behaves oddly, the first hypothesis is no longer "maybe the pipeline is wrong".
At least one new external data source has been integrated end-to-end, from POC to being used in a production decider.
The most important internal data sources are transformed into features available to the data scientist.
Training datasets are noticeably smaller and faster to load, thanks to in-pipeline feature pre-selection, without measurable loss in model performance.
For the products you own, you have shipped at least one improved decider to production, and the lessons from that experience have shaped concrete improvements in the shared pipeline.

Benefits

Attractive compensation package, including stock options.
Fast-paced environment with significant growth opportunities.
15 annual vacation days + 7 annual personal days.
Option to work remotely 3-4 days per week, or fully remote (as long as you can come to CDMX about twice a year).
Flexible work schedule.
Data Pipeline Engineer, Origination Decisions
AVIVACREDITO
Ciudad de México, Ciudad de México
Posted 23 days ago