Senior Data Scientist (Metadata & AI Hydration) | Global Media Services - Open LATAM, Canada and Europe #2981

The Mission
Help unlock and transform 25 years of world-class legacy journalism for the AI era. You'll be embedded in our Global Media Services team, acting as the primary data architect and technical consultant to build a unified semantic foundation. Your mission is to ingest, clean, and enrich a massive historical archive, making decades of unstructured text deeply searchable and ready for next-generation AI products and high-value LLM licensing deals.

The Role
As the vital link between raw data and advanced search infrastructure, you'll spend your time designing custom semantic taxonomies and building highly scalable NLP pipelines. This is a highly collaborative role in which you will operate within a tightly aligned engineering triad, interfacing directly with editorial stakeholders, Search Engineers, and Backend AI Engineers.

Strategic Taxonomy Design: Interface directly with editorial and product leaders across multiple publication brands to translate commercial and discovery goals into actionable, unified metadata schemas.
Pipeline Engineering & Hydration: Architect and deploy scalable ML/NLP pipelines to process legacy HTML and unstructured text, automating named entity recognition (NER), thematic categorization, and sentiment analysis at scale.
Embedding Strategy Ownership: Evaluate, select, and fine-tune text embedding models tailored specifically for dense, high-quality journalistic content.
Cross-Functional Collaboration: Partner with Enterprise Search Engineers to map hydration schemas to OpenSearch cluster constraints, and work with Backend Engineers to ensure data structures support multi-modal retrieval (Vector, Hybrid, and Graph).
AI ROI Optimization: Actively balance model accuracy with computational cost, selecting cost-optimized LLMs and smaller open-source models for high-velocity, batch-processing data extraction.

The Must-Haves
Experience: 5+ years of experience in Data Science, Machine Learning, or NLP engineering, specifically focused on text analytics and processing large-scale unstructured data sets.
NLP & ML Frameworks: Hands-on expertise with NLP frameworks (spaCy, NLTK, Transformers) and deploying smaller, open-source LLMs for automated data extraction.
Data Engineering Stack: Deep proficiency in Python (Pandas, PySpark, etc.) and building robust ETL pipelines capable of cleaning "dirty" legacy historical data.
Search Infrastructure: Strong foundational knowledge of OpenSearch or Elasticsearch, particularly around vector dimensions, index mapping, and search relevance.
Consultative Communication: Exceptional communication skills, with the ability to lead technical discovery sessions with non-technical business leaders and mediate competing requirements across different internal brands.
Commercial Mindset: A strong focus on AI ROI, knowing how to balance model performance with the practical constraints of processing massive archives.

The Bonus
Media Industry Context: Prior experience building semantic tools, search architectures, or ontologies specifically for editorial, newsroom, or media environments.
Graph Databases: Exposure to graph-based data structures and relationship extraction models to power advanced discovery queries.
Advanced Retrieval: Hands-on experience designing complex hybrid search mechanisms (Lexical + Vector).

Editorial Empathy: You genuinely care about the value of journalistic archives and understand how content monetization goals map to technical metadata.
Analytical Rigor: You are obsessed with semantic accuracy and relationship extraction, ensuring that rich metadata maps cleanly to physical search constraints.
Triad-First Collaborator: You thrive in multi-disciplinary squads, seamlessly speaking the language of business stakeholders, search specialists, and backend application developers alike.
Ownership-Oriented: You take full accountability for the data enrichment strategy from architecture through ingestion, balancing long-term scalability with immediate product needs.
Cost-Conscious Architect: You look at 25 years of data and instinctively think about token optimization, compute cost, and pipeline efficiency without sacrificing quality.
X-TEAM
distrito federal, distrito federal
Posted 7 days ago