Data Engineering

Mirelo AI

3 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Tech stack

Amazon Web Services

Microsoft Azure

Data Cleansing

Information Engineering

File Systems

Distributed Data Store

FFmpeg

Machine Learning

Video Editing

Data Processing

Graphics Processing Unit (GPU)

Slurm

Job description

Data acquisition

Develop and run scalable infrastructure for acquiring massive-scale audio (sound and music) and multimodal video-audio datasets
Coordinate data transfers from licensing partners and turn heterogeneous sources into training-ready datasets

Annotation and data quality

Obtain detailed annotations for audio and video data (descriptions, musical attributes, audio attributes, …)
Use state-of-the-art ML models for data cleaning, processing and filtering
Ensure data quality through automated tools and manual evaluation studies
Build scalable tools to analyze our datasets (compute statistics, create visualizations, …)

Efficient workflows and collaboration

Optimize and parallelize data processing workflows to handle massive-scale datasets efficiently across both CPUs and GPUs
Work directly in the model development loop, updating datasets as training trajectories reveal what we're missing

Requirements

Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis

Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
Experience with audio and video processing libraries (ffmpeg, …) and an understanding of their performance characteristics
Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs
Knowledge of machine learning techniques for data cleaning and preprocessing

Have built or contributed to large-scale data acquisition systems and understand the operational challenges
Have implemented data processing and cleaning pipelines at scale
Familiarity with audio and video annotation processes for ML and experience with the specifics of audio data
Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes