Data Engineering
Job description
Data acquisition
- Develop and run scalable infrastructure for acquiring massive-scale audio (sound and music) and multimodal video-audio datasets
- Coordinate data transfers from licensing partners and turn heterogeneous sources into training-ready datasets
Annotation and data quality
- Obtain detailed annotations for audio and video data (descriptions, musical attributes, audio attributes, …)
- Use state-of-the-art ML models for data cleaning, processing and filtering
- Ensure data quality through automated tools and manual evaluation studies
- Build scalable tools to analyze our datasets (compute statistics, create visualizations, …)
Efficient workflows and collaboration
- Optimize and parallelize data processing workflows to handle massive-scale datasets efficiently across both CPUs and GPUs
- Work directly in the model development loop, updating datasets as training trajectories reveal what we're missing
Requirements
- Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
- Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
- Experience with audio and video processing libraries (ffmpeg, …) and an understanding of their performance characteristics
- Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs
- Knowledge of machine learning techniques for data cleaning and preprocessing
- Have built or contributed to large-scale data acquisition systems and understand the operational challenges
- Have implemented data processing and cleaning pipelines at scale
- Familiarity with audio and video annotation processes for ML and experience with the specifics of audio data
- Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes