YOUR MISSION
- Own the data warehousing and data harvesting pipeline
- Discover and acquire additional datasets for Machine Translation by researching open- or closed-source data and/or web crawling
- Improve data pipelining in the Machine Translation team
- Ensure data and model scalability
YOUR RESPONSIBILITIES
- Design and implement scalable processes (think terabytes of text) for storing, versioning, and documenting data
- Develop automatic data harvesting services for monolingual and bilingual textual data
- Architect, develop and maintain data web services for consuming the harvested data
- Build and automate data preprocessing pipelines for the training of Machine Translation models
- Introduce data pipelining solutions for data preprocessing needs using technologies like Airflow, Luigi, or similar
- Work day-to-day with researchers to improve data collection processes
- Prepare data for Machine Translation model training
YOUR PROFILE
- You have 2+ years of experience as a Data Engineer, Python Engineer, or in a related role
- Strong software engineering experience with an eye for clean, future-proof code
- Strong data wrangling skills, including extracting, transforming, cleaning, and augmenting data, as well as standardizing data formats
- Comfortable with Linux/Unix tools and bash
- Experience working in the cloud (e.g. GCP, Azure, AWS)
- Experience with productionization tools such as Docker, Jenkins, GitHub Actions, and Kubernetes
- Experience with Machine Learning and/or Data Science is a plus, but not required
- You have solid communication skills and experience interacting with stakeholders from a multidisciplinary team
- You are fluent in English and feel comfortable in a fast-paced and international environment