Data has become a new currency in today's data-driven economy. The EU has launched the EU data market monitoring tool, which continuously monitors the impact of the data economy in the member states. According to the tool, in 2021 there were more than 190,000 data supplier companies and more than 560,000 data users in the EU. Revenues of data companies stood at 84B euros in 2022, with forecasts of 114B euros in 2025 and 137B euros in 2030.
Data quality is a persistent issue in the ongoing digital transformation of European businesses. Some studies estimate that curating and cleaning data can take up as much as 80% of data professionals' time, which is time not spent on producing insights or actual products and services. The quality of the data used within a business is of the utmost importance: according to reports, bad or dirty data costs the US 3 trillion USD per year. For instance, many business datasets contain a high number of anomalies, duplicates, errors and missing values, all of which degrade the value of the data for making business decisions. Unhandled bias in the data, such as when a dataset fails to account properly for minority labels (e.g. for gender, age, origin, or any other label in the data), can result in machine learning models that are skewed towards unfair outcomes. Additionally, much of the existing data in business silos lacks the documentation and annotation that would allow professionals to leverage it properly in downstream decision-making tasks.
One of the main pillars of the SEDIMARK project is to promote the quality of the data that will be shared on the marketplace. SEDIMARK will build a complete data curation and quality improvement pipeline and offer it to data providers, so that they can assess the quality of their data and clean it. The pipeline will require minimal human intervention from domain experts to provide good results, but will also be fully customisable so that experts can unlock maximum performance. This will be achieved by exploring state-of-the-art techniques in Auto-ML (automated machine learning) and Meta-ML, both of which can be applied to transform the data with minimal human supervision by learning from previous tasks.
Additionally, the marketplace within SEDIMARK will prioritise and promote data providers who make the effort to curate their data before sharing it widely. SEDIMARK will implement a range of transparent data quality metrics that show statistics about the data, and these will be displayed side by side with the data offerings on the marketplace. This will help consumers find high-quality data and minimise the time they spend preprocessing it for their services. Moreover, an efficient recommender system within SEDIMARK will help consumers easily find high-quality, highly rated offerings within their domain.
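As a rough illustration of the kind of statistics such metrics could expose, the sketch below computes three simple measures over a small tabular dataset: completeness (the share of non-missing cells), duplicate ratio (the share of rows that repeat an earlier row) and minority share (the share of the least frequent label). These three measures, the function names and the sample records are assumptions made for illustration only; they are not the metrics defined by SEDIMARK.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def is_missing(value: Any) -> bool:
    """Treat None and empty strings as missing (an assumption for this sketch)."""
    return value is None or value == ""


def completeness(rows: List[Row], columns: List[str]) -> float:
    """Fraction of cells that hold a value."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for col in columns if not is_missing(row.get(col)))
    return filled / total


def duplicate_ratio(rows: List[Row], columns: List[str]) -> float:
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(rows)


def minority_share(rows: List[Row], label: str) -> Optional[float]:
    """Share of the least frequent label value, or None if no labels are present."""
    labels = [row[label] for row in rows if not is_missing(row.get(label))]
    if not labels:
        return None
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


def quality_report(rows: List[Row], columns: List[str], label: str) -> Dict[str, Optional[float]]:
    """Bundle the three illustrative metrics into one dictionary."""
    return {
        "completeness": completeness(rows, columns),
        "duplicate_ratio": duplicate_ratio(rows, columns),
        "minority_share": minority_share(rows, label),
    }


# Hypothetical sample records: one missing age, one duplicated row, one minority label.
rows = [
    {"id": 1, "age": 34, "gender": "F"},
    {"id": 2, "age": None, "gender": "M"},
    {"id": 3, "age": 51, "gender": "M"},
    {"id": 3, "age": 51, "gender": "M"},
    {"id": 4, "age": 29, "gender": "M"},
]
report = quality_report(rows, ["id", "age", "gender"], label="gender")
print(report)
```

For the sample records, 14 of the 15 cells are filled, one of the five rows is a repeat, and the least frequent gender label covers one of the five rows. A low minority share does not prove bias on its own, but it flags a label that may be under-represented and worth a closer look before the data is used for training.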
In conclusion, SEDIMARK aims to provide tools that improve data quality for both providers and consumers, boosting the data economy. At the same time, these tools will save data scientists significant time, allowing them to focus on extracting value from high-quality data and producing new products and services instead of spending the majority of their time cleaning the data themselves. The SEDIMARK team in the Insight Centre for Data Analytics at University College Dublin (UCD) builds on Insight’s data expertise and is leading the activities to define efficient metrics for data quality and to develop an automated and simplified data curation and quality improvement pipeline with which data providers and users can check and improve their datasets.