D3.1 Energy efficient AI-based toolset for improving data quality. First version

SEDIMARK · December 19, 2023

This document is the first deliverable from WP3 and provides a first draft of the SEDIMARK data quality pipeline. It details how the pipeline aims to improve the quality of datasets shared through the marketplace while also addressing energy efficiency in the data value chain. As the first version of the deliverable, it presents the initial ideas and initial implementation of the respective tools and techniques. An updated version with the final data quality pipeline will be delivered in M34 (July 2025). The document should therefore be considered a “live” document that will be continuously updated and improved as the technical development of the data quality tools evolves.

The main goal of this document is to discuss how data quality is seen in SEDIMARK, which metrics are defined to assess the quality of data generated by data providers, and which techniques will be provided to them for improving the quality of their data before sharing it on the data marketplace. This will help data providers both to optimise their decision-making systems for the Machine Learning (ML) models they train using their datasets and to increase their revenues by selling datasets of higher quality and thus higher value. Regarding the first point, it is well documented that low-quality data has a significant impact on business, with reports showing a yearly cost of around 3 trillion USD and knowledge workers wasting 50% of their time searching for and correcting dirty data [1]. Data providers will therefore benefit greatly from automated tools that help them improve their data quality, either without any human involvement or with minimal human intervention and configuration.

The document presents high-level descriptions of the concepts and tools developed for the data quality pipeline and of the energy efficiency methods for reducing its environmental cost, as well as concrete technical details about the implementation of those tools. It is therefore both a high-level and a technical document, targeting a wide audience. Primarily, the document targets the SEDIMARK consortium: it discusses the technical implementations and the initial ideas behind them, so that the other technical tasks can draw on them when integrating all the components into a single SEDIMARK platform. The document also targets the scientific and research community, since it presents new ideas about data quality and shows how the developed tools can help researchers and scientists improve the quality of the data they use in their research or applications. Similarly, the industrial community can leverage the project tools to improve the quality of their datasets, or assess how to exploit the energy efficiency results to reduce the energy consumption of their data processing pipelines. Finally, EU initiatives and other research projects should consider the contents of the deliverable in order to derive common concepts about data quality and about reducing energy consumption in data pipelines.

This document, along with all the public deliverables and documents produced by SEDIMARK, can be found in the Publications & Resources section.
