D3.2 - Energy efficient AI-based toolset for improving data quality. Final version

SEDIMARK · July 31, 2025

Data quality is of the highest importance for companies that want to improve their decision-making systems and the efficiency of their products. In the current data-driven era, it is important to understand the effect that “dirty” or low-quality data can have on a business. Data is still most commonly cleaned manually, a task that accounts for more than 50% of knowledge workers' time. SEDIMARK acknowledges the importance of data quality both for sharing data and for using it to extract knowledge and information for decision-making processes. Thus, one of the main goals of SEDIMARK is to develop a data processing pipeline that assesses and improves the quality of data generated and shared by the SEDIMARK data providers.

This deliverable presents the final version of the methods and techniques developed within SEDIMARK for processing data and improving their quality, extending the first version which was delivered in SEDIMARK Deliverable D3.1 [74]. The focus in this deliverable is to present the final version of the key techniques that are used for quality improvement of datasets, based on the requirements of the SEDIMARK platform so that they all work together smoothly.

SEDIMARK considers two main types of data generated and shared within the marketplace: (i) static/offline datasets and (ii) dynamic/streaming datasets. The project acknowledges that it is important to cater to both types equally, so in most scenarios separate, customised versions of the tools have been developed for static and streaming datasets. Techniques for outlier detection, noise removal, deduplication and imputation of missing values are important for improving the quality of datasets. These techniques aim to remove abnormal values or noise from the dataset, remove duplicate values, and fill gaps in partial entries or add entries that are missing entirely. Techniques for feature engineering, such as feature extraction and selection, have also been developed to enrich the datasets. Synthetic dataset creation is important in scenarios where data providers do not want to share their real datasets (e.g. for privacy reasons) but want to share synthetic versions that mimic the real ones.
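As a rough illustration of the cleaning steps named above (deduplication, imputation of missing values and outlier detection), the following minimal sketch applies them to a small static dataset. It is not the SEDIMARK toolset: the function names, the median-based imputation, the MAD-based outlier rule and the threshold of 3.5 are all assumptions chosen for brevity.

```python
from statistics import median


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation, MAD) exceeds `threshold`. The threshold is an assumed default."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (almost) identical, so nothing can be flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical sensor readings: (timestamp, temperature).
readings = [(1, 20.1), (2, 20.4), (2, 20.4), (3, None),
            (4, 20.2), (5, 95.0), (6, 19.9)]

unique_rows = deduplicate(readings)
temperatures = impute_median([temp for _, temp in unique_rows])
outlier_flags = flag_outliers(temperatures)
```

In this toy example the repeated reading at timestamp 2 is dropped, the gap at timestamp 3 is filled with the median temperature, and only the 95.0 reading is flagged. A streaming variant would have to maintain these statistics incrementally over a window rather than over the whole dataset.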

This deliverable also presents the framework that orchestrates the whole functionality of the data processing pipeline, the Data Processing Orchestration component. This component enables end users to interact with the built-in data processing solutions through a simplified dashboard interface. In addition, the deliverable presents the final version of the key quality metrics that SEDIMARK has defined for assessing the quality of datasets, both per data point and as a whole.

Another important part is the description of techniques for reducing the energy consumption of the components of the data processing pipeline and for optimising data efficiency, i.e. using techniques for data distillation, coreset selection and dimension reduction. Minimising the communication cost in distributed machine learning scenarios is also important for SEDIMARK, because communication can increase energy consumption. Techniques to optimise the Artificial Intelligence (AI) models both during training and inference are also presented, focusing on quantisation, pruning, low-rank factorisation and knowledge distillation.
Finally, considering that minimising energy consumption can influence performance or communication, the deliverable presents the final analysis on these trade-offs, aiming to provide insights to data providers on how to better configure the pipeline or what models they should select in order to achieve their targets (energy efficiency/performance/communication).
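To make two of the model optimisation techniques mentioned above more concrete, the sketch below shows magnitude-based pruning and symmetric 8-bit quantisation on a plain list of weights. It is only an illustration under stated assumptions: the function names are hypothetical, the weights are invented, and the deliverable's actual implementations may differ.

```python
def magnitude_prune(weights, sparsity):
    """Set the fraction `sparsity` of weights with the smallest magnitude to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_drop = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantise_int8(weights):
    """Symmetric linear quantisation to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Invented example weights.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]

pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale = quantise_int8(pruned)
restored = dequantise(codes, scale)
```

Pruning half of the weights here removes the three smallest in magnitude, and the quantised codes reproduce the remaining weights to within half a quantisation step. The trade-off discussed in the deliverable shows up even at this scale: higher sparsity or fewer bits reduces storage and compute but increases the approximation error.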

The D3.2 deliverable can be downloaded from here.
