D3.2 - Energy efficient AI-based toolset for improving data quality. Final version

SEDIMARK · July 31, 2025
D3.2

Data quality is of the highest importance for companies seeking to improve their decision-making systems and the efficiency of their products. In the current data-driven era, it is important to understand the effect that “dirty” or low-quality data can have on a business. Data is still most commonly cleaned manually, a task that accounts for more than 50% of knowledge workers' time. SEDIMARK acknowledges the importance of data quality for both sharing and using data to extract knowledge and information for decision-making processes. Thus, one of the main goals of SEDIMARK is to develop a data processing pipeline that assesses and improves the quality of data generated and shared by the SEDIMARK data providers.


This deliverable presents the final version of the methods and techniques developed within SEDIMARK for processing data and improving their quality, extending the first version which was delivered in SEDIMARK Deliverable D3.1 [74]. The focus in this deliverable is to present the final version of the key techniques that are used for quality improvement of datasets, based on the requirements of the SEDIMARK platform so that they all work together smoothly.


SEDIMARK considers two main types of data generated and shared within the marketplace: (i) static/offline datasets and (ii) dynamic/streaming datasets. The project acknowledges that it is important to cater to both types equally, so in most scenarios separate, customised versions of the tools have been developed for static and streaming datasets. Techniques for outlier detection, noise removal, deduplication and imputation of missing values are important for improving the quality of datasets. These techniques aim to remove abnormal values or noise from the dataset, remove duplicate values, fill gaps in incomplete entries, or add entries that are missing entirely. Techniques for feature engineering, such as feature extraction and selection, have also been developed to enrich the datasets. Synthetic dataset creation is important in scenarios where data providers do not want to share their real datasets (e.g. for privacy reasons) but want to share synthetic versions that mimic the real ones.
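As a rough illustration of the cleaning steps named above (deduplication, outlier removal and imputation of missing values), the following minimal sketch processes a small static dataset. It is not the SEDIMARK toolset: the function name `clean_readings`, the `(timestamp, value)` row layout, the use of the median absolute deviation (MAD) for outlier detection and the 3.5 cut-off are all assumptions made for this example.

```python
from statistics import median


def clean_readings(rows, threshold=3.5):
    """Clean a list of (timestamp, value) tuples; value may be None (missing).

    Hypothetical helper for illustration only. Assumes at least one
    non-missing value is present.
    """
    # 1. Deduplication: keep the first occurrence of each identical row.
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)

    # 2. Outlier detection using the modified z-score based on the MAD.
    observed = [v for _, v in unique if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)

    def is_outlier(v):
        if mad == 0:
            return False  # no spread, so nothing can be flagged
        return 0.6745 * abs(v - med) / mad > threshold

    # 3. Imputation: fill missing values with the median of the inliers.
    fill = median(v for v in observed if not is_outlier(v))

    cleaned = []
    for t, v in unique:
        if v is None:
            cleaned.append((t, fill))      # impute the gap
        elif not is_outlier(v):
            cleaned.append((t, v))         # keep normal values
        # outliers are dropped
    return cleaned


rows = [(1, 10.0), (2, 11.0), (2, 11.0), (3, None), (4, 500.0), (5, 9.0), (6, 10.5)]
result = clean_readings(rows)
```

In this example the duplicated row at timestamp 2 is collapsed, the reading of 500.0 is dropped as an outlier, and the missing value at timestamp 3 is filled with the median of the remaining readings. A streaming variant would need to maintain these statistics incrementally (for example over a sliding window) rather than over the full dataset.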


This deliverable also presents the framework that orchestrates the whole functionality of the data processing pipeline, the Data Processing Orchestration. This component enables end users to interact with the built-in data processing solutions through a simplified dashboard interface. The deliverable further presents the final version of the key quality metrics that SEDIMARK has defined for assessing the quality of datasets, both per data point and for the dataset as a whole, together with the key techniques for dataset quality improvement, which are designed to meet SEDIMARK platform requirements and ensure seamless integration.


Another important part is the description of techniques for reducing the energy consumption of the components of the data processing pipeline and optimising data efficiency, i.e. techniques for data distillation, coreset selection and dimension reduction. Minimising the communication cost in distributed machine learning scenarios is also important for SEDIMARK, because communication can increase energy consumption. Techniques to optimise the Artificial Intelligence (AI) models during both training and inference are also presented, focusing on quantisation, pruning, low-rank factorisation and knowledge distillation.
Finally, considering that minimising energy consumption can influence performance or communication, the deliverable presents the final analysis on these trade-offs, aiming to provide insights to data providers on how to better configure the pipeline or what models they should select in order to achieve their targets (energy efficiency/performance/communication).
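To make the energy/performance trade-off more concrete, here is a minimal sketch of one of the model optimisation techniques mentioned above, quantisation. It shows symmetric per-tensor quantisation of floating-point weights to signed 8-bit integers and the reconstruction error this introduces. The function names and the choice of a symmetric 8-bit scheme are assumptions for illustration, not the method used in the deliverable.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate float weights from the integer representation."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Worst-case rounding error is half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, which reduces memory and data movement at the cost of a rounding error of at most half the scale factor per weight. Whether that error is acceptable depends on the model and task, which is the kind of trade-off the analysis in the deliverable addresses.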

The D3.2 deliverable can be downloaded from here.

