D3.3 Enabling tools for data interoperability, distributed data storage and training distributed AI models. First version

SEDIMARK · January 18, 2024

SEDIMARK D3.3 “Enabling tools for data interoperability, distributed data storage and training distributed AI models” is a report produced by the SEDIMARK project. It reports on progress made in three areas:

Management of interoperability in data flows through the proper definition and use of metadata and semantic technologies.
Distributed storage, taking into account the interoperability and scalability issues of such storage.
Distributed AI and analytics, with a particular focus on federated learning.

Data quality is considered to be of the highest importance for companies seeking to improve their decision-making systems and the efficiency of their products. In the current data-driven era, it is important to understand the effect that “dirty” or low-quality data (i.e., data that is inaccurate, incomplete, inconsistent, or contains errors) can have on a business. Data is still commonly cleaned by hand, a task that accounts for more than 50% of knowledge workers' time. SEDIMARK acknowledges the importance of data quality both for sharing data and for using and analysing it to extract knowledge and information for decision-making processes.

Common types of dirty or low-quality data include:

Missing Data: Incomplete datasets where certain values are not recorded.
Inaccurate Data: Data that contains errors, inaccuracies, or typos. This can happen due to manual data entry errors or system malfunctions.
Inconsistent Data: Data that is inconsistent across different sources or within the same dataset. For example, the same entity may be represented in different ways (e.g., "Mr. Smith" vs. "Smith, John").
Duplicate Data: Repetition of the same data in a dataset, which can distort analyses and lead to incorrect results.
Outliers: Data points that deviate significantly from the majority of the dataset, potentially skewing the analysis.
Bias: Data that reflects systematic errors due to a particular group's overrepresentation or underrepresentation in the dataset.
Unstructured Data: Data that lacks a predefined data model or organization, making it difficult to analyse.

Thus, one of the main work items of SEDIMARK is to develop a usable data processing pipeline that assesses and improves the quality of data generated and shared by the SEDIMARK data providers.
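
As an illustration of the assessment step, the following is a minimal sketch of how three of the categories listed above (missing data, duplicate data and outliers) could be counted in a small set of records. It is not the SEDIMARK pipeline: the function name profile_quality, the record layout, the sample values and the choice of a median-based outlier rule with a threshold of 3.5 are all assumptions made for this example.

```python
from statistics import median


def profile_quality(rows, numeric_field, threshold=3.5):
    """Count missing values, exact duplicates and outliers in a list of dict records.

    Hypothetical helper for illustration only; not part of any SEDIMARK tool.
    """
    # Missing data: any field whose value is None or an empty string.
    missing = sum(1 for row in rows for v in row.values() if v is None or v == "")

    # Duplicate data: records whose full content repeats an earlier record.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    # Outliers: modified z-score based on the median absolute deviation (MAD).
    # The 0.6745 constant and the 3.5 threshold are a common convention,
    # assumed here rather than taken from the deliverable.
    values = [
        row[numeric_field]
        for row in rows
        if isinstance(row.get(numeric_field), (int, float))
    ]
    outliers = []
    if values:
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad > 0:
            outliers = [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


# Made-up sample records, not project data.
sample = [
    {"sensor": "a", "temp": 20.1},
    {"sensor": "b", "temp": 20.4},
    {"sensor": "b", "temp": 20.4},  # exact duplicate of the previous record
    {"sensor": "c", "temp": None},  # missing value
    {"sensor": "d", "temp": 19.8},
    {"sensor": "e", "temp": 95.0},  # far from the other readings
]
report = profile_quality(sample, "temp")
print(report)
```

A median-based rule is used here because a single extreme reading inflates the mean and standard deviation and can hide itself from a classic z-score test on small samples. The other categories in the list, such as inconsistency and bias, generally need domain knowledge and are not covered by this sketch.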

This document, along with all the public deliverables and documents produced by SEDIMARK, can be found in the Publications & Resources section.
