SEDIMARK D3.3 “Enabling tools for data interoperability, distributed data storage and training distributed AI models” is a report produced by the SEDIMARK project. It provides insights into the progress made on:
Management of interoperability in data flows through the proper definition and use of metadata and semantic technologies;
Distributed storage, considering the interoperability and scalability issues of such storage;
The use of distributed AI and analytics, with a particular focus on federated learning.
Data quality is considered to be of the highest importance for companies seeking to improve their decision-making systems and the efficiency of their products. In the current data-driven era, it is important to understand the effect that “dirty” or low-quality data (i.e., data that is inaccurate, incomplete, inconsistent, or contains errors) can have on a business. Manual data cleaning remains the most common way to process such data, accounting for more than 50% of knowledge workers’ time. SEDIMARK acknowledges the importance of data quality both for sharing data and for using/analysing it to extract knowledge and information for decision-making processes.
Common types of dirty or low-quality data include:
Missing Data: Incomplete datasets where certain values are not recorded.
Inaccurate Data: Data that contains errors, inaccuracies, or typos. This can happen due to manual data entry errors or system malfunctions.
Inconsistent Data: Data that is inconsistent across different sources or within the same dataset. For example, the same entity may be represented in different ways (e.g., "Mr. Smith" vs. "Smith, John").
Duplicate Data: Repetition of the same data in a dataset, which can distort analyses and lead to incorrect results.
Outliers: Data points that deviate significantly from the majority of the dataset, potentially skewing the analysis.
Bias: Data that reflects systematic errors due to a particular group's overrepresentation or underrepresentation in the dataset.
Unstructured Data: Data that lacks a predefined data model or organization, making it difficult to analyse.
Thus, one of the main work items of SEDIMARK is to develop a usable data processing pipeline that assesses and improves the quality of data generated and shared by the SEDIMARK data providers.
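To make this concrete, the short Python sketch below (an illustration only, not the SEDIMARK pipeline itself; the data and column names are invented) shows how some of the issues listed above can be flagged with pandas:

```python
import pandas as pd

# Hypothetical sensor readings; the data and column names are invented.
df = pd.DataFrame({
    "sensor_id": ["s1", "s1", "s2", "s2", "s3"],
    "temperature": [21.3, 21.3, None, 19.8, 95.0],
})

# Missing data: count absent values per column.
print(df.isna().sum())

# Duplicate data: rows repeated verbatim.
print(df.duplicated().sum())

# Outliers: flag values outside 1.5 * IQR (a common rule of thumb).
t = df["temperature"].dropna()
q1, q3 = t.quantile([0.25, 0.75])
iqr = q3 - q1
print(t[(t < q1 - 1.5 * iqr) | (t > q3 + 1.5 * iqr)])
```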
"Much as steam engines energised the Industrial Age, recommendation engines are the prime movers digitally driving 21st-century advice worldwide"
Recommender systems are a prime technology for improving business in the digital world, and the examples are endless: movie recommendations on Netflix, video recommendations on YouTube, playlist recommendations on Spotify, route recommendations in Google Maps.
Recommender systems are a class of Artificial Intelligence tools designed to learn user preferences from vast amounts of data, including but not limited to user-item interactions and past user behaviour. They can deliver individualised, tailor-made recommendations that enhance the user experience while boosting business revenue.
In SEDIMARK our goal is not only to provide a secure data-sharing platform, but also to enhance the experience of our users through the use of Recommender Systems at multiple levels:
Navigating the vast amount of datasets available within the platform
Suggesting ready-made AI models relevant to the given dataset
Suggesting computational services capable of handling the given dataset
Recommending Datasets:
SEDIMARK is a platform offering vast amounts of data available for purchase in a secure way. To enhance the user experience, the team at UCD is developing cutting-edge recommender systems specifically targeted at dataset recommendation. The system will leverage past user behaviour, such as previous purchases, browsing history, and the behaviour of other similar users.
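As a rough illustration of the signal such a system exploits (a minimal sketch, not the actual UCD implementation; the interaction matrix is invented), a simple user-based collaborative filter can score unseen datasets by the behaviour of similar users:

```python
import numpy as np

# Hypothetical user-dataset interactions (1 = purchased/browsed).
# Rows are users, columns are datasets; all values are invented.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def recommend(user, k=2):
    """Score unseen datasets by similarity-weighted behaviour of other users."""
    sims = np.array([cosine_sim(interactions[user], u) for u in interactions])
    sims[user] = 0.0                              # ignore the user themselves
    scores = sims @ interactions                  # aggregate neighbours' choices
    scores[interactions[user] > 0] = -np.inf      # hide already-seen datasets
    return np.argsort(scores)[::-1][:k]

print(recommend(0))  # indices of datasets suggested for user 0
```

Production systems typically layer implicit-feedback weighting, cold-start handling and scalability measures on top of this basic idea.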
Recommending Ready-made AI Models:
Apart from datasets, SEDIMARK will also offer ready-made AI models capable of extracting information from a specific dataset, such as a weather-forecasting model that learns from weather data collected by sensors. The recommender system in this category will suggest the most relevant AI models for a given dataset, allowing users to fully explore the potential of a purchased dataset within the SEDIMARK platform.
Recommending Computational Services:
Depending on the size of the purchased dataset and the complexity of the AI model, SEDIMARK will also aim to suggest appropriate computational services, available within the SEDIMARK platform, that can carry out the learning process.
Using a single platform, SEDIMARK users will be able to perform data discovery, gain insights from a given dataset using AI models, and utilise computational services.
In striving to achieve interoperability in information models, it is important to relate the concepts we define in our models to others that have already been developed and gained popularity. This can be done through reuse, inheritance, or an explicit relationship with the concept in question. There are many approaches to ontology development, but Protégé has by far been the de facto tool. Reusing concepts from another ontology in Protégé involves importing the relevant axioms from the source ontology into your current ontology. Here is a quick tutorial on how to achieve this:
1. Open your ontology
Start Protégé and open the ontology in which you want to reuse concepts.
2. Open the ontology to be reused in the same window
3. Reuse specific concepts
Protégé provides tools like ‘Refactor’ to copy or move axioms between ontologies.
You can select specific classes, properties, and axioms to import into your ontology.
Select ‘Axioms by reference’, which will include the other relationships involving the concept to be reused.
Select the concepts required:
Check that the other axioms to be included are relevant.
Select your ontology as the target.
4. Manage Namespaces
Ensure that the namespaces for the reused concepts are correctly managed in your ontology to avoid conflicts.
5. Save Changes
After reusing the desired concepts, save your ontology to apply the changes.
Note that reusing axioms from another ontology may result in new logical consequences; therefore, the consistency and correctness of the developed ontology should be validated after the reuse process.
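For those who prefer working outside the Protégé GUI, the same kind of reuse can be expressed programmatically. The sketch below is only an assumption about tooling (Python with rdflib, reusing a concept from the W3C SOSA ontology; SEDIMARK’s actual vocabularies may differ) and declares a local class as a subclass of the external concept while keeping the external IRI intact:

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

g = Graph()
MY = Namespace("http://example.org/my-ontology#")   # illustrative namespace
SOSA = Namespace("http://www.w3.org/ns/sosa/")      # W3C SOSA ontology

# Bind prefixes so the serialized ontology manages namespaces cleanly.
g.bind("my", MY)
g.bind("sosa", SOSA)

# Declare a local class and relate it to the reused external concept.
g.add((MY.TemperatureSensor, RDF.type, OWL.Class))
g.add((MY.TemperatureSensor, RDFS.subClassOf, SOSA.Sensor))

print(g.serialize(format="turtle"))
```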
This report represents the first version of SEDIMARK’s approach to its data sharing platform, the main entry point to the system from the outside world. Hence, it covers not only the front-end that users will interact with, but also added features such as the Recommender system and the Open Data enabler, which are at the essence of the solution. Given the current stage of project execution, the contents presented here will evolve, and a new version of the SEDIMARK data sharing platform will be provided in Month 34 (July 2025) in Deliverable 4.6 (Data sharing platform and incentives. Final version). This document therefore does not offer a fully functional depiction of the platform, but a high-level presentation of its constitutive components. The Marketplace front-end is described at that level only, while the Recommender system and the Open Data enabler are also described from a back-end perspective. The document is thus intended mainly for members of the project consortium, to serve as a template to drive specific technical activities in other work packages within SEDIMARK.
In response to the growing demand for secure and transparent data exchange, the infrastructure of the SEDIMARK Marketplace leverages cutting-edge technologies to establish a resilient network.
This deliverable presents the first version of the decentralized infrastructure and access management mechanisms implemented in the SEDIMARK Marketplace. As the landscape of data exchange evolves, the decentralization approach ensures increased security, transparency, and user-centric control over both data assets and user identity information. Data providers on the SEDIMARK Marketplace can also offer additional types of assets related to their data, such as Machine Learning (ML) models, data processing pipelines, and tools.
Operating within the principles of decentralization, this project addresses the growing need for secure and transparent data exchange in a globalized digital economy. The SEDIMARK Marketplace leverages distributed ledger technologies to establish a resilient and scalable infrastructure. The decentralized architecture of the marketplace is built on a robust distributed ledger employed for user identity management, as well as a blockchain foundation fostering tamper-resistant contracts. By utilizing a distributed network, the infrastructure eliminates single points of failure, enhancing reliability and ensuring the continued availability of the assets to be exchanged. The decentralized infrastructure supports standardized protocols for data exchange, enabling collaboration and data sharing across various platforms and participants.
This deliverable is the first capstone in the SEDIMARK project, realizing the underlying infrastructure and mechanisms that allow the fulfilment of the functionalities defined for the Marketplace. An updated version of this deliverable will be provided in Deliverable SEDIMARK_D4.2 in July 2025.
This document is the first deliverable from WP3, aiming to provide a first draft of the SEDIMARK data quality pipeline. It details how the pipeline aims to improve the quality of datasets shared through the marketplace while also addressing energy efficiency in the data value chain. As the first version of the deliverable, it presents the initial ideas and the initial implementation of the respective tools and techniques; an updated version with the final data quality pipeline will be delivered in M34 (July 2025). The document should therefore be considered a “live” document that will be continuously updated and improved as the technical development of the data quality tools evolves.
The main goal of this document is to discuss how data quality is seen in SEDIMARK, which metrics are defined to assess the quality of data generated by data providers, and which techniques will be provided for improving that quality before sharing on the data marketplace. This will help data providers both to optimise the decision-making systems and Machine Learning (ML) models they train using their datasets, and to increase their revenues by selling datasets of higher quality and thus higher value. Regarding the first point, it is well documented that low-quality data has a significant impact on business, with reports showing a yearly cost of around 3 trillion USD and knowledge workers wasting 50% of their time searching for and correcting dirty data [1]. It is evident that data providers will hugely benefit from automated tools that help them improve their data quality, either without any human involvement or with minimal human intervention and configuration.
The document presents high-level descriptions of the concepts and tools developed for the data quality pipeline and the energy efficiency methods for reducing its environmental cost, as well as concrete technical details about the implementation of those tools. It can thus be considered both a high-level and a technical document, targeting a wide audience. Primarily, it targets the SEDIMARK consortium, discussing the technical implementations and the initial ideas behind them, so that the remaining technical tasks can draw on them when integrating all the components into a single SEDIMARK platform. The document also targets the scientific and research community, since it presents new ideas about data quality and shows how the developed tools can help researchers and scientists improve the quality of the data they use in their research or applications. Similarly, the industrial community can leverage the project tools to improve the quality of their datasets and assess how the results on energy efficiency can reduce the energy consumption of their data processing pipelines. Moreover, EU initiatives and other research projects should consider the contents of the deliverable in order to derive common concepts for data quality and for reducing energy consumption in data pipelines.
Machine Learning (ML) algorithms have demonstrated remarkable advancements across diverse fields, evolving to become more intricate and data-intensive. This evolution is driven in particular by the expanding size of datasets and the ever-growing nature of data streams. However, this substantial progress has come at the cost of intensified energy consumption, underscoring the urgent need for resource-efficient methodologies. It is thus crucial to balance computational demands and model performance to mitigate the escalating environmental impact of the energy-intensive nature of machine learning processes.
Enhancing data efficiency stands as a central strategy in SEDIMARK to manage the considerable energy needs inherent in machine learning algorithms. SEDIMARK aims to achieve resource and energy efficiency during the training of ML models by reducing the quantity of data needed without compromising performance. To accomplish this, SEDIMARK will use summarization techniques in conjunction with ML algorithms, including but not limited to dimension reduction, sampling, and other reduction strategies.
In the SEDIMARK AI pipeline, dimension reduction techniques play a crucial role in mitigating resource consumption. By reducing the number of features, both computational complexity and memory requirements can be substantially lowered. Furthermore, removing irrelevant features can enhance overall model performance. The two main strategies within dimension reduction are feature selection and feature extraction: the former selects a subset of the input features, while the latter constructs a new set of features in a lower-dimensional space from the given input features. This dual approach ensures a nuanced and effective reduction in the data footprint, contributing significantly to the overall goal of resource and energy efficiency in SEDIMARK.
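To make the distinction concrete, the following scikit-learn sketch (purely illustrative, not a SEDIMARK component; it uses the public Iris dataset) contrasts the two strategies:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# Feature selection: keep a subset of the original features,
# ranked here by a univariate ANOVA F-score.
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build new features in a lower-dimensional
# space as linear combinations of the originals.
X_ext = PCA(n_components=2).fit_transform(X)

print(X.shape, X_sel.shape, X_ext.shape)  # (150, 4) (150, 2) (150, 2)
```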
Sampling is another effective strategy for resource-efficient machine learning. Instead of analyzing the entire dataset or maintaining a whole data stream, algorithms operate on a representative subset (or a sliding window for data streams). This approach is particularly useful for large datasets where processing the entire set is impractical.
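One classic way to maintain such a representative subset over a stream of unknown length is reservoir sampling; the sketch below (an illustration, not a SEDIMARK component) keeps a fixed-size uniform sample:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each arriving item replaces a kept one with probability k/(i+1),
            # which keeps every item equally likely to end up in the sample.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```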
Resource-efficient machine learning is not just a practical necessity but a crucial avenue for sustainable and scalable model development. By strategically employing dimension reduction, sampling, coresets, data distillation and other summarization techniques, the ML models will be computationally frugal, making them particularly suitable for deployment on devices with limited processing capabilities, such as edge and IoT devices. SEDIMARK can thus strike a balance between computational efficiency and model accuracy. As machine learning evolves, these optimization strategies will play an increasingly vital role in ensuring that advanced algorithms remain accessible and practical in real-world applications.
The document is the first deliverable of WP5 and reports the results of T5.1 activities aimed at recommending an evaluation methodology, performance metrics, and a timetable for the integration of the SEDIMARK platform in line with the principles of decentralization, trustworthiness, intelligence, data quality, and interoperability. This deliverable is important because it defines the evaluation methodology, the monitoring approach, and the efficiency of what is being built, as well as the system validation through real pilot demonstrations. In order to assess the framework's capabilities from various user perspectives, the developed methodology adapts multiple quality factors implemented using technical metrics.
Before delving into the core of the deliverable, the document briefly describes the vision of the SEDIMARK marketplace, in which participants will exchange assets in a secure decentralized manner. In SEDIMARK D2.2, the architecture’s components were thoroughly examined. To create the overall decentralized solution, the integration activities are based on those components and tools under a standard development framework.
All technology providers are accountable for the various modules to which they are assigned, following the top-down integration plan outlined in this document. Some architecture components are not included in the first version of the platform because they are part of the platform's second and final releases. The initial release focuses on delivering the minimum functionality required for a minimum viable product. The integration plan is built upon the use case scenarios defined in T2.1 and SEDIMARK D2.1 and the timeline for executing those scenarios. The components are integrated using Virtual Machines (VMs), Docker containers, and other orchestration tools.
This deliverable also specifies a customized evaluation process as well as the numerous criteria to be employed in it. These comprise technical criteria tailored to each evaluated technique/module, general criteria/KPIs tailored to each use case, and a metrics framework based on ISO/IEC established methods for system and product quality assessment. Standardization lends the procedures security and compatibility. The framework will begin with the establishment of a comprehensive and meaningful set of performance metrics based on stakeholder requirements and use cases. As a reminder, SEDIMARK encompasses four main use cases at different sites: Mobility Digital Twin (Finland), Urban Bike Mobility Planning (Spain), Valorisation of Energy Consumption and Customer Reactions/Complaints (Greece), and Valuation and Commercialization of Water Data (France).
Nowadays, users register for a service and, usually, the service itself stores the user's data – the identity. Today the majority of online services are centralized and rely, in some form, on a single authority for identity management. SEDIMARK, instead, aims to be a fully decentralized data Marketplace.
This architectural choice also has consequences for the management of the users belonging to the system. With decentralization in mind, SEDIMARK adopts a new model for identity: Self-Sovereign Identity (SSI).
SSI
SSI is a digital identity model that gives the user who creates it full control over his or her identity and over the information to be shared.
The SSI model is rooted in the Decentralized Identity paradigm: it is the users themselves – the Holders of the identity – who own a unique identity composed of a set of attributes.
The attributes are released and associated with the identity by other entities – the Issuers of such claims.
These claims can be checked by other entities – called Verifiers. As an example, imagine a new graduate from a university. His or her digital identity may contain a claim “Graduated” issued by the university. A future employer who wants to check this information acts as the Verifier.
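To ground the example, the “Graduated” claim might travel as something like the following simplified W3C Verifiable Credential, shown here as a Python dictionary (all identifiers are invented placeholders and the cryptographic proof is elided):

```python
# Simplified W3C Verifiable Credential for the "Graduated" example.
# All identifiers and values below are illustrative placeholders.
credential = {
    "@context": ["https://www.w3.org/2018/credentials/v1"],
    "type": ["VerifiableCredential", "GraduationCredential"],
    "issuer": "did:example:university-123",          # the Issuer (university)
    "credentialSubject": {
        "id": "did:example:holder-456",              # the Holder (graduate)
        "degree": "MSc Computer Science",
    },
    # A real credential carries a cryptographic proof the Verifier checks:
    # "proof": {...}
}
```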
SSI in practice
SSI? Never heard of it!
Yes, SSI is a relatively new concept in the field of digital identity. It is an emerging technology relying on blockchain and other distributed ledgers, which are in turn still evolving. Embracing and implementing these new identity systems is a process that requires time…
…But things are moving forward!
Microsoft has recently released a new product called Microsoft Entra Verified ID that employs decentralized identity.
The European Union is also steering EU citizens' identities towards a model where users have full control of their data, with the European Digital Identity Wallet.
SSI in SEDIMARK
SEDIMARK will deploy its own custom SSI framework relying on the IOTA Tangle DLT.
The users of the marketplace will have full control over their digital identity, allowing them to preserve and maintain their privacy. Users can create and manage their own identities without relying on a central authority.
Moreover, thanks to SSI, authentication and authorization policies can be enforced with more granular control. For example, a data provider can verify who is authorized to receive its data, limiting access to only a certain group.
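As a toy illustration of such granular control (invented logic and field names, not the SEDIMARK implementation), a provider-side check might gate access on the issuer and a group claim in the consumer's presented credential:

```python
# Toy provider-side authorization check; all identifiers are invented.
TRUSTED_ISSUERS = {"did:example:consortium-authority"}
AUTHORIZED_GROUPS = {"weather-data-consumers"}

def may_receive_data(credential: dict) -> bool:
    """Grant access only to holders of a trusted, in-group credential."""
    issuer_ok = credential.get("issuer") in TRUSTED_ISSUERS
    group = credential.get("credentialSubject", {}).get("group")
    return issuer_ok and group in AUTHORIZED_GROUPS

print(may_receive_data({
    "issuer": "did:example:consortium-authority",
    "credentialSubject": {"group": "weather-data-consumers"},
}))  # True
```

In practice the Verifier would first validate the credential's cryptographic proof against the DLT before trusting any of its fields.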
Do you want to know more? Stay tuned for next blog posts by signing up to our newsletter below.