
In striving to achieve interoperability in information models, it is important to relate the concepts we define in our models to others that have already been developed and have gained popularity. This can be done through reuse, inheritance, or an explicit relationship with the existing concept. There are many approaches to ontology development, but Protégé has by far been the de facto tool. Reusing concepts from another ontology in Protégé involves importing the relevant axioms from the source ontology into your current ontology. Here is a quick tutorial on how to achieve this:

1. Open your ontology

Start Protégé and open the ontology in which you want to reuse concepts.

[Screenshot: Reuse Ontology Protégé 1]

2. Open the ontology to be reused in the same window

[Screenshot: Reuse Ontology Protégé 2]
[Screenshot: Reuse Ontology Protégé 3]

3. Reuse specific concepts

Protégé provides tools like ‘Refactor’ to copy or move axioms between ontologies.

You can select specific classes, properties, and axioms to import into your ontology.

[Screenshot: Reuse Ontology Protégé 4]

Select ‘Axioms by reference’, which will also include other relationships involving the concept to be reused.

[Screenshot: Reuse Ontology Protégé 5]

Select the concepts required:

[Screenshot: Reuse Ontology Protégé 6]

Check that the other axioms to be included are relevant.

[Screenshot: Reuse Ontology Protégé 7]
[Screenshot: Reuse Ontology Protégé 8]

Select your ontology as the target.

[Screenshot: Reuse Ontology Protégé 9]

4. Manage Namespaces

Ensure that the namespaces for the reused concepts are correctly managed in your ontology to avoid conflicts.
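
The same idea can also be checked programmatically outside Protégé. Below is a minimal sketch using the Python rdflib library, where the prefix names and IRIs are purely illustrative: each reused vocabulary is bound to its own prefix, so terms from different ontologies remain unambiguous when the ontology is serialised.

```python
from rdflib import Graph, Namespace, RDF, RDFS

# Illustrative namespaces: our own ontology plus a reused vocabulary (SOSA).
MYONTO = Namespace("http://example.org/my-ontology#")
SOSA = Namespace("http://www.w3.org/ns/sosa/")

g = Graph()
# Bind each namespace to a distinct prefix so serialisations stay unambiguous.
g.bind("myonto", MYONTO)
g.bind("sosa", SOSA)

# Declare a local class and relate it explicitly to the reused concept.
g.add((MYONTO.WaterSensor, RDF.type, RDFS.Class))
g.add((MYONTO.WaterSensor, RDFS.subClassOf, SOSA.Sensor))

print(g.serialize(format="turtle"))
```

In Protégé itself, the equivalent is done through the ontology prefixes view, where each reused ontology keeps its own prefix.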

5. Save Changes

After reusing the desired concepts, save your ontology to apply the changes.

Note that reusing axioms from another ontology may introduce new logical consequences; therefore, the consistency and correctness of the developed ontology should be validated after the reuse process.
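
If you prefer to script this validation step, a reasoner can be run over the ontology after the reuse. The sketch below uses the Python owlready2 library (the file path is illustrative); owlready2 bundles the HermiT reasoner, which requires a Java runtime to be installed.

```python
from owlready2 import get_ontology, sync_reasoner, default_world

# Load the ontology that now contains the reused axioms (illustrative path).
onto = get_ontology("file:///path/to/my_ontology.owl").load()

# Run the bundled HermiT reasoner; problems surface as classes
# that become equivalent to owl:Nothing.
with onto:
    sync_reasoner()

unsatisfiable = list(default_world.inconsistent_classes())
if unsatisfiable:
    print("Unsatisfiable classes introduced by the reuse:", unsatisfiable)
else:
    print("No unsatisfiable classes found.")
```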

SEDIMARK does not focus on one particular domain. Rather, it intends to design and prototype a secure, decentralised and intelligent data and services marketplace that bridges remote data platforms and allows the efficient and privacy-preserving sharing of vast amounts of heterogeneous, high-quality, certified data and services in support of the common EU data spaces.

Four use cases are included in the project, and one of them, driven by EGM, will focus on ‘water data’ exploitation. SEDIMARK will make use of AI-based tools for data quality management, metadata management and semantic interoperability when gathering data. It will build upon the SEDIMARK decentralized infrastructure for handling security and privacy policies, and provide integrated services for validation, semantic enrichment and transformation of the data.

In our commitment to collaboratively advance water data management and utilization, SEDIMARK is proud to partner with the ICT4WATER cluster, comprising more than 60 pioneering highly digitized water-related projects. As part of this collaboration, SEDIMARK will actively contribute by sharing its latest advancements with the ICT4WATER ecosystem, fostering the creation of a decentralized and secure marketplace for data and services. Together, we aim to collectively drive innovation and sustainable solutions in the realm of water resource management supported by ICT tools.

In the modern digital age, ensuring seamless data management, storage, and retrieval is of utmost importance. Enter Distributed Storage Solutions (DSS), the backbone for businesses aiming for consistent data access, elasticity, and robustness. At the core of leveraging the full prowess of DSS is an element often overlooked - the orchestrator pipeline. Let’s dive deeper into why this component is the unsung hero of data management.

A Deep Dive into Distributed Storage

Rather than placing all eggs in one basket with a singular, centralized system, DSS prefers to spread them out. By scattering data over a multitude of devices, often geographically dispersed, DSS ensures data availability, even when individual systems falter. It's the vanguard of reliable storage for the modern enterprise.

Why the Orchestrator Pipeline Steals the Show

Imagine an orchestra without its conductor – chaotic, right? The orchestrator pipeline for DSS is much like that crucial conductor, ensuring every piece fits together in perfect harmony. Here's how it makes a difference in the realm of DSS:

  • The Automation Magic: Seamlessly manages data storage, retrieval, and flow across various nodes.
  • Master of Balancing: Channels data traffic efficiently, promoting top-tier performance with minimal lag.
  • Guardian Angel Protocols: Steps in to resurrect data during system failures, keeping business operations uninterrupted.
  • The Efficiency Maestro: Regularly gauges system efficiency, making on-the-fly tweaks for optimal functioning.
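
To make the balancing and failover roles listed above concrete, here is a deliberately simplified, hypothetical orchestrator sketch in Python: objects are placed on nodes by hashing their keys, each object is replicated, and reads fall back to a surviving replica when a node fails. It is only a conceptual illustration, not SEDIMARK's actual orchestrator.

```python
import hashlib

class StorageNode:
    """A toy storage node holding objects in memory."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.objects = {}

class Orchestrator:
    """Routes reads and writes across nodes, replicating each object."""
    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def _candidates(self, key):
        # Deterministic placement: hash the key and pick consecutive nodes.
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.replicas)]

    def put(self, key, value):
        for node in self._candidates(key):
            if node.alive:
                node.objects[key] = value

    def get(self, key):
        # Failover: try the primary first, then each replica.
        for node in self._candidates(key):
            if node.alive and key in node.objects:
                return node.objects[key]
        raise KeyError(f"{key} unavailable on all replicas")

nodes = [StorageNode(f"node-{i}") for i in range(4)]
orch = Orchestrator(nodes)
orch.put("sensor-42", {"temp": 18.5})
nodes[0].alive = False          # simulate a node failure
print(orch.get("sensor-42"))    # still served from a surviving replica
```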

Why Combine the Orchestrator with DSS?

There are four main reasons to combine the orchestrator with DSS:

  1. Trustworthy Operations: By streamlining and fine-tuning data tasks, it minimizes chances of human errors.
  2. Effortless Scaling: As data reservoirs expand, the orchestrator ensures DSS stretches comfortably, dodging manual hiccups.
  3. Resource Utilization at its Best: Champions the cause of optimal resource use, optimizing costs in the long run.
  4. Silky-smooth Functioning: System updates or maintenance? The orchestrator ensures no hitches, keeping operations smooth.

Final Thoughts

While DSS paints a compelling picture of modern data storage, the orchestrator pipeline is the brush that brings out its true colors, crafting an efficient, harmonious data masterpiece. In a world where data stands tall as a business linchpin, it's not just about storing it – it's about managing it with flair.

SEDIMARK recently participated in Data Week 2023 in Lulea, Sweden, which was organised by the Big Data Value Association (BDVA), a European initiative promoting data-driven digital transformation of society and the economy. SEDIMARK presented its work at a session organised by the Data Spaces Business Alliance (DSBA), an organisation promoting business transformation in the data economy.

The session, entitled “Data Management and Data Sharing for trusted AI platforms”, saw SEDIMARK present its concept alongside a diverse group of EU Horizon funded projects (Waterverse, STELAR, EnrichMyData and HPLT) also focussed on future tools for data management and quality control. The session pondered the question of how the tools and approaches developed within these projects will support the implementation and deployment of data-driven and trustworthy AI applications within data spaces.

A further aim of the session was to consider how the projects could make use of and contribute to a number of core common building blocks for data spaces outlined in a recent working document of the DSBA. The individual project presentations were followed by a lively panel discussion, in which these questions were further pursued.

* Image credit: BDVA Twitter account

SEDIMARK partners University of Surrey, INRIA and University College Dublin published a new work at the 2022 8th IEEE World Forum on IoT conference in Yokohama, Japan, on a privacy-preserving ontology, inspired by GDPR requirements, for semantically interoperable IoT data value chains. Check out the paper here.

Abstract

Testing and experimentation are crucial for promoting innovation and building systems that can evolve to meet high levels of service quality. IoT data that belong to users and from which their personal information can be inferred are frequently shared in the background of IoT systems with third parties for experimentation and building quality services. This sharing raises privacy concerns, as in most cases, the data are gathered and shared without the user's knowledge or explicit consent. With the introduction of GDPR, IoT systems and experimentation platforms that federate data from different deployments, testbeds, and data providers must be privacy-preserving. The wide adoption of IoT applications in scenarios ranging from smart cities to Industry 4.0 has raised concerns for the privacy of users' data collected using IoT devices. Inspired by the GDPR requirements, we propose an IoT ontology built using available standards that enhances privacy, enables semantic interoperability between IoT deployments, and supports the development of privacy-preserving experimental IoT applications. We also propose recommendations on how to efficiently use our ontology within an IoT testbed and federating platforms. Our ontology is validated for different quality assessment criteria using standard validation tools. We focus on “experimentation” without loss of generality because it covers scenarios from both research and industry that are directly linked with innovation.

Data has become a new currency in the current data-driven economy. The EU has launched the EU data market monitoring tool, which continuously monitors the impact of the data economy in the member states. The tool has identified that in 2021 there were more than 190,000 data supplier companies and more than 560,000 data users in the EU. As for the revenues of data companies, the 2022 figure stands at €84 billion, with forecasts of €114 billion in 2025 and €137 billion in 2030.

Data quality is a persistent issue in the ongoing digital transformation of European businesses. Some studies estimate that the curation and cleaning of data can take up as much as 80% of data professionals' time, which is time not spent on producing insights or actual products and services. The quality of data used within a business is of utmost importance: according to reports, bad or dirty data cost the US 3 trillion USD per year. For instance, many business datasets contain a high number of anomalies, duplicates, errors and missing values, which degrade the value of the data for making business decisions. Unhandled bias in the data, such as when a dataset fails to account properly for minority labels (e.g. gender, age, origin, or any other label in the data), can result in machine learning models that are skewed towards unfair outcomes. Additionally, much of the existing data in business silos lacks the proper documentation and annotation that would allow professionals to leverage it properly for downstream decision-making tasks.

One of the main pillars of the SEDIMARK project is to promote data quality for the data that will be shared on the marketplace. SEDIMARK will build a complete data curation and quality improvement pipeline that will be provided to data providers so that they can assess the quality of their data and clean it to improve that quality. This data curation pipeline will require minimal human intervention from domain experts to provide optimal results, but will also be fully customisable so that experts can unlock maximum performance. This will be achieved by exploring state-of-the-art techniques in Auto-ML (automated machine learning) and Meta-ML, both of which can be applied to transform the data with minimal human supervision by learning from previous tasks.
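
As a rough illustration of the kinds of steps such a pipeline automates, the sketch below uses pandas and scikit-learn to deduplicate a table, impute missing numeric values and flag outliers. All column names and thresholds are invented for the example; SEDIMARK's actual pipeline and its Auto-ML components are not shown here.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, impute numeric gaps, and drop likely outliers."""
    df = df.drop_duplicates()

    # Impute missing numeric values with per-column medians.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Flag anomalous rows with an Isolation Forest
    # (the contamination value is a rough guess for this toy data).
    iso = IsolationForest(contamination=0.2, random_state=0)
    inliers = iso.fit_predict(df[numeric_cols]) == 1
    return df[inliers]

# Toy example with a duplicate row, a missing value, and an extreme outlier.
raw = pd.DataFrame({"flow": [1.1, 1.2, 1.2, None, 95.0],
                    "ph":   [7.0, 7.1, 7.1, 7.2, 6.9]})
print(basic_clean(raw))
```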

Additionally, the marketplace within SEDIMARK will prioritise and promote data providers who undertake the effort to curate their data before sharing it widely. SEDIMARK will implement a range of transparent data quality metrics that will show statistics about the data, and these will be displayed side by side with the data offerings on the marketplace. This will help consumers find high-quality data and minimise the time they spend preprocessing data for their services. Moreover, an efficient recommender system within SEDIMARK will also help consumers easily find high-quality, highly rated offerings within their domain.
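
The exact metrics SEDIMARK will expose are still being defined. As a hedged illustration, the snippet below computes a few generic indicators (completeness, duplicate rate and a crude outlier share) of the kind that could accompany a dataset offering.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Return a few generic, dataset-level quality indicators."""
    numeric = df.select_dtypes(include="number")
    # Share of numeric cells more than 3 standard deviations from the column mean.
    zscores = (numeric - numeric.mean()).abs() / numeric.std(ddof=0)
    return {
        "completeness": 1.0 - df.isna().mean().mean(),   # share of non-missing cells
        "duplicate_rate": df.duplicated().mean(),        # share of duplicated rows
        "outlier_share": (zscores > 3).mean().mean(),    # crude z-score heuristic
        "n_rows": len(df),
    }

df = pd.read_csv("offering.csv")   # illustrative file name
print(quality_report(df))
```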

In conclusion, SEDIMARK aims to provide tools for improving data quality for both providers and consumers, boosting the data economy while saving data scientists significant time, allowing them to focus on extracting value from high-quality data and producing new products and services instead of spending the majority of their time cleaning the data themselves. The SEDIMARK team at the Insight Centre for Data Analytics of University College Dublin (UCD) builds on Insight's data expertise and is leading the activities to define efficient metrics for data quality and to develop an automated and simplified data curation and quality improvement pipeline for data providers and users to check and improve their datasets.

Data interoperability refers to the ability of information systems to exchange data and enable information sharing. More specifically, it is defined as the ability of systems and services that create, exchange, and consume data to have clear, shared expectations for the format, contents, context, and meaning of that data. It thus allows data from multiple sources and in diverse formats to be accessed and processed without losing meaning, and then integrated for mapping, visualization, and other forms of representation and analysis. Data interoperability enables people to find, explore, and understand the structure and content of heterogeneous data.

In this context, SEDIMARK aims to provide an enriched secure decentralized data and services marketplace where scattered data from various domains and geographical locations within the EU can be easily generated, cleaned, protected, discovered, and enriched with metadata, AI and analytics, and exploited for diverse business and research scenarios. SEDIMARK involves a combination of heterogeneous data, and achieving data interoperability will allow it to maximize the value of the data and overcome the significant challenges posed by distributed assets (heterogeneity, data formats, sources, etc.). For this to happen, SEDIMARK will reuse the semantic models developed in previous and ongoing EU initiatives, such as Gaia-X, IDS and NGSI-LD, and propose extensions to them to create one generic semantic model able to annotate and enrich heterogeneous data from multiple domains semantically.
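
To give a flavour of what such semantic annotation can look like in practice, the following is a minimal NGSI-LD style entity written as a Python dictionary. The identifier, entity type and property names are invented for the example and do not represent SEDIMARK's final semantic model.

```python
import json

# A hypothetical water-quality observation annotated as an NGSI-LD entity.
entity = {
    "id": "urn:ngsi-ld:WaterQualityObserved:example-001",
    "type": "WaterQualityObserved",
    "temperature": {
        "type": "Property",
        "value": 18.5,
        "unitCode": "CEL",                      # UN/CEFACT code for degrees Celsius
        "observedAt": "2023-05-01T10:00:00Z",
    },
    "location": {
        "type": "GeoProperty",
        "value": {"type": "Point", "coordinates": [5.05, 43.34]},
    },
    "@context": [
        "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"
    ],
}

print(json.dumps(entity, indent=2))
```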

Besides data, interoperability between the AI models that emerge from this data is of great interest. In the decentralized environment of SEDIMARK, decentralized training requires that users train their models locally and then exchange model weights to jointly learn a global model. It is unrealistic to expect that all SEDIMARK users will use exactly the same machine learning platform and the same hardware to train their models. Therefore, SEDIMARK models will be agnostic to the underlying platform, and SEDIMARK will provide tools to convert models to various formats so that they can run on machines of varying capabilities and on various platforms.
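
The concrete conversion tooling in SEDIMARK is not detailed here, but ONNX is one widely used interchange format for this purpose. As a hedged example, the sketch below exports a scikit-learn model to ONNX with the skl2onnx package so that it can later be executed with an ONNX runtime on a different machine or platform.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Train a small model locally (stand-in for a model trained by one participant).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Convert it to the platform-neutral ONNX format.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# The resulting file can be loaded with onnxruntime on another machine or platform.
```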

[Figure: Data interoperability in SEDIMARK]

This week, SEDIMARK participated in the workshop on ‘Tech Adoption Scenarios and Data AI 2030’ organised by the LeADS project (Leading Europe’s Advanced Digital Skills), a Coordination and Support Action (CSA) funded by the Digital Europe Programme that, among other objectives, aims to provide guidance for the deployment of the DIGITAL programme’s Advanced Digital Skills (ADS) over the next seven years.

The aim of the workshop was to assess the key predictions developed by LeADS for market adoption within the AI and Data technology areas which included over 80 tech groupings.

From SEDIMARK, University of Surrey (@cvssp_research) contributed to the co-creation exercise facilitated by Martin Robles from BluSpecs, which focused on the definitions of market dynamics in relation to data management and analysis.

The exercise involved assessing:

  1. The applicability of the identified use cases, grouped under AI, BI/Data Science, Cloud, Cybersecurity and IoT, to data management skills relating to data collection, curation, analysis, quality and interoperability.
  2. The hypotheses presented based on several factors, which included legislation, AI automation and Cybersecurity, that could have an impact on these skills.
  3. The relevance and magnitude of the impact that data management skills would have on the different use cases.

Most of the use cases listed were applicable, since data management plays a central role in most technologies, including remote health monitoring, environmental monitoring and detection, manufacturing operations and agricultural field monitoring.

In relation to legislation, compliance oversight of different aspects of data management is expected to increase. A possible impact is significant reservation among developers about providing decentralised and distributed solutions that address today’s massive energy consumption of cloud-based centralised systems by handling data closer to the source, since in many scenarios this requires collaborative data sharing between different data providers. This will inevitably raise concerns about data protection, and compliance entities therefore need to cooperate with and guide, rather than police, tomorrow’s developers.

With regard to AI automation, the hypothesis presented was that more automation of digital systems will be driven by AI throughout an application’s development lifecycle. This would be the case for well-established, uniform processes, but not yet for processes dealing with new or unfamiliar data sources, especially when it comes to handling semantic interoperability.

As for Cybersecurity, an increase in experts in this field is expected due to the increasing federation of data flows and models in systems. This is highly likely to be the case, as there will be a need for auditing mechanisms for checking data integrity and provenance to ensure the correct use of data and AI models.

Finally, when it came to assessing the relevance and magnitude of the impact of data management on today’s main technology, most areas listed were expected to be highly affected by how skills in data management will evolve.

In conclusion, the workshop highlighted possible scenarios on how skills development in the future will be influenced, especially when it comes to balancing innovation in data management and the protection of data.

SEDIMARK knows the importance of regulating data management issues within a context such as the one posed by the project. A solution will be considered where consortium partners will deposit all underlying information on data-related business processes (data storage, data provisioning, processing etc.) of the SEDIMARK solution clearly and transparently.
The purpose of the Data Management Action Plan (DMAP) is to identify the main data management elements that apply to the SEDIMARK project and the consortium. This document is the first version of the DMAP and will be reviewed as soon as there is a clearer understanding of the types of data that will be collected.
Given the wide range of sources from which data will be collected or become available within the project, this document outlines that the consortium partners will consider embracing and applying the Guidelines on FAIR Data Management in Horizon 2020 and Horizon Europe (HE): “In general terms your data should be ‘FAIR’, that is Findable, Accessible, Interoperable and Re-usable”, as information about the data to be collected becomes clearer.