
1. Introduction

As part of the SEDIMARK toolbox, which Users will use to configure AI and Data Processing pipelines for their use cases, AI tasks such as forecasting are made readily available for inference on Data Assets. Time series forecasting has a wide range of applications across various fields, including financial market prediction, weather forecasting, and traffic flow prediction.

In this tutorial, we will use Python to demonstrate the basic AI workflow for time series forecasting, specifically focusing on temperature forecasting for agriculture use cases. Accurate temperature forecasting is crucial for agriculture as it helps farmers plan their activities, manage crops, and optimize yields.

The Jupyter notebook that contains the content of this tutorial can be downloaded from GitHub.

2. Environment Setup

We need to install a few libraries for this experiment. Copy and run the following command in your Python terminal:

pip install numpy pandas matplotlib scikit-learn torch

3. Data Preprocessing

In this section, we generate simulated data and apply the preprocessing steps.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Generate sample data
date_rng = pd.date_range(start='2023-01-01', end='2023-06-30', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['temperature'] = np.random.randint(20, 35, size=(len(date_rng)))

# Set date as index
df.set_index('date', inplace=True)

# Visualize data
df['temperature'].plot(figsize=(12, 6), title='Temperature Time Series')
plt.show()

# Normalize data
scaler = MinMaxScaler(feature_range=(0, 1))
df['temperature_scaled'] = scaler.fit_transform(df['temperature'].values.reshape(-1, 1))

# Split into training and testing sets
train_size = int(len(df) * 0.8)
train, test = df[:train_size], df[train_size:]

# Create dataset for Transformer
def create_dataset(data, time_step=1):
    X, Y = [], []
    for i in range(len(data) - time_step - 1):
        X.append(data[i:(i + time_step), 0])
        Y.append(data[i + time_step, 0])
    return np.array(X), np.array(Y)

time_step = 10
# create_dataset indexes column 0, so reshape the series into a 2D column vector
X_train, y_train = create_dataset(train['temperature_scaled'].values.reshape(-1, 1), time_step)
X_test, y_test = create_dataset(test['temperature_scaled'].values.reshape(-1, 1), time_step)

# Convert to PyTorch tensors
import torch
X_train = torch.tensor(X_train.reshape(X_train.shape[0], time_step, 1), dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)
X_test = torch.tensor(X_test.reshape(X_test.shape[0], time_step, 1), dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)

4. Build a Simple Transformer Model

We use the modules provided by PyTorch to create a basic Transformer model (encoder-decoder).

import torch.nn as nn
import torch.optim as optim

class TransformerModel(nn.Module):
    def __init__(self, num_heads, d_model, num_encoder_layers, num_decoder_layers, dff):
        super(TransformerModel, self).__init__()
        # Project the single temperature feature into the model dimension expected by the Transformer
        self.input_proj = nn.Linear(1, d_model)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads, dim_feedforward=dff, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_encoder_layers)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=num_heads, dim_feedforward=dff, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=num_decoder_layers)
        self.flatten = nn.Flatten()
        # Relies on the global time_step defined during preprocessing
        self.dense1 = nn.Linear(d_model * time_step, dff)
        self.dense2 = nn.Linear(dff, 1)

    def forward(self, src):
        # src: (batch, time_step, 1) -> (batch, time_step, d_model)
        src = self.input_proj(src)
        encoder_output = self.transformer_encoder(src)
        # For this simple model the encoder output serves as both target and memory
        decoder_output = self.transformer_decoder(encoder_output, encoder_output)
        flatten_output = self.flatten(decoder_output)
        dense_output = self.dense1(flatten_output)
        output = self.dense2(dense_output)
        return output

# Hyperparameters
num_heads = 2
d_model = 64
num_encoder_layers = 2
num_decoder_layers = 2
dff = 128

# Create model
model = TransformerModel(num_heads, d_model, num_encoder_layers, num_decoder_layers, dff)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train model
num_epochs = 50
batch_size = 64
train_loader = torch.utils.data.DataLoader(dataset=list(zip(X_train, y_train)), batch_size=batch_size, shuffle=True)

for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs.squeeze(), batch_y)
        loss.backward()
        optimizer.step()
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

5. Model Evaluation

We evaluate the trained model on the training and test sets.

import math
from sklearn.metrics import mean_squared_error

model.eval()
with torch.no_grad():
    train_predict = model(X_train).squeeze().numpy()
    test_predict = model(X_test).squeeze().numpy()

# Inverse transform the predictions
train_predict = scaler.inverse_transform(train_predict.reshape(-1, 1))
test_predict = scaler.inverse_transform(test_predict.reshape(-1, 1))
y_train = scaler.inverse_transform(y_train.numpy().reshape(-1, 1))
y_test = scaler.inverse_transform(y_test.numpy().reshape(-1, 1))

# Calculate RMSE
train_score = math.sqrt(mean_squared_error(y_train, train_predict))
test_score = math.sqrt(mean_squared_error(y_test, test_predict))
print(f'Train Score: {train_score} RMSE')
print(f'Test Score: {test_score} RMSE')

# Visualize predictions
plt.figure(figsize=(12, 6))
plt.plot(df['temperature'], label='Actual Data')
plt.plot(df.index[time_step:train_size-1], train_predict, label='Train Predict')
plt.plot(df.index[train_size+time_step:-1], test_predict, label='Test Predict')
plt.legend()
plt.show()

6. Conclusion

This tutorial demonstrates how to use a basic Transformer model for time series forecasting, specifically for temperature prediction in agriculture. Accurate temperature forecasting is essential for agricultural planning and decision-making, helping farmers optimize crop management and improve yields. Through this example, readers can gain a fundamental understanding of applying Transformers to time series forecasting and further research and optimize the model for better prediction performance.




In today’s fast-paced digital landscape, effective data processing is a critical component for any organization looking to derive insights and drive innovation. However, setting up data pipelines - from extraction to transformation and loading (ETL) - has traditionally required a high level of expertise. We’re happy to announce a solution that could democratize data orchestration for users of all experience levels: the SEDIMARK Data Orchestrator powered by Mage.ai and enhanced with Generative AI.

Simplified Data Processing with AI Assistance

Our platform integrates the power of Large Language Models (LLMs) to automatically generate Mage.ai pipeline blocks, helping even those with minimal technical background create robust data workflows. Instead of spending hours - or even days - writing code and configuring pipelines, users now simply need to upload their dataset into the Orchestrator GUI.

Once the dataset is in place, the system, with a bit of guidance through a helpful prompt, takes over the heavy lifting. Using generative AI, the platform produces custom Mage.ai templates and workflows specifically tailored to your data. This eliminates the need for users to dive deep into code or ETL specifics.

How It Works

Whether you’re dealing with provider-supplied traffic data, weather records, or IoT data streams, the process starts with uploading your dataset into the Orchestrator GUI.

With the help of generative AI and LLMs, the platform processes the data structure and requirements, and instantly generates Mage.ai pipeline blocks.

These blocks are based on pre-defined templates for tasks like data cleaning, transformation, anomaly detection, and prediction, while retaining the flexibility to adapt to any dataset. You no longer have to start from scratch.

As a less experienced user, all you need is a brief guiding prompt. The system understands the context of the data and the desired outcome, and it provides a workflow that’s ready to run.
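To make this concrete, here is a minimal sketch of the kind of transformer block the generator might emit for an uploaded time series dataset. It follows the standard Mage.ai decorator-based block structure, but the cleaning steps and the 'temperature' column name are illustrative assumptions rather than the actual output of the SEDIMARK Orchestrator.

import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def clean_temperature_data(data: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Illustrative cleaning steps an LLM-generated block could contain:
    # drop exact duplicates and interpolate short gaps in the 'temperature' column.
    data = data.drop_duplicates()
    if 'temperature' in data.columns:
        data['temperature'] = data['temperature'].interpolate(limit=3)
    return data


@test
def test_output(output, *args) -> None:
    # Mage runs this check automatically after the block executes
    assert output is not None, 'The output is undefined'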

Democratizing Data Engineering

Data engineering has often been a domain reserved for those with extensive technical know-how. With the introduction of our Generative AI-powered Data Orchestrator, this is no longer the case. By reducing the complexity and time involved in configuring ETL pipelines, we’re empowering organizations to:

  1. Accelerate time-to-value. With AI doing most of the setup work, teams can focus on what truly matters: extracting insights from their data, not configuring workflows.
  2. Reduce the learning curve. No more spending weeks learning the intricacies of ETL processes. With our platform, even inexperienced users can be up and running in no time.
  3. Produce customizable workflows. While the platform provides default templates, advanced users still have the flexibility to customize their pipelines to meet more complex or specific requirements.

What’s Next?

With the launch of this new capability, we’re excited to see how businesses will leverage it to innovate. Whether you’re building predictive models, automating anomaly detection, or simply making data-driven decisions faster, the Sedimark Data Orchestrator simplifies every step of the process.

In today’s data-driven world, finding relevant datasets is crucial for researchers, data scientists and businesses. This has led to the development of dataset recommendation systems. Just as the movie recommendation system used by Netflix guides users to discover the most relevant movies, a dataset recommender system aims to help users navigate the complex landscape of dataset discovery efficiently.

Recommender systems learn to analyse user behaviour to make intelligent suggestions. This can be achieved with a variety of techniques, ranging from content-based filtering approaches, which focus on item descriptions and recommend items similar to those the user has previously interacted with; through collaborative filtering approaches, which recommend items based on the interactions of similar users; to hybrid approaches combining content-based and collaborative filtering.
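As a minimal illustration of the content-based approach, the sketch below scores a few candidate datasets against a user query by comparing TF-IDF vectors of their metadata descriptions; the dataset names and descriptions are invented for the example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented metadata descriptions for candidate datasets
datasets = {
    "helsinki-traffic-flow": "hourly traffic flow counts from road sensors in Helsinki",
    "air-quality-stations": "daily air quality measurements from urban monitoring stations",
    "farm-soil-moisture": "soil moisture and temperature readings from agricultural IoT probes",
}

query = "temperature sensor data for agriculture"

# Content-based filtering: represent descriptions and the query as TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(list(datasets.values()) + [query])

# Cosine similarity between the query (last row) and every dataset description
scores = cosine_similarity(doc_matrix[-1], doc_matrix[:-1]).ravel()
for name, score in sorted(zip(datasets, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.2f}")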

High-quality recommender systems provide many benefits to users, such as efficiency (automating the search for relevant items), personalisation (improving user satisfaction) and enhanced discovery (exposing users to items they may not have been aware of).

It is clear that an efficient recommender system has many advantages across various domains, and dataset recommendation is no different. However, with the exponential growth of data, often residing in various locations, finding the right dataset for a specific task or project has become increasingly challenging [1]. Moreover, numerous datasets lack high-quality descriptions, making discovery even harder [2]. This is particularly important for content-based recommender systems, as they rely on high-quality metadata: insufficient dataset metadata therefore undermines effective dataset recommendation, since high-quality recommendations depend on high-quality metadata.

In SEDIMARK, we aim to address the challenge of poor quality metadata in dataset recommendation with the development of novel techniques for dataset metadata enrichment. With automatic and efficient metadata enrichment, SEDIMARK can improve the overall user experience and dataset discoverability and drive better decision-making for the future.

[1] Chapman, Adriane, et al. "Dataset search: a survey." The VLDB Journal 29.1 (2020): 251-272.

[2] Reis, Juan Ribeiro, Flavia Bernadini, and Jose Viterbo. "A new approach for assessing metadata completeness in open data portals." International Journal of Electronic Government Research (IJEGR) 18.1 (2022): 1-20.

Have you ever wondered how a smart city manages to keep everything from urban planning to environmental monitoring running smoothly? The answer lies in something called Spatial Data Infrastructure (SDI). While it might sound technical, the SDI framework plays a crucial role in making geographic information accessible and integrated, benefiting everyone.

Imagine a world where data about locations – from urban planning maps to environmental monitoring systems – is at your fingertips. SDI turns this vision into reality. By connecting data, technology, and people, SDI helps improve decision-making and efficiency in numerous areas of our lives.

Smart City: SEDIMARK Helsinki Pilot and Spatial Data

The SEDIMARK Helsinki pilot aims to demonstrate how Digital Twin technology can revolutionize urban mobility, with spatial data as the backbone. SEDIMARK's context broker (NGSI-LD) handles linked data, property graphs, and semantics using three main constructs: Entities, Properties, and Relationships. This integration opens up opportunities for new services and the development of a functional city, aiming to enhance geospatial data integration within urban digital twins. In Helsinki, the focus is on transitioning from a monolithic architecture to a modular, API-driven approach, developing Digital Twin viewers and tools, and collaborating on city-wide geospatial data.
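As a rough illustration of these three constructs, the Python snippet below builds a simplified NGSI-LD style entity for a traffic observation; the identifiers, attribute names and values are invented and do not come from the Helsinki pilot data.

import json

# Invented NGSI-LD style entity: one Property, one GeoProperty and one Relationship
traffic_observation = {
    "id": "urn:ngsi-ld:TrafficFlowObserved:helsinki:001",
    "type": "TrafficFlowObserved",
    "intensity": {
        "type": "Property",
        "value": 347,
        "observedAt": "2024-05-01T08:00:00Z",
    },
    "location": {
        "type": "GeoProperty",
        "value": {"type": "Point", "coordinates": [24.9384, 60.1699]},
    },
    "refRoadSegment": {
        "type": "Relationship",
        "object": "urn:ngsi-ld:RoadSegment:helsinki:mannerheimintie",
    },
    "@context": ["https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"],
}

print(json.dumps(traffic_observation, indent=2))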

Join us on this journey as we dive into the world of Spatial Data Infrastructure and see how it's making our city smarter, more efficient, and better prepared for the future.

Photo credit. https://materialbank.myhelsinki.fi/images/attraction?sort=popularity&openMediaId=6614

When we think of data, especially from diverse traffic sources, beauty isn't typically the first thing that comes to mind. Instead, we imagine numbers, graphs, and charts, all designed to convey information quickly and efficiently. However, what if we could see data not just as a tool for analysis, but as a source of inspiration, capable of producing visuals as captivating as a masterpiece by Vincent van Gogh? Just like van Gogh's "Starry Night" finds beauty in complexity and chaos, we can render data into beautiful, meaningful visualizations.

The Complexity of Traffic Data

Traffic data is inherently complex. It comes from a variety of interoperable systems and devices. Each source provides a different perspective, capturing the flow of vehicles, the density of traffic, and the speed of travel at any given time. When combined, these data points create a comprehensive picture of urban movement.

From Chaos to Clarity

Much like art that seems chaotic yet harmonious, raw traffic data can appear overwhelming. However, through careful visualization and simulation, patterns and insights emerge. Advanced algorithms process the data, identifying trends and correlations that aren't immediately apparent. For instance, heat maps can show areas of high congestion, while flow diagrams can illustrate the movement of vehicles through a city over time.
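As a small, self-contained illustration of the heat-map idea, the sketch below renders synthetic hourly traffic counts per district with matplotlib; the data and the district labels are invented for the example.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic traffic counts: rows are city districts, columns are hours of the day
rng = np.random.default_rng(0)
counts = rng.poisson(lam=np.linspace(50, 400, 24), size=(5, 24))

fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(counts, aspect="auto", cmap="inferno")
ax.set_xlabel("Hour of day")
ax.set_ylabel("District")
ax.set_title("Synthetic traffic intensity heat map")
fig.colorbar(im, ax=ax, label="Vehicles per hour")
plt.show()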

The Beauty of Data

Data visualization is an art form in its own right. The choice of colors, shapes, and lines can transform a simple graph into a work of art. For example, a time-lapse visualization of traffic flow can resemble the dynamic motion in an urban city with streams of vehicles.

Helsinki Mobility Digital Twin

The Helsinki mobility digital twin paves the way for a future where cities fully leverage their data. This data-driven revolution, fueled by powerful data visualization, holds immense potential for creating a more efficient, sustainable, and safer urban transportation landscape.

So, can traffic data be beautiful? Absolutely. All it takes is the right perspective and a touch of creativity to turn numbers into a work of art.


In the modern era of big data, the challenge of integrating and analyzing data from various sources has become increasingly complex. Different data providers often use diverse formats and structures, leading to significant challenges in achieving data interoperability. This complexity necessitates robust mechanisms to convert and harmonize data, ensuring they can be effectively used for analysis and decision-making. SEDIMARK has identified two critical components in this process and is actively working on them: a data formatter and a data mapper.

A data formatter is designed to convert data from various providers, each using different formats, into the NGSI-LD standardized format. This standardization is crucial because it allows data from disparate sources to be compared, combined, and analyzed in a consistent manner. Without a data formatter, the heterogeneity of data formats would pose a significant barrier to interoperability. For example, data from one provider might be in XLSX format, another's in JSON, and yet another's in CSV. A data formatter processes these different formats, transforming them into a unified format that can be easily managed and analyzed by SEDIMARK tools.
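A minimal sketch of the formatting idea is shown below, assuming pandas-readable inputs; the function names, entity type and attribute mapping are illustrative and are not the actual SEDIMARK formatter API.

from pathlib import Path
import pandas as pd

def load_any(path):
    # Read CSV, JSON or XLSX sources into a common tabular form
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".json":
        return pd.read_json(path)
    if suffix == ".xlsx":
        return pd.read_excel(path)
    raise ValueError(f"Unsupported format: {suffix}")

def to_ngsi_ld(df, entity_type):
    # Turn each row into a simple NGSI-LD style entity (illustrative attribute mapping)
    entities = []
    for i, row in df.iterrows():
        entity = {"id": f"urn:ngsi-ld:{entity_type}:{i}", "type": entity_type}
        for column, value in row.items():
            entity[column] = {"type": "Property", "value": value}
        entities.append(entity)
    return entities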

A data mapper comes into play after data processing, storing the data and mapping it to a specific data model. This process involves not only aligning the data with the model but also enriching it with quality metrics and metadata. During this stage, the data mapper adds valuable information about data quality obtained during the data processing step, such as identified outliers with their corresponding anomaly scores, and missing or redundant data. This enriched data model becomes a powerful asset for future analyses, giving a complete picture of the data.

By converting various data formats into a standard format and then mapping and enriching the data, SEDIMARK achieves a higher level of data integration. This process ensures that data from multiple sources can be used together seamlessly, facilitating more accurate and comprehensive analyses. Moreover, the inclusion of data quality metrics during the mapping process adds a layer of reliability and trustworthiness to the data. Information about outliers, missing data, and redundancy is crucial for data scientists and analysts, as it allows them to make informed decisions and apply appropriate processing techniques.
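The sketch below illustrates the enrichment idea only: simple quality indicators computed during processing are attached to a mapped entity as an additional property. The metric names and the z-score outlier rule are assumptions made for this example, not the actual SEDIMARK mapper.

import numpy as np
import pandas as pd

def enrich_with_quality(entity, series):
    # Attach simple quality metrics computed during processing (illustrative names)
    z_scores = np.abs((series - series.mean()) / series.std())
    entity["dataQuality"] = {
        "type": "Property",
        "value": {
            "missingRatio": float(series.isna().mean()),
            "duplicateRatio": float(series.duplicated().mean()),
            "outlierCount": int((z_scores > 3).sum()),
            "maxAnomalyScore": float(z_scores.max()),
        },
    }
    return entity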

The first letter in SEDIMARK stands for Secure. How is Security involved in SEDIMARK? In this blog post we will present an overview of the Security and Trust Domain within SEDIMARK!

Nowadays, the proliferation of large amounts of data makes it essential to ensure the security and integrity of the information exchanged. Traditionally, centralized data marketplaces face security challenges such as data manipulation, unauthorized access, and lack of transparency.

In response to these challenges, Distributed Ledger Technology (DLT) has emerged as an alternative solution, offering decentralized (see "The letter D in SEDIMARK"), immutable, and transparent data exchange mechanisms.

Enhancing Security in Data Exchange

Centralized data marketplaces are susceptible to various security vulnerabilities, including single points of failure and data breaches.

Using DLT mitigates these risks: control is decentralized and cryptographic mechanisms ensure security.

In the SEDIMARK Marketplace, participants can securely exchange data without relying on third parties (intermediaries), reducing the risk of unwanted data manipulation or unauthorized access to their data (or, more generally, their assets).

Security Features

SEDIMARK will employ (... or is it already?!) key features enabled by DLT, such as smart contracts, Self-Sovereign Identity (SSI), and cryptographic primitives to enhance security and transparency of the Marketplace.

The Smart Contracts automate the execution of agreements between parties, ensuring trustless and tamper-proof transactions.

SSI allows users of the Marketplace to retain full control over their own identity, without relying on centralized authorities (see A Matter of Identities).

Finally, the cryptographic primitives are the underlying functions to ensure data security and integrity.

Ensuring Data Origin

Applying cryptographic functions, such as a digest (hash), produces a practically unique fingerprint for a given asset.

Recording (or "anchoring") this value onto the DLT creates an immutable data trail.

So, every user can be certain of the origin of the asset they are purchasing.

This also adds transparency, enhancing the Trust in this distributed marketplace.
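A minimal sketch of the fingerprinting step is shown below, using Python's standard hashlib; the asset metadata is invented, and the anchoring of the digest onto the DLT is left as a comment because it depends on the specific ledger used.

import hashlib
import json

def asset_digest(asset):
    # Serialize the asset deterministically and hash it with SHA-256
    canonical = json.dumps(asset, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

asset = {"name": "helsinki-traffic-2024", "rows": 10512, "license": "CC-BY-4.0"}
print(asset_digest(asset))

# The digest would then be anchored (recorded) on the DLT; a buyer can later
# recompute the digest of the delivered asset and compare it with the anchored value.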

SEDIMARK exploits traditional cryptographic mechanisms as well as DLT to freshen up the data (asset!) exchange mechanism and to secure the Marketplace.

Do you want to know more? Stay tuned for next blog posts by signing up to our newsletter below.

Follow us on @Twitter / X and LinkedIn.

* Source image: shutterstock

The “D6.3 Dissemination and Impact creation activities. First version” deliverable presents the ongoing and completed activities, communication and dissemination material, along with the current status of the Key Performance Indicators (KPIs) for these activities. It also covers the current efforts for cooperation with other projects and associations.
During the first half of the project lifetime, a number of dissemination and communication activities have been carried out, reaching a large and varied audience, including users, citizens, other research projects and the scientific community.
The content of the current document will be continued in the deliverable SEDIMARK_D6.4 Dissemination and Impact creation activities, which is due in M36 (September 2025).

This document, along with all the public deliverables and documents produced by SEDIMARK, can be found in the Publications & Resources section.

Data has become a growing business of the utmost importance in recent years of IoT technological expansion, driving crucial decision-making systems at the EU and global level and impacting all domains: industry, politics and the economy, society and individuals, as well as the environment.

As the volume of incoming data being collected, stored and processed constantly expands, the systems and techniques needed to absorb such data efficiently, appropriately and in a scalable manner are often lacking or rapidly overwhelmed by successive technological revolutions. Of even greater concern is the quantity of circulating private and sensitive information linked to individuals and organizations. Consequently, data is often insufficiently managed and maintained, misunderstood due to its complexity, lacking in high quality standards and ill-adapted to large-scale AI analytics, which in turn leads to inappropriate handling, sharing and misuse of data across borders and domains, even when it nominally conforms to the European GDPR and FAIR* principles!

For this reason, SEDIMARK uses a data orchestrator called “Mage.ai” to: (i) better organize the integration of multiple data sources, applications, toolboxes, services and systems, (ii) make data workflows scalable to improve performance and reduce bottlenecks, (iii) ensure data consistency, harmony and the highest quality, (iv) guarantee data privacy and security compliant with EU regulations through anonymization and decentralized systems, and finally (v) minimize and mitigate potential risks by automating schedules for data and system maintenance, monitoring and alerting procedures. On top of all this, the orchestrator enables all data actors to easily manage, adapt and visualize the state of their data.

(*) Findable, Accessible, Interoperable, Reusable
