Open Data

The open datasets developed by our team are listed below.

AI4NTD KK2.0 P1.5 STH & SCHm Dataset

Dataset for detecting Soil-Transmitted Helminths and Schistosoma mansoni eggs in Kato-Katz smears

Soil-transmitted helminth (STH) infections are caused by several species of parasitic worms, commonly known as roundworms, whipworms, and hookworms. Approximately 1.5 billion people are infected with soil-transmitted helminths worldwide, and infected children are nutritionally and physically impaired. The worms are transmitted by eggs present in human faeces, which contaminate the soil in areas where sanitation is poor.

The global targets for eliminating soil-transmitted helminthiasis morbidity depend on accurate assessment of the prevalence and intensity of infections in affected populations. The WHO 2030 roadmap sets the goal of eliminating STH and schistosomiasis (SCH) infections as a public health problem by 2030.

To support and accelerate this roadmap, we have been developing a low-cost slide scanner to digitise Kato-Katz thick stool smears. Using the scanner, we have collected and annotated a substantial dataset covering four types of helminth eggs.

Context-aware lifestyle monitoring

Real-world dataset for lifestyle monitoring through Human Activity Recognition (HAR) with a wrist-worn wearable device.

In recent years, advances in wearable technology have made these devices usable in a wide range of applications, including healthcare. Wearable devices are equipped with motion sensors that allow for activity tracking. They are comfortable, non-stigmatizing, and unobtrusive, which makes them ideal for continuous fitness and lifestyle monitoring and evaluation.

This dataset consists of accelerometer data from a wrist-worn wearable device (Empatica E4). The data was collected in the real world, in the participants’ natural environment, over several weeks and without a collection protocol.

This dataset is licensed under CC BY-SA 4.0.

Data Analytics for Health and Connected Care

The Data Analytics for Health and Connected Care (DAHCC) resource provides both the means and the data to describe connected care applications, the sensors used to build such applications, and their link to the people involved in them (e.g. patients and healthcare professionals).

The DAHCC resource consists of three main components:

  • A large dataset of daily life activities, provided in both raw and knowledge graph format.
  • The DAHCC Ontology, which captures care, patient, daily life activity recognition, and lifestyle domain knowledge.
  • Connected Care Applications that show the potential of combining data with ontological metadata.

Ghent Semi-spontaneous Speech Paradigm validation dataset

The purpose of the Ghent Semi-spontaneous Speech Paradigm (GSSP) dataset is to validate a newly developed speech acquisition paradigm by examining the speech style. The dataset contains more than 1000 raw audio recordings from over 80 participants. Each participant described 30 images with a consistent emotional load using the GSSP method and read aloud a fixed text seven times. The data was collected using an online web application whose code and accompanying analysis notebooks can be found on GitHub.

Examining Motor Current Sensors of Low and High Quality: Ablation Study on Window Sizes with a Fault-Emulating Setup

This dataset consists of motor current data obtained from an experimental setup made available by the University of Ghent. The setup can simulate bearing wear and tear caused by various load conditions. The emulated faults are misalignment faults of varying severity, which can be applied in both the vertical and horizontal directions. For both directions, the misalignment severity ranges from -0.5 mm to 0.5 mm in steps of 0.1 mm. Two current sensors are used to sample the signal: a low-resolution Fluke i400s current clamp and a high-resolution Tektronix TCP-series current clamp with a complementary amplifier.
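
As a rough illustration of the severity grid described above and of the kind of window-size comparison the dataset title refers to, the following minimal Python sketch enumerates the misalignment levels and cuts a signal into fixed-length windows. The signal, the window sizes, the step, and the helper name sliding_windows are all hypothetical; this is not the authors' processing pipeline.

```python
import math

# Misalignment severities as described in the text: -0.5 mm to 0.5 mm in
# 0.1 mm steps. Working in integer tenths avoids floating-point drift.
severities_mm = [tenths / 10 for tenths in range(-5, 6)]


def sliding_windows(signal, window_size, step):
    """Split a 1-D sequence into fixed-length, possibly overlapping windows.

    Hypothetical helper for illustration; trailing samples that do not fill
    a complete window are dropped.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(signal) - window_size
    return [signal[s:s + window_size] for s in range(0, last_start + 1, step)]


if __name__ == "__main__":
    # Synthetic stand-in for one motor current recording (not real data).
    current = [math.sin(2 * math.pi * i / 50) for i in range(1000)]

    print("severity levels per direction:", len(severities_mm))
    for window_size in (100, 250, 500):  # illustrative window sizes only
        windows = sliding_windows(current, window_size, step=window_size // 2)
        print(window_size, len(windows))
```

Each window would typically be turned into features or fed to a model directly; comparing results across window sizes is then a matter of repeating the loop with the real recordings.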

All relevant data can be found here, while the corresponding source code can be found on GitHub. The paper describing the methodology, data and code can be found here.

In-field leak experiments in operational water distribution networks

Between 20 and 30% of the drinking water produced is lost through leaks in water distribution pipes. In times of water scarcity, losing so much treated water comes at a significant cost, both environmentally and economically.

Our hybrid leak localization approach combines model-based and data-driven modeling: pressure heads for leak scenarios are simulated with a hydraulic model and then used to train a machine-learning-based leak localization model. Data from in-field leak experiments in operational water distribution networks were collected to evaluate this approach on realistic test data and are provided as open data.
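
The idea of training on simulated leak scenarios and then testing on measured data can be sketched in a few lines of Python. Everything in this toy example is an assumption made for illustration: the one-dimensional network, the sensor positions, the exponential pressure-drop function standing in for a hydraulic model, and the 1-nearest-neighbour classifier standing in for the machine-learning model. It does not reproduce the method or data of the study.

```python
import math

# Hypothetical 1-D network: positions (in metres) of candidate leak nodes
# and pressure sensors along a single pipe. Not one of the real networks.
CANDIDATE_NODES = [0.0, 100.0, 200.0, 300.0, 400.0]
SENSORS = [50.0, 250.0, 450.0]


def toy_hydraulic_model(leak_position, leak_size=1.0):
    """Stand-in for a hydraulic simulator: pressure head drop per sensor.

    Assumption for illustration only: the drop decays exponentially with
    the distance between leak and sensor.
    """
    return [leak_size * math.exp(-abs(s - leak_position) / 150.0)
            for s in SENSORS]


def build_training_set(leak_sizes=(0.5, 1.0, 1.5)):
    """Model-based step: simulate a leak of each size at every candidate node."""
    return [(toy_hydraulic_model(node, size), node)
            for node in CANDIDATE_NODES
            for size in leak_sizes]


def normalize(vector):
    """Scale to unit length so the signature shape matters, not the leak size."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def localize(measured_drop, training_set):
    """Data-driven step: return the node of the nearest simulated signature."""
    measured = normalize(measured_drop)
    nearest = min(training_set,
                  key=lambda sample: math.dist(measured, normalize(sample[0])))
    return nearest[1]


if __name__ == "__main__":
    training_set = build_training_set()
    # Pretend field measurement: a leak near node 200 m with a small
    # hand-picked perturbation on each sensor reading.
    clean = toy_hydraulic_model(200.0, leak_size=0.8)
    measured = [c + e for c, e in zip(clean, (0.01, -0.01, 0.005))]
    print("estimated leak node:", localize(measured, training_set))
```

In-field leak experiments play the role of the measured vector here: they show whether a model trained purely on simulated scenarios still points to the right location when given real sensor readings.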

All relevant data can be found here, while the corresponding source code can be found on GitHub. The paper describing the methodology, data and code can be found here.

Tryp Microscopy Images Dataset

The Tryp dataset provides bounding box annotations for detecting Trypanosoma brucei brucei in microscopy images of unstained thick blood smears.

Extracting the Tryp.zip file reveals three main folders: positive_images, negative_images, and videos. The videos folder holds the originally recorded videos from which the images in the Tryp dataset were extracted, categorized into positive and negative folders. The positive_images folder contains three subfolders: train, validation, and test. Each of these contains two further folders, images and labels, and a JSON file. The images and labels folders hold the images and their corresponding annotation files in a format compatible with the YOLOv7 model. The JSON files hold the annotations in the MS COCO format, which is suitable for training the Faster R-CNN and RetinaNet models using the Torchvision implementations.
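
The folder layout described above can be inspected with a short Python sketch such as the one below. It rests on a few assumptions not stated in the text: that the archive was extracted to a folder named Tryp, that the label files use the usual YOLO convention (.txt files with one "class x_center y_center width height" line per box, coordinates normalized to the image size), and that each JSON file follows the standard COCO layout with a top-level "annotations" list. Because the JSON file names are not given, the sketch simply picks up any .json file in each split folder. The function names are illustrative.

```python
import json
from pathlib import Path

SPLITS = ("train", "validation", "test")


def yolo_to_coco_bbox(x_center, y_center, width, height, img_w, img_h):
    """Convert a normalized YOLO box to COCO-style [x_min, y_min, w, h] pixels."""
    w = width * img_w
    h = height * img_h
    return [x_center * img_w - w / 2, y_center * img_h - h / 2, w, h]


def read_yolo_labels(label_file):
    """Parse one YOLO label file into (class_id, xc, yc, w, h) tuples."""
    boxes = []
    for line in Path(label_file).read_text().splitlines():
        parts = line.split()
        if len(parts) == 5:  # skip blank or malformed lines
            boxes.append((int(parts[0]), *map(float, parts[1:])))
    return boxes


def summarize(root):
    """Count images, YOLO label files, and COCO annotations per split."""
    summary = {}
    positive = Path(root) / "positive_images"
    for split in SPLITS:
        split_dir = positive / split
        counts = {"images": 0, "labels": 0, "coco_annotations": 0}
        if split_dir.is_dir():
            images_dir = split_dir / "images"
            labels_dir = split_dir / "labels"
            if images_dir.is_dir():
                counts["images"] = sum(
                    1 for p in images_dir.iterdir() if p.is_file())
            if labels_dir.is_dir():
                counts["labels"] = len(list(labels_dir.glob("*.txt")))
            # The JSON file name is not specified, so accept any *.json.
            for json_file in split_dir.glob("*.json"):
                with open(json_file, encoding="utf-8") as handle:
                    data = json.load(handle)
                counts["coco_annotations"] += len(data.get("annotations", []))
        summary[split] = counts
    return summary


if __name__ == "__main__":
    # "Tryp" is an assumed extraction path; adjust it to the actual location.
    for split, counts in summarize("Tryp").items():
        print(split, counts)
```

Comparing the number of label files with the number of images per split is a quick sanity check before training, and yolo_to_coco_bbox shows how the two annotation formats relate: YOLO stores a normalized box centre and size, while COCO stores the top-left corner and size in pixels.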