The ImageCLEFmedical GANs dataset is designed for the task of detecting training data influence and generative model fingerprints in synthetic medical images. It is part of the broader ImageCLEF benchmark campaign, under the medical domain track, and has been featured across three consecutive editions (2023–2025). The dataset comprises real and synthetic axial lung CT scan slices of tuberculosis patients, with synthetic images generated using various GAN and diffusion-based models. Ground truth annotations include whether a real image was used in the training set of a generative model, as well as the specific generative model that produced each synthetic sample. The dataset supports two main subtasks: identifying the presence of real-image fingerprints in generated content, and clustering or linking synthetic images based on their model of origin or training subset. Data is organized into structured folders with 256×256 PNG images, and includes accompanying metadata to support evaluation. The resource is anonymized to ensure patient privacy and built from freely distributable data sources. The ImageCLEFmedical GANs dataset is co-developed by the ImageCLEF organizing committee, and is aimed at researchers working in generative model analysis, synthetic image forensics, and medical data privacy.
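To make the first subtask concrete, here is a minimal, hedged baseline sketch: load the 256×256 PNG slices and flag a real image as a likely training-set member when it is unusually similar to some synthetic image. The folder names and similarity threshold are illustrative assumptions, not the official layout or evaluation protocol.

```python
# Minimal sketch of a nearest-neighbour baseline for the fingerprint-detection
# subtask: flag a real image as "used in training" when it is unusually close
# to some synthetic image. Folder names and the threshold are hypothetical.
from pathlib import Path

import numpy as np
from PIL import Image

def load_gray(folder: str) -> np.ndarray:
    """Load all 256x256 grayscale PNGs in a folder as L2-normalized flat vectors."""
    imgs = [np.asarray(Image.open(p).convert("L"), dtype=np.float32).ravel()
            for p in sorted(Path(folder).glob("*.png"))]
    x = np.stack(imgs)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

real = load_gray("real/")            # hypothetical folder of real CT slices
synthetic = load_gray("generated/")  # hypothetical folder of synthetic slices

# Cosine similarity of each real image to its nearest synthetic neighbour.
nearest = (real @ synthetic.T).max(axis=1)
used_in_training = nearest > 0.95    # illustrative threshold, to be tuned
```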
#2
FaVCI2D - Face Verification with Challenging Imposters and Diversified Demographics Dataset
The FaVCI2D dataset is designed for the task of face verification under realistic and difficult conditions, focusing on detecting challenging imposter pairs among a demographically diverse population. It addresses limitations of earlier benchmarks by including visually similar, but distinct, imposter pairs selected through deep-representation similarity and manual verification from a large, global pool of identities. It contains thousands of carefully curated genuine and imposter face pairs, with balanced representation across gender, age, and geographic origin, along with metadata specifying gender, country, and age to enable fine-grained analysis. Built exclusively from freely redistributable resources under appropriate licensing, and constructed with explicit legal and ethical considerations (including data minimization, GDPR compliance, and secure handling of sensitive image rights), FaVCI2D is fully anonymized and non-commercial in scope. Experiments with state-of-the-art deep models show a significant performance drop compared to standard datasets, confirming its effectiveness in revealing real-world verification challenges. Developed collaboratively by AI4Media partners CEA List and University Politehnica of Bucharest (UPB), it is aimed at researchers interested in face verification robustness, fairness, bias mitigation, and privacy-aware evaluation.
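The hard-imposter selection described above can be illustrated with a short sketch: given face embeddings and identity labels, pick for each image the most similar image belonging to a different identity. The input names are placeholders, and the actual FaVCI2D construction also included manual verification.

```python
# Illustrative sketch of hard-imposter mining via deep-representation
# similarity: for each probe, pick the most similar *different* identity.
# Embeddings and identity labels are assumed to be given as arrays.
import numpy as np

def mine_imposters(embeddings: np.ndarray, identities: np.ndarray) -> list[tuple[int, int]]:
    """Return (probe, imposter) index pairs with maximal cross-identity similarity."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                    # cosine similarity matrix
    same = identities[:, None] == identities[None, :]
    sim[same] = -np.inf                              # exclude genuine pairs
    return [(i, int(np.argmax(sim[i]))) for i in range(len(sim))]
```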
#3
ImageCLEFfusion - Late Fusion Dataset
The ImageCLEFfusion - Late Fusion Dataset is designed for the evaluation of decision-level ensemble techniques (late fusion) in multimedia learning tasks, and is part of the ImageCLEF benchmark campaign, with multiple editions since 2022. It provides a standardized setup where participants fuse outputs from a fixed set of pre-computed inducers rather than training new models, enabling direct comparison of fusion strategies across tasks. The dataset includes three subtasks: ImageCLEFfusion-int, focused on media interestingness regression with 33 inducers and over 2,400 samples; ImageCLEFfusion-div, centered on social image retrieval diversity with 117 inducers across 139 search queries; and ImageCLEFfusion-cap, introduced in 2023, which targets multi-label medical concept detection with 85 inducers and over 7,000 chest X-ray images. Each subtask includes development and test splits, inducer outputs, and evaluation tools for metrics such as F1-score, MAP@10, and ClusterRecall@20. Ground truth is provided only for the development sets to ensure fair, blind testing. The dataset is curated from freely distributable resources, anonymized to protect privacy, and accompanied by tools and metadata to support robust fusion research. Developed by AI4Media partners and the ImageCLEF organizing team, the dataset is intended for researchers working on ensemble learning, model complementarity, and robust decision-level fusion in subjective multimedia analysis.
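As a concrete illustration of the setup, a minimal late-fusion baseline for the interestingness regression subtask could fit per-inducer weights on the development split and apply them to the test split. The array names and shapes below are assumptions, not the official submission format.

```python
# Minimal late-fusion sketch in the spirit of ImageCLEFfusion-int: combine
# the fixed inducers' regression outputs by a weighted sum, with weights
# fit on the development split by least squares.
import numpy as np

def fit_weights(dev_preds: np.ndarray, dev_truth: np.ndarray) -> np.ndarray:
    """dev_preds: (n_samples, n_inducers); returns least-squares fusion weights."""
    w, *_ = np.linalg.lstsq(dev_preds, dev_truth, rcond=None)
    return w

def fuse(test_preds: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One fused interestingness score per test sample."""
    return test_preds @ w

# e.g., with the 33 interestingness inducers:
# w = fit_weights(dev, truth); fused = fuse(test, w)
```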
#4
ImageCLEFaware - ImageCLEF Social Media User Data Awareness Dataset
Images constitute a large part of the content shared on social networks. Their disclosure is often tied to a particular context, and users are often unaware that, depending on their privacy settings, images can be accessible to third parties and used for purposes that were initially unforeseen. For instance, it is common practice for employers to search for information about future employees online; another example is automatic credit scoring based on online data. Most existing approaches that provide feedback about shared data focus on inferring user characteristics, and their practical utility is rather limited. We hypothesize that user feedback would be more efficient if conveyed through the real-life effects of data sharing. The objective of the task is therefore to automatically score user photographic profiles in a series of situations with a strong impact on their lives. Four such situations are modeled and refer to searching for: (1) a bank loan, (2) an accommodation, (3) a job as a waiter/waitress, and (4) a job in IT. The inclusion of several situations makes it clear to end users that the same image will be interpreted differently depending on the context. The final objective of the task is to encourage the development of efficient user feedback tools, such as the YDSYO Android app.
#5
PMMD - Predicting Media Memorability Dataset
This dataset is intended to be used for assessing the prediction of how memorable a video will be. The PMMD dataset is a subset of a collection of 12,000 short videos retrieved from TRECVid and Memento10k. The dataset is annotated for short- and long-term memorability and, in its latest version, features three subtasks: a prediction subtask, a generalization subtask, and an EEG-based subtask. The dataset is generated from freely distributable data and is anonymized in order to protect users' privacy. It is addressed to researchers interested in the prediction of short- and long-term memorability. AI4Media partners UPB and InterDigital are co-creators of this dataset, along with the University of Essex, Dublin City University, and the Massachusetts Institute of Technology.
#6
LRRo - A Lip Reading Data Set for the Under-resourced Romanian Language
The LRRo dataset is designed for the task of visual speech recognition (lip reading) in an under-resourced language, making it the first publicly available Romanian lip reading dataset. It comprises two word-level subcollections: Wild LRRo, sourced from Romanian TV shows, news broadcasts, and TEDx talks, containing over 35 speakers, ~1.1k word instances, a vocabulary of 21 words, and more than 20 hours of video; and Lab LRRo, collected in a controlled environment with 19 speakers, ~6.4k word instances, a vocabulary of 48 words, and ~5 hours of footage. The data consists of annotated mouth-region image sequences in .jpg format, segmented per spoken word with aligned transcripts and metadata. Samples are organized into train, validation, and test splits for both subsets, with strong baseline results provided using deep neural network architectures. The dataset is anonymized, built from freely distributable sources, and supports transfer learning for general lip reading tasks. LRRo targets researchers in visual speech recognition, especially those focused on low-resource languages and transfer learning applications.
#7
DivFusion - Information Fusion for Social Image Retrieval and Diversification
The dataset is intended to be used for assessing the quality of late fusion methods. The use case scenario is that of systems for diversifying image search results in the context of social media. The dataset consists of a set of 672 queries and 240 diversification system outputs and is structured as follows, according to the development/validation/testing procedure: devset (development data) contains two data sets, i.e., devset1 (346 queries and 39 system outputs) and devset2 (60 queries and 56 system outputs); validset (validation data) contains 139 queries with 60 system outputs; testset (testing data) contains two data sets, i.e., seenIR data (63 queries and 56 system outputs, containing the same diversification system outputs as the devset2 data) and unseenIR data (64 queries and 29 system outputs, containing unseen, novel diversification system outputs). All the data consists of redistributable Creative Commons licensed information from Flickr and Wikipedia, as well as content descriptors which are provided on an "as is" basis and were computed according to algorithms from the literature. The dataset was validated during the 2018 ICPR Multimedia Information Processing for Personality and Social Networks Analysis Challenge at the ChaLearn Looking at People Benchmarking Initiative.
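For illustration, one classic late-fusion scheme applicable to such ranked system outputs is reciprocal-rank fusion. The sketch below assumes each system output is a ranked list of image IDs for a query, which is a simplification of the actual data format.

```python
# Reciprocal-rank fusion of several ranked image lists for the same query:
# each image's fused score is the sum of 1/(k + rank) over all input lists.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists; a higher fused score means an earlier output position."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, image_id in enumerate(ranking, start=1):
            scores[image_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([run1, run2, run3])
```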
#8
MMTF-14K - A Multifaceted Movie Trailer Dataset for Recommendation and Retrieval
This dataset is intended to be used for assessing the quality of methods for automatically predicting movie recommendations from movie content. It consists of about 7k clips for roughly 800 unique movies. The development set (devset) provides features computed from 5,562 clips corresponding to 632 unique movies, while the testset provides features for 1,315 clips corresponding to 159 unique movies from the well-known MovieLens 20M dataset. It uses the user ratings from the MovieLens dataset to compute the ground truth, namely the per-movie global average rating and rating variance. The YouTube IDs of the clips are also included in the clip names. Each movie has on average about 8.5 associated clips, a value computed over both the devset and testset. The dataset was validated during the 2018 Recommending Movies Using Content: Which content is key? task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
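Since the ground truth is derived from MovieLens ratings, it can be reproduced in a few lines. The sketch below assumes the standard MovieLens 20M ratings.csv layout and is not the authors' exact script.

```python
# Per-movie global average rating and rating variance, computed from a
# MovieLens-style ratings file (columns: userId,movieId,rating,timestamp).
import pandas as pd

ratings = pd.read_csv("ratings.csv")
ground_truth = ratings.groupby("movieId")["rating"].agg(["mean", "var"])
ground_truth.columns = ["avg_rating", "rating_variance"]
```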
#9
Div150Multidiv - A Social Image Retrieval Result Diversification Dataset with Adhoc Queries and Multiple Expert Annotations
This dataset is designed to support research in the areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results, with explicit focus on the social media context. The dataset consists of redistributable Creative Commons licensed information about general-purpose, multi-topic queries. Each query is represented by up to 300 Flickr photos and their associated social metadata (e.g., title, description, geo-tagging information, number of views, and number of posted comments). The data is partitioned as follows: (1) development data intended for designing and training the approaches (ca. 100 general-purpose, multi-concept queries with 30,000 images); (2) credibility data intended to estimate the global quality of tag-image content relationships for a user's contribution (metadata for ca. 3,000 users); (3) evaluation data intended for the actual benchmark (ca. 100 general-purpose, multi-concept queries with 30,000 images). The dataset was validated during the 2017 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#10
Interestingness10k - A Dataset for Multimedia Interestingness Prediction
The Interestingness10k dataset is designed for the task of predicting multimedia interestingness in images and videos. The data consists of movie excerpts and key-frames together with their corresponding ground-truth files, which provide the classification of samples as interesting or non-interesting and an interestingness score, along with a set of pre-processed descriptors. Also provided is a thorough analysis of the dataset, covering method and feature performance, suggestions for performance enhancement, and many other aspects of interestingness prediction. The dataset is generated from freely distributable data and is anonymized in order to protect users' privacy. It is addressed to researchers interested in the prediction of image and video interestingness. University Politehnica of Bucharest and InterDigital are co-creators of this dataset, along with CSC - IT Center for Science. The dataset was validated during the 2017 Predicting Media Interestingness Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#11
Div150Multi - A Social Image Retrieval Result Diversification Dataset with Multi-topic Queries
This dataset is designed to support research in the areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results, with explicit focus on the social media context. The dataset consists of Creative Commons data for around 153 one-concept Flickr queries with 45,375 images for development, and 139 Flickr queries (69 one-concept, 70 multi-concept) with 41,394 images for testing, together with metadata, Wikipedia pages, and content descriptors for the text and visual modalities. Data is annotated for the relevance and the diversity of the photos. An additional dataset used to train the credibility descriptors (an automatic estimation of the quality, i.e., correctness, of a particular user's tags) provides information for ca. 685 Flickr users and metadata for more than 3.5M images. The dataset was validated during the 2015 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#12
Div150Adhoc - A Social Image Retrieval Result Diversification Dataset with Adhoc Multi-topic Queries
This dataset is designed to support research in the areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results with explicit focus on the social media context. The dataset consists of Creative Commons data for a development set containing 70 queries (20,757 Flickr photos - including 35 multi-topic queries related to events and states associated with locations), a user annotation credibility set containing information for ca. 300 location-based queries and 685 users, a set providing semantic vectors for general English terms computed on top of the English Wikipedia, and a test set containing 65 queries (19,017 Flickr photos). The dataset was validated during the 2016 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#13
This dataset is intended to be used for benchmarking techniques that automatically detect video content depicting violence, or that predict the affective impact video content will have on viewers (valence and arousal). It consists of around 10,000 video clips extracted from about 100-200 movies, both professionally made and amateur productions. The movies are shared under Creative Commons licenses that allow redistribution. The dataset was validated during the 2015 Affective Impact of Movies Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#14
SynPose300 - A 3D Synthetic Dataset for the Evaluation of 3D Human Pose Estimation Techniques
This dataset is designed to support research in the areas of computer vision that foster new technologies for improving the robustness of automatic 3D pose estimation techniques, specifically with respect to variations in (i) anthropometric measurements for male and female genders, (ii) viewing distance and angle of the subject, (iii) performed human actions, and (iv) clothing (e.g., large vs. tight). The dataset comprises 288 videos, each 5 seconds long (at 24 fps), encoded with the Xvid encoder (800x600 resolution, RGB images). The videos were generated using open source software; each is a combination of a human model wearing particular clothes, performing a particular action, and recorded from a particular camera position. The purpose of the dataset is to simulate real people performing specific actions, from specific points of view, in a controllable way and with the benefit of a precise ground truth. The complexity of the actions varies from easy to hard, and each model has different anthropometric measurements. The ground truth provided for each video consists of the 3D global coordinates of each skeleton joint, the camera position (location and orientation), and the camera focal length. With these values one can project from real-world coordinates to camera coordinates, as sketched below. The dataset was created using two open source tools: MakeHuman v.1.0.2 and Blender 2.73a. MakeHuman was used to create the human models, including anthropometric measurements for limbs and clothes (the .mhm files for each type of person are provided with the data). Blender was used to animate the previously created models and to produce the final videos and the ground truth. The tool allows tuning, for each action, the distance from the camera to the subject and the rotation angles of the camera (the .blend files for each video, containing all the data and parameters necessary to render the videos, are provided with the data).
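A minimal sketch of that projection, assuming a simple pinhole model: transform a joint from world coordinates into the camera frame using the camera location and a world-to-camera rotation, then project with the focal length. Names and axis conventions are illustrative and should be checked against the provided .blend files.

```python
# World-to-camera transform followed by pinhole projection, as a sketch.
import numpy as np

def world_to_camera(x_world: np.ndarray, cam_pos: np.ndarray, r_cam: np.ndarray) -> np.ndarray:
    """r_cam: 3x3 rotation from world to camera axes; cam_pos: camera location."""
    return r_cam @ (x_world - cam_pos)

def project(x_cam: np.ndarray, f: float) -> np.ndarray:
    """Pinhole projection of a camera-frame point onto the image plane."""
    return f * x_cam[:2] / x_cam[2]
```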
#15
Div150Cred - A Social Image Retrieval Result Diversification Dataset with User Tagging Credibility Estimation
This dataset is designed to support research in the areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results, with explicit focus on the social media context. The dataset consists of Creative Commons data for 300 landmark locations represented via 45,375 Flickr photos, 16M photo links for around 3,000 users, metadata, Wikipedia pages, and content descriptors for the text and visual modalities. Data is annotated for the relevance and the diversity of the photos. The dataset also includes information about user annotation credibility, determined as an automatic estimation of the quality (correctness) of a particular user's tags. The dataset was validated during the 2014 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#16
This dataset is intended to be used for assessing the quality of methods for detecting violent scenes and/or recognizing high-level, violence-related concepts in movies. It contains violence annotations for 32 Hollywood movies and 86 short web videos downloaded from YouTube. The dataset was validated during the 2014 Affect in Multimedia Task: Violent Scenes Detection at the MediaEval Benchmarking Initiative for Multimedia Evaluation.
#17
SCOUTER
The SCOUTER dataset is designed to support the task of retrieving and tracking specified targets (e.g., people or vehicles) across multiple real-world surveillance cameras, using only a few labeled training examples per subject, a setup aligned with multiple-instance learning. The dataset features approximately 36,000 manually annotated frames captured across 30 video documents (3 separate recording days × 10 cameras) from both indoor and outdoor environments with varied lighting and occlusion conditions. The data is organized into train and test splits: the training split includes only 180 frames (60 positive and 120 negative examples) drawn from a single camera, while the test split comprises the remaining frames across all cameras, making this a challenging few-shot evaluation setting. Each target subject is represented as a "bag" of instances, and retrieval is performed by modeling generative distributions over feature bags using Fisher kernel representations combined with a range of visual descriptors. The dataset is drawn from publicly distributable surveillance data, anonymized to preserve privacy, and accompanied by detailed annotations, feature descriptions, and evaluation protocols. It is intended for researchers working on low-shot object retrieval, person re-identification, and multiple-instance learning in complex video surveillance contexts.
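As a rough illustration of the Fisher kernel idea applied to bags, the sketch below fits a diagonal-covariance GMM on local descriptors and represents each bag by the gradient of its log-likelihood with respect to the GMM means; the normalizations of the full Fisher vector used in the actual system are omitted for brevity.

```python
# Simplified Fisher-vector representation of an instance bag: soft-assign
# descriptors to GMM components and accumulate the mean-gradient statistics.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(bag: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """bag: (n_instances, dim) local descriptors from one target 'bag'."""
    q = gmm.predict_proba(bag)                        # (n, K) soft assignments
    diff = bag[:, None, :] - gmm.means_[None, :, :]   # (n, K, dim)
    grad = (q[:, :, None] * diff / gmm.covariances_[None, :, :]).mean(axis=0)
    return grad.ravel()                               # (K * dim,) representation

# gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(all_descriptors)
```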
#18
Div400 - A Social Image Retrieval Result Diversification Dataset
This dataset is designed to support research in the areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results with explicit focus on the social media context. The dataset consists of Creative Commons data related to 396 landmark locations and contains 43,418 Flickr photos together with their Wikipedia and Flickr metadata and some content descriptor information (visual and text). Data is annotated for the relevance and the diversity of the photos (both expert and crowd annotations are provided). The dataset was validated during the 2013 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.