Updated December 2021

Machine learning datasets

A list of machine learning datasets from across the web.


A multitask benchmarking framework comprising complementary data modalities at city scale, registered across different representations and enriched with human- and machine-generated annotations: 27,745 high-resolution 360° images with human-curated annotations; 3D point clouds from aerial and street-level LiDAR and from Structure-from-Motion and Multi-view Stereo reconstructions, geo-referenced using high-precision, survey-grade ground control points; full aerial image coverage at 7.5 cm/px resolution; and manually labeled 2D/3D object annotations for up to 39 semantic categories.
A dataset of building footprints to support social good applications. The dataset contains 516M building detections across an area of 19.4M km² (64% of the African continent).
Facebook AI and Matterport have collaborated on the release of the largest-ever 3D dataset of indoor spaces, made up of accurately scaled residential and commercial interiors. The dataset consists of 3D meshes and textures of 1,000 Matterport spaces.
The Unsplash Dataset is created by 250,000+ contributing photographers and billions of searches across thousands of applications, uses, and contexts. The Lite version has 25,000 images; the Full version has 3,000,000+ images.
A large-scale dataset of 3D building models containing 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models.
A photorealistic synthetic dataset for holistic indoor scene understanding. 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
An ImageNet replacement for self-supervised pretraining without humans. PASS contains 1.4 million distinct images.
A dataset of Amazon products with metadata, catalog images, and 3D models. 147,702 products and 398,212 unique catalog images in high resolution.
Unlimited Road-scene Synthetic Annotation (URSA) Dataset, a synthetic dataset containing upwards of 1,000,000 images.
EDFace-Celeb-1M, a million-scale face image dataset for benchmarking face hallucination (face super-resolution): https://github.com/HDCVLab/EDFace-Celeb-1M
The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones, and ambient lighting conditions. Casual Conversations is composed of over 45,000 videos (3,011 participants) and is intended to be used for assessing the performance of already trained models.
A large dataset aimed at teaching AI to code, it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. 25,000 images pixel-accurately labeled into 152 object categories, 100 of those instance-specific.
The podcast dataset contains about 100k podcasts, filtered to English-language content using the creators' language tags as well as a language filter applied to the creator-provided title and description.
With object trajectories and corresponding 3D maps for over 100,000 segments, each 20 seconds long and mined for interesting interactions, this motion dataset contains more than 570 hours of unique data.
TextOCR provides ~1M high-quality word annotations on TextVQA images, enabling end-to-end reasoning on downstream tasks such as visual question answering or image captioning.
Contains spoken English commands for setting timers, setting alarms, unit conversions, and simple math. The dataset contains ~2,200 spoken audio commands from 95 speakers, representing 2.5 hours of continuous audio.
CaseHOLD contains 53,000 multiple-choice questions, each with a prompt from a judicial decision and multiple potential holdings that could be cited, exactly one of which is correct.
Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts, manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses considered important in contract review in connection with corporate transactions such as mergers & acquisitions.
WebFace260M is a new million-scale face benchmark, constructed to help the research community close the data gap with industry.
A billion-word corpus of Danish text, freely distributed with attribution.
Self-driving
The ONCE dataset is a large-scale autonomous driving dataset with 2D & 3D object annotations. It includes 1 million LiDAR frames and 7 million camera images.
CV
Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods under adverse visual conditions. It comprises 4,006 images evenly distributed across fog, nighttime, rain, and snow.
CV
A dataset of sky images and their irradiance values. The SkyCam dataset is a collection of sky images from a variety of locations with diverse topological characteristics (Swiss Jura, Plateau, and Pre-Alps regions), captured in both single- and stereo-camera settings coupled with high-accuracy pyranometers. The dataset was collected at high frequency, with a data sample every 10 seconds.
A dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.
A dataset of “in the wild” portrait videos. The videos are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations, totaling 142 videos, 32 minutes, and 17 GB.
Self-driving
A novel dataset covering seasonal and challenging perceptual conditions for autonomous driving.
This dataset contains 11,842,186 computer-generated building footprints in all Canadian provinces and territories.
Medical
MedMNIST, a collection of 10 pre-processed open medical datasets. MedMNIST is standardized for classification tasks on lightweight 28×28 images and requires no medical background knowledge.
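A minimal loading sketch, assuming the medmnist pip package; the class name and arguments follow its documented interface but may differ between package versions.

from medmnist import PathMNIST  # PathMNIST is one of the 28x28 MedMNIST collections

train = PathMNIST(split="train", download=True)  # downloads the pre-processed archive on first use
image, label = train[0]                          # a 28x28 PIL image and its label array
print(len(train), image.size, label)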
CV
Cube++ is a novel dataset collected for computational color constancy. It has 4,890 raw 18-megapixel images, each containing a SpyderCube color target in its scene, along with manually labelled categories and ground-truth illumination chromaticities.
Large-scale Person Re-ID Dataset. SYSU-30k contains 29,606,918 images.
Smithsonian Open Access, where you can download, share, and reuse millions of the Smithsonian’s images right now, without asking. With new platforms and tools, you have easier access to more than 3 million 2D and 3D digital items.
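The collection can also be queried programmatically with the requests library; the endpoint, parameters, and response fields below are assumptions based on the public Smithsonian Open Access API documentation, and a free api.data.gov key is required.

import requests

API_KEY = "YOUR_API_DATA_GOV_KEY"  # placeholder: request a free key from api.data.gov

# Search the open-access collection for items matching a keyword (assumed endpoint and parameters).
resp = requests.get(
    "https://api.si.edu/openaccess/api/v1.0/search",
    params={"q": "aircraft", "rows": 5, "api_key": API_KEY},
)
resp.raise_for_status()

# Print the titles of the first few matching records (assumed response shape).
for row in resp.json().get("response", {}).get("rows", []):
    print(row.get("title"))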
The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point clouds, and characterization of the planar surfaces in the surrounding environment. Includes 15,000 annotated videos and 4M annotated images.
Medical
MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. It consists of 217,060 figures from 131,410 open-access papers; 7,507 subcaption and subfigure annotations for 2,069 compound figures; and inline references for ~25K figures in the ROCO dataset.
CLUE: A Chinese Language Understanding Evaluation Benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
Ruralscapes Dataset for Semantic Segmentation in UAV Videos. Ruralscapes is a dataset with 20 high quality (4K) videos portraying rural areas.
Fashionpedia is a dataset which consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.
The Social Bias Inference Corpus (SBIC) contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Medical
COVID-19 severity score assessment project and database, containing 4,703 chest X-rays (CXR) of COVID-19 patients.
Medical
MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) based on the dataset Flickr-Faces-HQ (FFHQ).
A holistic dataset for movie understanding. 1.1K movies, 60K trailers.
ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses.
The largest product recognition dataset, containing 10,000 products frequently bought by online customers on JD.com.
CV
HAA500, a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames.
The dataset contains over 16.5k (16,557) fully pixel-level labeled segmentation images.
CV
Human-centric Video Analysis in Complex Events. The HiEve dataset includes the currently largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-duration trajectories (average trajectory length >480).
CV
AViD is a large-scale video dataset with 467k videos and 887 action classes. The collected videos have a Creative Commons license.
You can find more datasets at the UCI Machine Learning Repository, the Quantum Stat NLP database, and Kaggle Datasets.
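As a quick illustration, many UCI datasets can be read straight from the repository with pandas; the sketch below loads the classic Iris dataset, supplying the column names by hand since the raw CSV has no header row.

import pandas as pd

# Raw CSV hosted by the UCI Machine Learning Repository (no header row).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

df = pd.read_csv(url, header=None, names=cols)
print(df.shape)                      # (150, 5)
print(df["species"].value_counts())  # 50 samples per class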
Subscribe to get updates when new datasets and tools are released.