June 2021

Machine learning datasets

A list of machine learning datasets from across the web.

Use this form to add new datasets to the list.

Subscribe to get updates when new datasets and tools are released.
A large dataset aimed at teaching AI to code. It consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones, and ambient lighting conditions. Casual Conversations is composed of over 45,000 videos (3,011 participants) and is intended for assessing the performance of already trained models.
The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. It contains 25,000 images pixel-accurately labeled into 152 object categories, 100 of them instance-specific.
The podcast dataset contains about 100k podcasts, filtered to English-language documents using both the creator-provided language tags and a language filter applied to the creator-provided title and description.
With object trajectories and corresponding 3D maps for over 100,000 segments, each 20 seconds long and mined for interesting interactions, our new motion dataset contains more than 570 hours of unique data.
TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.
Contains spoken English commands for setting timers, setting alarms, unit conversions, and simple math. The dataset contains around 2,200 spoken audio commands from 95 speakers, representing 2.5 hours of continuous audio.
CaseHOLD contains 53,000 multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, which could be cited.
Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts, manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses considered important in contract review in connection with a corporate transaction, such as a merger or acquisition.
WebFace260M is a new million-scale face benchmark, constructed to help the research community close the data gap with industry.
A billion-word corpus of Danish text, freely distributed with attribution.
The ONCE dataset is a large-scale autonomous driving dataset with 2D and 3D object annotations. It includes 1 million LiDAR frames and 7 million camera images.
Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods on adverse visual conditions. It comprises a large set of 4006 images which are evenly distributed between fog, nighttime, rain, and snow.
A dataset of sky images and their irradiance values. The SkyCam dataset is a collection of sky images from a variety of locations with diverse topological characteristics (Swiss Jura, Plateau, and Pre-Alps regions), from both single and stereo camera settings, coupled with high-accuracy pyranometers. The dataset was collected at high frequency, with a data sample every 10 seconds.
A dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.
A dataset of “in the wild” portrait videos. The videos are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations, totaling 142 videos, 32 minutes, and 17 GB.
A novel dataset covering seasonal and challenging perceptual conditions for autonomous driving.
This dataset contains 11,842,186 computer generated building footprints in all Canadian provinces and territories.
MedMNIST, a collection of 10 pre-processed open medical datasets. MedMNIST is standardized for classification tasks on lightweight 28×28 images and requires no background knowledge.
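To illustrate how lightweight the standardized 28×28 format is, here is a minimal sketch of a nearest-centroid baseline. The arrays below are synthetic stand-ins with MedMNIST-like shapes (this is not the official loader; names and data are illustrative only):

```python
import numpy as np

# Synthetic stand-ins for MedMNIST-style data: 28x28 grayscale images
# with integer class labels. Real MedMNIST data has the same shape.
rng = np.random.default_rng(0)
n_classes = 10
train_images = rng.integers(0, 256, size=(500, 28, 28), dtype=np.uint8)
train_labels = rng.integers(0, n_classes, size=500)

# Flatten to 784 features per image and scale to [0, 1] -- at this size,
# even a trivial nearest-centroid baseline runs instantly.
X = train_images.reshape(len(train_images), -1) / 255.0
centroids = np.stack(
    [X[train_labels == c].mean(axis=0) for c in range(n_classes)]
)

def predict(images):
    """Assign each 28x28 image to the class with the nearest centroid."""
    x = images.reshape(len(images), -1) / 255.0
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

preds = predict(train_images)
print(preds.shape)  # one predicted label per image
```

The same few lines work for any of the 10 MedMNIST tasks, which is the point of standardizing them to a common image size.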
Cube++ is a novel dataset collected for computational color constancy. It has 4890 raw 18-megapixel images, each containing a SpyderCube color target in their scenes, manually labelled categories, and ground truth illumination chromaticities.
Large-scale Person Re-ID Dataset. SYSU-30k contains 29,606,918 images.
Smithsonian Open Access, where you can download, share, and reuse millions of the Smithsonian’s images—right now, without asking. With new platforms and tools, you have easier access to more than 3 million 2D and 3D digital items.
The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment. Includes 15000 annotated videos and 4M annotated images.
MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. It consists of 217,060 figures from 131,410 open access papers, 7,507 subcaption and subfigure annotations for 2,069 compound figures, and inline references for ~25K figures in the ROCO dataset.
CLUE: A Chinese Language Understanding Evaluation Benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
Ruralscapes Dataset for Semantic Segmentation in UAV Videos. Ruralscapes is a dataset with 20 high quality (4K) videos portraying rural areas.
Fashionpedia is a dataset which consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.
The Social Bias Inference Corpus (SBIC) contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
COVID-19 severity score assessment project and database, with 4,703 CXRs of COVID-19 patients.
MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) based on the dataset Flickr-Faces-HQ (FFHQ).
A holistic dataset for movie understanding. 1.1K Movies, 60K trailers.
ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses.
The largest product recognition dataset, containing 10,000 products frequently bought by online customers on JD.com.
HAA500, a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames.
The dataset contains 16,557 fully pixel-level labeled segmentation images.
Human-centric Video Analysis in Complex Events. The HiEve dataset includes the currently largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-term trajectories (average trajectory length >480).
AViD is a large-scale video dataset with 467k videos and 887 action classes. The collected videos have a creative-commons license.
GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral.
DoQA is a dataset for accessing Domain Specific FAQs via conversational QA that contains 2,437 information-seeking question/answer dialogues (10,917 questions in total) on three different domains: cooking, travel and movies.
BIMCV-COVID19+: a large annotated dataset of X-ray and CT images of COVID-19 patients. This first iteration of the database includes 1,380 CX, 885 DX, and 163 CT studies.
MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. More than 220,000 object masks in more than 80,000 images.
Violin (VIdeO-and-Language INference), consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows).
ClarQ: a large-scale and diverse dataset for clarification question generation. It consists of ~2M examples distributed across 173 domains of Stack Exchange.
KeypointNet is a large-scale and diverse 3D keypoint dataset that contains 83,231 keypoints and 8,329 3D models from 16 object categories, by leveraging numerous human annotations, based on ShapeNet models.
TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average.
A large-scale video dataset, featuring clips from movies with detailed captions. Over 3,000 diverse movies from a variety of genres, countries and decades.
DDAD (Dense Depth for Autonomous Driving) is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting.
PandaSet combines Hesai’s best-in-class LiDAR sensors with Scale AI’s high-quality data annotation. PandaSet features data collected using a forward-facing LiDAR with image-like resolution (PandarGT) as well as a mechanical spinning LiDAR (Pandar64). The collected data was annotated with a combination of cuboid and segmentation annotation (Scale 3D Sensor Fusion Segmentation). 48,000 camera images and 16,000 LiDAR sweeps.
Dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame. Each video is from the BDD100K dataset.
VGG-Sound is an audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube. 200,000+ videos, 550+ hours, 310+ classes.
We introduce RISE, the first large-scale video dataset for Recognizing Industrial Smoke Emissions. Our dataset contains 12,567 clips with 19 distinct views from cameras on three sites that monitored three different industrial facilities.
A dataset of about 4,000 TLDRs written about AI research papers hosted on the OpenReview publishing platform. SciTLDR includes at least two high-quality TLDRs for each paper.
Yoga-82: A New Dataset for Fine-grained Classification of Human Poses. A dataset for yoga pose classification with 3 level hierarchy based on body pose. It is constructed from web images and consists of 82 yoga poses.
AmbigQA, a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. A dataset covering 14,042 questions from NQ-open.
A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.
Smarthome was recorded in an apartment equipped with 7 Kinect v1 cameras. It contains 31 daily-living activities performed by 18 subjects. The videos were clipped per activity, resulting in a total of 16,115 video samples.
The dataset is built upon the TV drama "Another Miss Oh" and contains 16,191 QA pairs from 23,928 video clips of varying length, with each QA pair belonging to one of four difficulty levels. It provides 217,308 annotated images with rich character-centered annotations.
Mapillary Street-Level Sequences (MSLS) is the largest, most diverse dataset for place recognition, containing 1.6 million images in a large number of short sequences.
The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19.
A database of COVID-19 cases with chest X-ray or CT images.
A dataset with 16,756 chest radiography images across 13,645 patient cases. The current COVIDx dataset is constructed from other open source chest radiography datasets.
Open Images V6 expands the annotation of the Open Images dataset with a large set of new visual relationships, human action annotations, and image-level labels. This release also adds localized narratives, a completely new form of multimodal annotations that consist of synchronized voice, text, and mouse traces over the objects being described. In Open Images V6, these localized narratives are available for 500k of its images. It also includes localized narratives annotations for the full 123k images of the COCO dataset.
A challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles at different days and times during 2017-18. Each log in the dataset is time-stamped and contains raw data from all the sensors, calibration values, pose trajectory, ground truth pose, and 3D maps.
P-DESTRE is a multi-session dataset of videos of pedestrians in outdoor public environments, fully annotated at the frame level.
A multi-view, multi-source benchmark for drone-based geo-localization, annotating 1,652 buildings at 72 universities around the world.
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual, and temporal coherence reasoning with knowledge-based questions, which require familiarity gained from watching the series to answer.
PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. Scenes may contain around 4k heads with over 100× scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups, and 2.9k interactions.
SVIRO is a synthetic dataset for vehicle interior rear-seat occupancy detection and classification. The dataset consists of 25,000 scenes across ten different vehicles and provides several simulated sensor inputs and ground-truth data.
An update to the popular All the News dataset published in 2017. This dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
A novel in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix™ mobile social platform.
MoVi is the first human motion dataset to contain synchronized pose, body meshes and video recordings. Dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data.
A large-scale unconstrained crowd-counting dataset with 4,372 images and 1.51 million annotations. In comparison to existing datasets, it was collected under a variety of diverse scenarios and environmental conditions.
Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation.
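To give a feel for what a QDMR decomposition looks like, here is a hypothetical example (illustrative only, not an actual record from Break): a complex question is rewritten as an ordered list of steps, where "#1", "#2", … refer back to the answers of earlier steps.

```python
# Hypothetical QDMR-style record (not taken from the Break dataset):
# each step is a simple natural-language operation, and "#k" placeholders
# reference the results of earlier steps.
example = {
    "question": "What is the longest river that crosses the largest state?",
    "decomposition": [
        "return states",
        "return the largest of #1",
        "return rivers that cross #2",
        "return the longest of #3",
    ],
}

# Print the steps with the indices that later steps refer back to.
for i, step in enumerate(example["decomposition"], start=1):
    print(f"#{i}: {step}")
```

This step-by-step form is what lets models reason over the question compositionally instead of answering it in one shot.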
The first dataset for computer vision research of dressed humans with a specific geometry representation for the clothes. It contains ~2 million images of 40 male and 40 female subjects performing 70 actions.
The AU-AIR dataset is the first multi-modal UAV dataset for object detection. It brings together vision and robotics for UAVs, with multi-modal data from different on-board sensors, and pushes forward the development of computer vision and robotic algorithms targeted at autonomous aerial surveillance. >2 hours of raw video, 32,823 labelled frames, 132,034 object instances.
Open-source dataset for autonomous driving in wintry weather. The CADC dataset aims to promote research to improve self-driving in adverse weather conditions. This is the first public dataset to focus on real world driving data in snowy weather conditions. It features: 56,000 camera images, 7,000 LiDAR sweeps, 75 scenes of 50-100 frames each.
A billion-scale bitext dataset for training translation models. CCMatrix is the largest dataset of high-quality, web-based bitexts for training translation models, with more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset.
A collection of high resolution synthetic overhead imagery for building segmentation. Synthinel-1 consists of 2,108 synthetic images generated in nine distinct building styles within a simulated city. These images are paired with "ground truth" annotations that segment each of the buildings. Synthinel also has a subset dataset called Synth-1, which contains 1,640 images spread across six styles.
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora.
Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel.
You can find more datasets at the UCI machine learning repository, the Quantum Stat NLP database, and Kaggle datasets.
© 2021 Nikola Plesa | Privacy | Datasets | Annotation tools