Updated April 2020

Machine learning datasets

A list of the biggest machine learning datasets from across the web.

Subscribe to get updates when new datasets and tools are released.
Name License
Mapillary Street-Level Sequences (MSLS) is the largest, most diverse dataset for place recognition, containing 1.6 million images in a large number of short sequences.
Research and commercial
Research and commercial licenses available.
2020
The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A database of COVID-19 cases with chest X-ray or CT images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A dataset with 16,756 chest radiography images across 13,645 patient cases. The current COVIDx dataset is constructed from other open source chest radiography datasets.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Open Images V6 expands the annotation of the Open Images dataset with a large set of new visual relationships, human action annotations, and image-level labels. This release also adds localized narratives, a completely new form of multimodal annotations that consist of synchronized voice, text, and mouse traces over the objects being described. In Open Images V6, these localized narratives are available for 500k of its images. It also includes localized narratives annotations for the full 123k images of the COCO dataset.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
A challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles on different days and at different times during 2017-18. Each log in the dataset is time-stamped and contains raw data from all the sensors, calibration values, pose trajectory, ground truth pose, and 3D maps.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
P-DESTRE is a multi-session dataset of videos of pedestrians in outdoor public environments, fully annotated at the frame level.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
A Multi-view Multi-source Benchmark for Drone-based Geo-localization annotates 1652 buildings in 72 universities around the world.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual, and temporal coherence reasoning with knowledge-based questions, which can only be answered with the experience gained from watching the series.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The scenes may contain 4k head counts with over 100× scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
SVIRO is a Synthetic dataset for Vehicle Interior Rear seat Occupancy detection and classification. The dataset consists of 25,000 sceneries across ten different vehicles, with several simulated sensor inputs and ground truth data provided.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
An update to the popular All the News dataset published in 2017. This dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
Not found
License information not found
2020
A novel in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix™ mobile social platform.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
MoVi is the first human motion dataset to contain synchronized pose, body meshes, and video recordings. The dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A large-scale unconstrained crowd counting dataset with 4,372 images and 1.51 million annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation.
Various
The dataset contains data from several sources; check the links on the website for individual licenses.
2020
The first dataset for computer vision research on dressed humans with a specific geometry representation for the clothes. It contains ~2 million images of 40 male and 40 female subjects performing 70 actions.
CC BY-NC 4.0
Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes.
2020
AU-AIR is the first multi-modal UAV dataset for object detection. It brings together vision and robotics for UAVs by providing multi-modal data from different on-board sensors, and pushes forward the development of computer vision and robotic algorithms targeted at autonomous aerial surveillance. It contains more than 2 hours of raw video, 32,823 labelled frames, and 132,034 object instances.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
Open-source dataset for autonomous driving in wintry weather. The CADC dataset aims to promote research to improve self-driving in adverse weather conditions. This is the first public dataset to focus on real-world driving data in snowy weather conditions. It features 56,000 camera images, 7,000 LiDAR sweeps, and 75 scenes of 50-100 frames each.
CC BY-NC 4.0
Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes.
2020
A billion-scale bitext data set for training translation models. CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models with more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A collection of high resolution synthetic overhead imagery for building segmentation. Synthinel-1 consists of 2,108 synthetic images generated in nine distinct building styles within a simulated city. These images are paired with "ground truth" annotations that segment each of the buildings. Synthinel also has a subset dataset called Synth-1, which contains 1,640 images spread across six styles.
Not found
License information not found
2020
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora.
Apache
Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
2020
Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
The inD dataset is a new dataset of naturalistic vehicle trajectories recorded at German intersections. Recording with a drone overcomes typical limitations of established traffic data collection methods, such as occlusions. Traffic was recorded at four different locations. The trajectory and type of each road user are extracted.
Research and commercial
Research and commercial licenses available.
2019
ImageMonkey is a free, public, open source dataset. ImageMonkey provides a platform where users can drop their photos, tag them with a label, and put them into the public domain. It contains over 100,000 images.
CC-0
CC-0 - No Copyright
2019
A new dataset for natural language based fashion image retrieval. Unlike previous fashion datasets, we provide natural language annotations to facilitate the training of interactive image retrieval systems, as well as the commonly used attribute based labels.
CDLA
The CDLA agreement is similar to permissive open source licenses in that the publisher of data allows anyone to use, modify and do what they want with the data with no obligations to share any of their changes or modifications.
2019
TVQA is a large-scale video QA dataset based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. TVQA+ contains 310.8k bounding boxes, linking depicted objects to visual concepts in questions and answers.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
We leverage a simulated driving environment to create a dataset for anomaly segmentation, which we call StreetHazards. It contains 5,125 training images and 1,500 test images with 250 anomaly types.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.
Not found
License information not found
2019
ObjectNet is a large real-world test set for object recognition with controls, where object backgrounds, rotations, and imaging viewpoints are random. It was collected to intentionally show objects from new viewpoints on new backgrounds. 50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint. 313 object classes with 113 overlapping ImageNet
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2019
You can find more datasets at the UCI machine learning repository and Kaggle datasets.
© 2020 Nikola Plesa
hello@datasetlist.com