Datasets for machine learning

A list of the biggest datasets for machine learning from across the web.
Email me at hello@datasetlist.com with questions, suggestions and ideas.
Name License
Natural Questions (NQ) is a new, large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ consists of 300,000 naturally occurring questions, along with human-annotated answers from Wikipedia pages, to be used in training QA systems. It additionally includes 16,000 examples where answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems.
CC-BY-SA 3.0
Attribution-ShareAlike 3.0 Unported - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
Mozilla crowdsources the largest dataset of human voices available for use, including 18 different languages, adding up to almost 1,400 hours of recorded voice data from more than 42,000 contributors.
CC-0
CC-0 - No Copyright
2019
The Diversity in Faces (DiF) is a large and diverse dataset that seeks to advance the study of fairness and accuracy in facial recognition technology. The first of its kind available to the global research community, DiF provides a dataset of annotations of 1 million human facial images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
GQA
The dataset consists of 22M questions about various day-to-day images. Each image is associated with a scene graph of the image's objects, attributes and relations, a new cleaner version based on Visual Genome.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2019
MIMIC-CXR is a large, publicly available database comprising de-identified chest radiographs from patients admitted to the Beth Israel Deaconess Medical Center between 2011 and 2016. The dataset contains 371,920 chest x-rays associated with 227,943 imaging studies. Each imaging study can contain one or more images, but most often is associated with two images: a frontal view and a lateral view. Images are provided with 14 labels derived from a natural language processing tool applied to the corresponding free-text radiology reports.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
CheXpert is a large public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Facebook BISON (Binary Image Selection) dataset complements the COCO Captions dataset. BISON-COCO is not a training dataset, but rather an evaluation dataset that can be used to test existing models’ ability for pairing visual content with appropriate text descriptions.
Not found
License information not found
2019
SPEED consists of synthetic as well as actual camera images of a mock-up of the Tango spacecraft from the PRISMA mission. The synthetic images are created by fusing OpenGL-based renderings of the spacecraft’s 3D model with actual images of the Earth captured by the Himawari-8 meteorological satellite. The dataset contains over 12,000 images with a resolution of 1920×1200 pixels.
CC-BY-NC-SA 3.0
Attribution-NonCommercial-ShareAlike 3.0 Unported - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN). The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
Danbooru2018 is a large-scale anime image database with 3.33m+ images annotated with 99.7m+ tags; it can be useful for machine learning purposes such as image recognition and generation.
Not found
License information not found
2019
Open Images is a dataset of ~9 million URLs to images that have been annotated with image-level labels and bounding boxes spanning thousands of classes.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2018
Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding. It contains: 290k multiple choice questions; 290k correct answers and rationales (one per question); 110k images; and counterfactual choices obtained with minimal bias via our new Adversarial Matching approach. Answers are 7.5 words on average; rationales are 16 words. Human agreement is high (>90%). The dataset is scaffolded on top of 80 object categories from COCO.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2018
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2018
Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
CC-BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2018
The dataset contains over 100k videos of driving experience, each running 40 seconds at 30 frames per second. The total image count is 800 times larger than Baidu ApolloScape (released March 2018), 4,800 times larger than Mapillary and 8,000 times larger than KITTI.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2018
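As a rough check on the scale of the driving-video dataset described above (over 100k videos, each 40 seconds long at 30 frames per second), the frame count can be worked out directly. This is only a sketch: it assumes exactly 100,000 videos, which gives a lower bound, and the variable names are illustrative.

```python
# Lower-bound frame count for the driving-video dataset described above.
# Assumption: exactly 100,000 videos (the entry says "over 100k").
num_videos = 100_000
seconds_per_video = 40
frames_per_second = 30

frames_per_video = seconds_per_video * frames_per_second
total_frames = num_videos * frames_per_video

print(f"{frames_per_video} frames per video")
print(f"{total_frames:,} frames in total (lower bound)")
```

Under these assumptions each video contributes 1,200 frames, so the full set holds at least 120 million frames.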
Situations With Adversarial Generations is a large-scale dataset for the task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning. The dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. The correct answer is the (real) video caption for the next event in the video; the three incorrect answers are adversarially generated and human verified, so as to fool machines but not humans.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2018
The highD dataset is a new dataset of naturalistic vehicle trajectories recorded on German highways. Recording from a drone gives an aerial perspective that overcomes typical limitations of established traffic data collection methods, such as occlusions. Traffic was recorded at six different locations and includes more than 110 500 vehicles.
Non-commercial & commercial
Non-commercial and commercial licenses available
2018
comma.ai presents comma2k19, a dataset of over 33 hours of commute on California's 280 highway. This means 2019 segments, 1 minute long each, on a 20km section of highway driving between California's San Jose and San Francisco. comma2k19 is a fully reproducible and scalable dataset.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2018
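The "over 33 hours" figure in the comma2k19 entry above follows from its segment count. A minimal sketch of the conversion, assuming every one of the 2019 segments lasts exactly one minute (variable names are illustrative):

```python
# Convert the comma2k19 segment count into total recording time.
# Assumption: each of the 2019 segments is exactly 1 minute long.
segments = 2019
minutes_per_segment = 1

total_minutes = segments * minutes_per_segment
hours, minutes = divmod(total_minutes, 60)

print(f"{total_minutes} minutes = {hours} h {minutes} min")
```

That works out to 33 hours and 39 minutes, consistent with the description.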
An autonomous driving dataset and benchmark for optical flow. More than 1000 frames at 2560x1080 with diverse lighting and weather scenarios, reference data with error bars for optical flow, evaluation masks for dynamic objects, specific robustness evaluation on challenging scenes. The dataset includes: 110 500 vehicles, 44 500 driven kilometers, 147 driven hours.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2018
VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision and language. It contains 265,016 images (COCO and abstract scenes), at least 3 questions (5.4 questions on average) per image, 10 ground truth answers per question.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2018
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.
Various
The majority of the corpus is released under the OANC’s license, which allows all content to be freely used, modified, and shared under permissive terms. The data in the FICTION section falls under several permissive licenses; Seven Swords is available under a Creative Commons Share-Alike 3.0 Unported License, and with the explicit permission of the author, Living History and Password Incorrect are available under Creative Commons Attribution 3.0 Unported Licenses; the remaining works of fiction are in the public domain in the United States (but may be licensed differently elsewhere).
2018
ApolloScape is an order of magnitude bigger and more complex than existing similar datasets such as KITTI and CityScapes. ApolloScape offers 10 times more high-resolution images with pixel-by-pixel annotations, and includes 26 different recognizable objects such as cars, bicycles, pedestrians and buildings. The dataset offers several levels of scene complexity with increasing numbers of pedestrians and vehicles, up to 100 vehicles in a given scene, as well as a wider set of challenging environments such as heavy weather or extreme lighting conditions.
Non-commercial
All photos can only be used for educational purpose by individuals or organizations. Commercial use or other violations of copyright law are not permitted.
2018
The nuScenes dataset is a large-scale autonomous driving dataset. It features: ● Full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS) ● 1000 scenes of 20s each ● 1,440,000 camera images ● 400,000 lidar sweeps ● Two diverse cities: Boston and Singapore
CC BY-NC-SA 4.0 or commercial
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2018
MURA (musculoskeletal radiographs) is a large dataset of bone X-rays that can be used to train algorithms tasked with detecting abnormalities in X-rays. MURA is believed to be the world’s largest public radiographic image dataset with 40,561 labeled images.
Non-commercial
Stanford University School of Medicine MURA Dataset Research Use Agreement (see website for license)
2018
A photorealistic synthetic dataset for street scene parsing. The images in the dataset do not follow a driven path through a single virtual world. Instead, an entirely unique scene was procedurally generated for each of the 25,000 images. As a result, the dataset contains a wide range of variations and unique combinations of features.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2018
CoQA is a large-scale dataset for building Conversational Question Answering systems. CoQA contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.
Various
CoQA contains passages from seven domains. We make five of these public under the following licenses: Literature and Wikipedia passages are shared under CC BY-SA 4.0 license. Children's stories are collected from MCTest which comes with MSR-LA license. Middle/High school exam passages are collected from RACE which comes with its own license. News passages are collected from the DeepMind CNN dataset which comes with Apache license.
2018
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset. Spider consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
CC BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2018
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. The dataset is composed of 113,000 QA pairs based on Wikipedia.
CC BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2018
Tencent ML-Images is the largest open-source multi-label image dataset, including 17,609,752 training and 88,739 validation image URLs which are annotated with up to 11,166 categories.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2018
A collaborative research project from Facebook AI Research (FAIR) and NYU Langone Health to investigate the use of AI to make MRI scans up to 10 times faster. The dataset includes more than 1.5 million anonymous MRI images of the knee, drawn from 10,000 scans, and raw measurement data from nearly 1,600 scans.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2018
The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. 25,000 images pixel-accurately labeled into 152 object categories, 100 of those instance-specific.
Research or commercial
Research and commercial licenses available.
2017
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located. You can browse the recognized drawings on quickdraw.withgoogle.com/data.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2017
Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5000 to 30,000 training images per class, consistent with real-world frequencies of occurrence.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2017
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. It contains data from 7,000+ speakers, 1 million+ utterances, 2,000+ hours. VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.
CC BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2017
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2017
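The Fashion-MNIST figures above (60,000 training and 10,000 test examples, each a 28x28 grayscale image) translate into a raw pixel footprint as sketched below. The one-byte-per-pixel storage is an assumption made for this estimate, and the variable names are illustrative; compressed downloads will be smaller.

```python
# Rough uncompressed size of the Fashion-MNIST pixel data.
# Assumption: each grayscale pixel is stored as one byte (8-bit).
height, width = 28, 28
train_examples = 60_000
test_examples = 10_000
bytes_per_pixel = 1

bytes_per_image = height * width * bytes_per_pixel
train_bytes = train_examples * bytes_per_image
test_bytes = test_examples * bytes_per_image

print(f"{bytes_per_image} bytes per image")
print(f"train: {train_bytes:,} bytes, test: {test_bytes:,} bytes")
```

Under that assumption the pixel data for both splits is small enough to hold in memory on an ordinary laptop.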
YouTube-BoundingBoxes is a large-scale data set of video URLs with densely-sampled high-quality single-object bounding box annotations. The data set consists of approximately 380,000 15-20s video segments extracted from 240,000 different publicly visible YouTube videos, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2017
Reddit Comments from 2005-12 to 2017-03. Downloaded from https://files.pushshift.io/comments.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2017
A large-scale and high-quality dataset of annotated musical notes. The NSynth Dataset is an audio dataset containing ~300k musical notes, each with a unique pitch, timbre, and envelope. Each note is annotated with three additional pieces of information based on a combination of human evaluation and heuristic algorithms: the method of sound production for the note's instrument, the high-level family of which the note's instrument is a member and sonic qualities of the note.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2017
A dataset for scene parsing. There are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. All the images are exhaustively annotated with objects. Many objects are also annotated with their parts. For each object there is additional information about whether it is occluded or cropped, and other attributes.
Not found
License information not found
2017
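The split sizes quoted for the scene parsing dataset above (20,210 training, 2,000 validation and 3,000 testing images) can be turned into proportions with a few lines. This is a sketch only; the dictionary keys are illustrative labels, not names used by the dataset.

```python
# Share of each split in the scene parsing dataset described above.
# The counts come from the entry; the key names are illustrative.
split_sizes = {"train": 20_210, "validation": 2_000, "test": 3_000}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]:,} images ({fraction:.1%})")
print(f"total: {total:,} images")
```

Roughly four fifths of the images fall in the training split, with the remainder divided between validation and testing.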
A dataset of questions from Quora aimed at determining if pairs of question text actually correspond to semantically equivalent queries. Over 400,000 lines of potential question duplicate pairs.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2017
The Yelp dataset contains data about businesses, reviews, and user data for use in personal, educational, and academic purposes. Available in both JSON and SQL files.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2017
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
CC-BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2017
There are about 208 000 jokes in this database scraped from three sources (reddit, stupidstuff.org, wocka.com).
Various
Parts of the dataset could be under different licenses; check the dataset web page for more information.
2017
The main focus of this dataset is testing. It contains data recorded under real world driving situations. Its aims are: to compile and provide standard data which can be used for evaluation; to establish accepted evaluation protocols, data and measures; and to boost algorithm development on driving applications using computer vision techniques. The WildDash dataset does not offer enough material to train algorithms by itself.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2017
The Oxford RobotCar Dataset contains over 100 repetitions of a consistent route through Oxford, UK, captured over a period of over a year. The dataset captures many different combinations of weather, traffic and pedestrians, along with longer term changes such as construction and roadworks.
CC BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2017
A set of datasets for automatic text understanding and reasoning.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2017
Recipe1M is a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data.
Not found
License information not found
2017
The MF2 training dataset is the largest (in number of identities) publicly available facial recognition dataset, with 4.7 million faces, 672K identities, and their respective bounding boxes. All images were obtained from Flickr (Yahoo's dataset) and are licensed under Creative Commons.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2016
MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more. The latest version of MIMIC is MIMIC-III v1.4, which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates. The data spans June 2001 - October 2012. The database, although de-identified, still contains detailed information regarding the clinical care of patients, so must be treated with appropriate care and respect.
Not found
License information not found
2016
It provides pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection, and also for geometric computer vision problems such as optical flow, depth estimation, camera pose estimation, and 3D reconstruction. A set of 5M rendered RGB-D images from over 15K trajectories in synthetic layouts with random but physically simulated object poses.
GPL
GPL - You are free to: copy, distribute and modify the software as long as you track changes/dates in source files. Under the following terms: any modifications to or software including (via compiler) GPL-licensed code must also be made available under the GPL along with build & install instructions.
2016
Contains about 10M images for 100K celebrities. A training and benchmark testing dataset for the following task: recognizing one million celebrities from their face images and linking them to the corresponding entity keys in a knowledge base.
MSR-LA
Can only be used for research and educational purposes. Commercial use is prohibited.
2016
Microsoft Machine Reading Comprehension (MS MARCO) is a new large scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer. It contains 1,010,916 user queries and 182,669 natural language answers.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2016
The SYNTHetic collection of Imagery and Annotations is a dataset that has been generated with the purpose of aiding semantic segmentation and related scene understanding problems in the context of driving scenarios. SYNTHIA consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations. It contains: +200,000 HD images from video streams and +20,000 HD images from independent snapshots. Scene diversity: European style town, modern city, highway and green areas. Variety of dynamic objects: cars, pedestrians and cyclists.
CC BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2016
The purpose of the NewsQA dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills. Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.
Various
Parts of the dataset are under different licenses; check the dataset web page for more information.
2016
The dataset contains 367,888 face annotations for 8,277 subjects divided into 3 batches. It contains bounding boxes, the estimated pose (yaw, pitch, and roll), locations of twenty-one keypoints, and gender information generated by a pre-trained neural network. The second part contains 3,735,476 annotated video frames extracted from a total of 22,075 videos for 3,107 subjects.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2016
7 and a quarter hours of largely highway driving.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2016
SpaceNet is an online repository of freely available satellite imagery, co-registered map data to train algorithms, and a series of public challenges designed to accelerate innovation in machine learning using geospatial data. This first-of-its-kind open innovation project for the geospatial industry is a collaboration between CosmiQ Works, DigitalGlobe and NVIDIA. In the first year, over 5,700 km² of very high-resolution imagery and more than 520,000 vectors were released through SpaceNet on AWS.
Various
Parts of the dataset are under different licenses; check the dataset web page for more information.
2016
The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing the entire cars and 27,618 images capturing the car parts. The full car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2015
ShapeNet is an ongoing effort to establish a richly-annotated, large-scale dataset of 3D shapes. ShapeNet is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them being nouns (80,000+).
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2015
WIDER FACE dataset is a face detection benchmark dataset, of which images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion as depicted in the sample images. WIDER FACE dataset is organized based on 61 event classes.
Not found
License information not found
2015
WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and around 50,574 images annotated with event class labels. We provide a split of 50% for training and 50% for testing.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2015
LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories.
Not found
License information not found
2015
Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language. It contains: 108,077 images, 5.4 million region descriptions, 1.7 million visual question answers, and 3.8 million object instances.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2015
An extensive set of eight datasets for text classification. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG. Sample size of 120K to 3.6M, ranging from binary to 14 class problems.
Various
Parts of the dataset are under different licenses, check the dataset web page for more information
2015
Two datasets using news articles for Q&A research. Each dataset contains many documents (90k and 197k respectively), and each document is accompanied by approximately 4 questions on average. Each question is a sentence with one missing word/phrase which can be found in the accompanying document/context.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2015
Large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2015
ActivityNet is a new large-scale video benchmark for human activity understanding. ActivityNet aims at covering a wide range of complex human activities that are of interest to people in their daily living. In its current version, ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours.
Not found
License information not found
2015
Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2015
The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).
CC BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions.
2015
COCO is a large-scale object detection, segmentation, and captioning dataset. It contains: 330K images (>200K labeled), 1.5 million object instances, 80 object categories.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2014
This dataset contains a list of photos and videos. This list is compiled from data available on Yahoo! Flickr. All the photos and videos provided in the list are licensed under one of the Creative Commons copyright licenses.
Various
Parts of the dataset are under different licenses, check the dataset web page for more information
2014
This dataset is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL object detection task by providing segmentation masks for each body part of the object.
Not found
License information not found
2014
An image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images. This is an extension of the Flickr 8k Dataset. The new images and captions focus on people involved in everyday activities and events.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2014
We introduce a challenging data set of 101 food categories, with 101,000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
Not found
License information not found
2014
A novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, 6 hours of traffic scenarios recorded at 10-100 Hz. The scenarios are diverse, capturing real-world traffic situations, and range from freeways over rural areas to inner-city scenes with many static and dynamic objects.
CC BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions.
2013
The Stanford Cars dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class split roughly 50-50. Classes are typically at the level of Make, Model, Year, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.
Not found
License information not found
2013
The purpose of the project is to make available a standard training and test setup for language modeling experiments.
Not found
License information not found
2013
A dataset for sentiment analysis that includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality.
Not found
License information not found
2013
The German Traffic Sign Benchmark is a multi-class, single-image classification challenge held at the IJCNN 2011. The dataset contains more than 40 classes and more than 50,000 images in total.
Not found
License information not found
2012
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2011
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.
Not found
License information not found
2011
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download.
WordNet license
WordNet® is unencumbered, and may be used in commercial applications in accordance with the following license agreement. (see website for license)
2010
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
Not found
License information not found
2009
ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2009
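Several entries above quote per-class split sizes or train/test counts. As a minimal sketch, the Python below cross-checks that arithmetic using only figures copied from those entries. The helper `split_total` and the variable names are hypothetical labels chosen for this illustration, not official dataset identifiers or part of any dataset's tooling.

```python
# Figures are copied from the entries above; names are informal labels
# chosen for this sketch, not official dataset identifiers.

def split_total(per_class_train, per_class_test, num_classes):
    """Total images implied by a fixed per-class train/test split."""
    return (per_class_train + per_class_test) * num_classes

# CIFAR-100 entry: 500 training + 100 testing images for each of 100 classes.
cifar100_total = split_total(500, 100, 100)

# Food categories entry: 750 training + 250 test images for each of 101 classes.
food_total = split_total(750, 250, 101)

# Stanford Cars entry: 8,144 training images + 8,041 testing images.
stanford_cars_total = 8144 + 8041

# Image caption entry: 158,915 captions describing 31,783 images.
captions_per_image = 158915 / 31783

print(cifar100_total)
print(food_total)
print(stanford_cars_total)
print(captions_per_image)
```

The food and Stanford Cars totals agree with the image counts stated in their entries (101,000 and 16,185), and the caption corpus works out to a whole number of captions per image.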
You can find more datasets at the UCI machine learning repository and Kaggle datasets.
© 2019 Nikola Plesa | Privacy