Updated
November 2020

Machine learning datasets

A list of the biggest machine learning datasets from across the web.

Name License
The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment. Includes 15000 annotated videos and 4M annotated images.
C-UDA-1.0
Computational Use of Data Agreement (C-UDA) - data assembled from lawfully accessed, publicly available sources, to be used for computational analysis.
2020
Medical
MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Consists of 217,060 figures from 131,410 open access papers; 7,507 subcaption and subfigure annotations for 2,069 compound figures; and inline references for ~25K figures in the ROCO dataset.
CC-BY-NC-ND 4.0
Attribution-NonCommercial-NoDerivs 4.0 International - You are free to: Share - copy and redistribute. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; NoDerivs - if you remix, transform, or build upon the material, you may not distribute the modified material.
2020
CLUE: A Chinese Language Understanding Evaluation Benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
Various
The dataset contains data from several sources; check the links on the website for individual licenses.
2020
Ruralscapes Dataset for Semantic Segmentation in UAV Videos. Ruralscapes is a dataset with 20 high quality (4K) videos portraying rural areas.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
Fashionpedia is a dataset which consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
Social Bias Inference Corpus (SBIC) contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Medical
COVID19 severity score assessment project and database. 4703 CXR of COVID19 patients.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Medical
MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) based on the dataset Flickr-Faces-HQ (FFHQ).
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
A holistic dataset for movie understanding. 1.1K Movies, 60K trailers.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
The largest product recognition dataset, containing 10,000 products frequently bought by online customers on JD.com.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
CV
HAA500, a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
The dataset contains over 16.5k (16557) fully pixel-level labeled segmentation images.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
CV
Human-centric Video Analysis in Complex Events. The HiEve dataset includes the currently largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-term trajectories (average trajectory length >480).
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
CV
AViD is a large-scale video dataset with 467k videos and 887 action classes. The collected videos have a creative-commons license.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2020
GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
QA
DoQA is a dataset for accessing Domain Specific FAQs via conversational QA that contains 2,437 information-seeking question/answer dialogues (10,917 questions in total) on three different domains: cooking, travel and movies.
CC-BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
Medical
BIMCV-COVID19+: a large annotated dataset of RX and CT images of COVID19 patients. This first iteration of the database includes 1380 CX, 885 DX and 163 CT studies.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
CV
MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. More than 220,000 object masks in more than 80,000 images.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
CV
Violin (VIdeO-and-Language INference), consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows).
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
QA
ClarQ: A large-scale and diverse dataset for Clarification Question Generation. Consists of ~2M examples distributed across 173 domains of stackexchange.
CC BY-NC 4.0
Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes.
2020
KeypointNet is a large-scale and diverse 3D keypoint dataset that contains 83,231 keypoints and 8,329 3D models from 16 object categories, by leveraging numerous human annotations, based on ShapeNet models.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
CV
TAO
TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A large-scale video dataset, featuring clips from movies with detailed captions. Over 3,000 diverse movies from a variety of genres, countries and decades.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Self-driving
DDAD (Dense Depth for Autonomous Driving) is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
Self-driving
PandaSet combines Hesai’s best-in-class LiDAR sensors with Scale AI’s high-quality data annotation. PandaSet features data collected using a forward-facing LiDAR with image-like resolution (PandarGT) as well as a mechanical spinning LiDAR (Pandar64). The collected data was annotated with a combination of cuboid and segmentation annotation (Scale 3D Sensor Fusion Segmentation). 48,000 camera images and 16,000 LiDAR sweeps.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
Dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame. Each video is from the BDD100K dataset.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Audio
VGG-Sound is an audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube. 200,000+ videos, 550+ hours, 310+ classes.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
We introduce RISE, the first large-scale video dataset for Recognizing Industrial Smoke Emissions. Our dataset contains 12,567 clips with 19 distinct views from cameras on three sites that monitored three different industrial facilities.
Not found
License information not found
2020
NLP
A dataset of almost 4,000 TLDRs written about AI research papers hosted on the 'OpenReview' publishing platform. SciTLDR includes at least two high-quality TLDRs for each paper.
Apache
Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
2020
CV
Yoga-82: A New Dataset for Fine-grained Classification of Human Poses. A dataset for yoga pose classification with 3 level hierarchy based on body pose. It is constructed from web images and consists of 82 yoga poses.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
QA
AmbigQA, a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. A dataset covering 14,042 questions from NQ-open.
CC-BY-SA 3.0
Attribution-ShareAlike 3.0 Unported - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Smarthome has been recorded in an apartment equipped with 7 Kinect v1 cameras. It contains 31 daily living activities and 18 subjects. The videos were clipped per activity, resulting in a total of 16,115 video samples.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
QA
Dataset is built upon the TV drama "Another Miss Oh" and it contains 16,191 QA pairs from 23,928 various length video clips, with each QA pair belonging to one of four difficulty levels. We provide 217,308 annotated images with rich character-centered annotations.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Mapillary Street-Level Sequences (MSLS) is the largest, most diverse dataset for place recognition, containing 1.6 million images in a large number of short sequences.
Research and commercial
Research and commercial licenses available.
2020
Medical
The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Medical
A database of COVID-19 cases with chest X-ray or CT images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Medical
A dataset with 16,756 chest radiography images across 13,645 patient cases. The current COVIDx dataset is constructed from other open source chest radiography datasets.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
Open Images V6 expands the annotation of the Open Images dataset with a large set of new visual relationships, human action annotations, and image-level labels. This release also adds localized narratives, a completely new form of multimodal annotations that consist of synchronized voice, text, and mouse traces over the objects being described. In Open Images V6, these localized narratives are available for 500k of its images. It also includes localized narratives annotations for the full 123k images of the COCO dataset.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
A challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles at different days and times during 2017-18. Each log in the dataset is time-stamped and contains raw data from all the sensors, calibration values, pose trajectory, ground truth pose, and 3D maps.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
P-DESTRE is a multi-session dataset of videos of pedestrians in outdoor public environments, fully annotated at the frame level.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
A Multi-view Multi-source Benchmark for Drone-based Geo-localization annotates 1652 buildings in 72 universities around the world.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning with knowledge-based questions, which can only be answered with the experience gained from watching the series.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
CV
PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The scenes may contain 4k head counts with over 100× scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
CV
SVIRO is a Synthetic dataset for Vehicle Interior Rear seat Occupancy detection and classification. The dataset consists of 25,000 sceneries across ten different vehicles, with several simulated sensor inputs and ground truth data provided.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2020
An update to the popular All the News dataset published in 2017. This dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
Not found
License information not found
2020
A novel in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix™ mobile social platform.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
CV
MoVi is the first human motion dataset to contain synchronized pose, body meshes and video recordings. Dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A large-scale unconstrained crowd counting dataset. A comprehensive dataset with 4,372 images and 1.51 million annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
QA
Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation.
Various
The dataset contains data from several sources; check the links on the website for individual licenses.
2020
First dataset for computer vision research of dressed humans with specific geometry representation for the clothes. It contains ~2 Million images with 40 male/40 female performing 70 actions.
CC BY-NC 4.0
Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes.
2020
CV
AU-AIR is the first multi-modal UAV dataset for object detection. It brings together vision and robotics for UAVs by providing multi-modal data from different on-board sensors, and pushes forward the development of computer vision and robotic algorithms targeted at autonomous aerial surveillance. Contains >2 hours of raw video, 32,823 labelled frames, and 132,034 object instances.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2020
Open-source dataset for autonomous driving in wintry weather. The CADC dataset aims to promote research to improve self-driving in adverse weather conditions. This is the first public dataset to focus on real world driving data in snowy weather conditions. It features: 56,000 camera images, 7,000 LiDAR sweeps, 75 scenes of 50-100 frames each.
CC BY-NC 4.0
Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes.
2020
NLP
A billion-scale bitext data set for training translation models. CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models with more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
A collection of high resolution synthetic overhead imagery for building segmentation. Synthinel-1 consists of 2,108 synthetic images generated in nine distinct building styles within a simulated city. These images are paired with "ground truth" annotations that segment each of the buildings. Synthinel also has a subset dataset called Synth-1, which contains 1,640 images spread across six styles.
Not found
License information not found
2020
QA
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora.
Apache
Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
2020
Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2020
The inD dataset is a new dataset of naturalistic vehicle trajectories recorded at German intersections. Using a drone, typical limitations of established traffic data collection methods like occlusions are overcome. Traffic was recorded at four different locations. The trajectory for each road user and its type is extracted.
Research and commercial
Research and commercial licenses available.
2019
Generated human image dataset. We provide our generated images and make a large-scale synthetic dataset called DG-Market. This dataset is generated by our DG-Net and consists of 128,307 images (613MB), about 10 times larger than the training set of original Market-1501.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
ImageMonkey is a free, public open source dataset. ImageMonkey provides a platform where users can drop their photos, tag them with a label, and put them into public domain. Contains over 100,000 images.
CC-0
CC-0 - No Copyright
2019
A new dataset for natural language based fashion image retrieval. Unlike previous fashion datasets, we provide natural language annotations to facilitate the training of interactive image retrieval systems, as well as the commonly used attribute based labels.
CDLA
The CDLA agreement is similar to permissive open source licenses in that the publisher of data allows anyone to use, modify and do what they want with the data with no obligations to share any of their changes or modifications.
2019
QA
TVQA is a large-scale video QA dataset based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. TVQA+ contains 310.8k bounding boxes, linking depicted objects to visual concepts in questions and answers.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Self-driving
We leverage a simulated driving environment to create a dataset for anomaly segmentation, which we call StreetHazards. It contains 5,125 training images and 1,500 test images containing 250 anomaly types.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Fallen People Data Set (FPDS), a novel benchmark for detecting fallen people lying on the floor. It consists of 6982 images, with a total of 5023 falls and 2275 non falls corresponding to people in conventional situations.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
QA
QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.
Not found
License information not found
2019
ObjectNet is a large real-world test set for object recognition with controls where object backgrounds, rotations, and imaging viewpoints are random. Collected to intentionally show objects from new viewpoints on new backgrounds. 50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint. 313 object classes, 113 of which overlap with ImageNet.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon, even commercially. Under the following terms: Attribution - you must give appropriate credit.
2019
CV
JRDB is the largest benchmark data for 2D-3D person tracking, including: over 60K frames (67 minutes) of sensor data captured from 5 stereo cameras and two LiDAR sensors; 54 sequences from different locations, during day and night, indoors and outdoors in a university campus environment; and around 2 million high-quality 2D bounding box annotations on 360° cylindrical video streams generated from 5 stereo cameras.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
CV
xBD
A dataset for assessing building damage from satellite imagery. With over 850,000 building polygons from six different types of natural disaster around the world, covering a total area of over 45,000 square kilometers, the xBD dataset is one of the largest and highest quality public datasets of annotated high-resolution satellite imagery.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
NLP
The Benchmark of Linguistic Minimal Pairs. BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars.
Not found
License information not found
2019
A Large-Scale Logo Dataset for Scalable Logo Classification. Our resulting logo dataset contains 167,140 images with 10 root categories and 2,341 categories.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Self-driving
SemanticKITTI is based on the KITTI Vision Benchmark and we provide semantic annotation for all sequences of the Odometry Benchmark. The dataset contains 28 classes including classes distinguishing non-moving and moving objects.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
CV
This project introduces a novel video dataset, named HACS (Human Action Clips and Segments). It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations; HACS Segments has complete action segments (from action start to end) on 50K videos. The large-scale dataset is effective for pretraining action recognition and localization models, and also serves as a new benchmark for temporal action localization.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Self-driving
A radar-centric automotive dataset based on radar, lidar and camera data for the purpose of 3D object detection.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike 4.0 International - You are free to: Share - copy and redistribute; Adapt - remix, transform, and build upon. Under the following terms: Attribution - you must give appropriate credit; NonCommercial - you may not use the material for commercial purposes; ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
CV
SEN12MS is a dataset consisting of 180,748 corresponding image triplets containing Sentinel-1 dual-pol SAR data, Sentinel-2 multi-spectral imagery, and MODIS-derived land cover maps.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2019
The VisDrone2019 dataset is collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark dataset consists of 288 video clips formed by 261,908 frames and 10,209 static images, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities separated by thousands of kilometers in China), environment (urban and country), objects (pedestrian, vehicles, bicycles, etc.), and density (sparse and crowded scenes).
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
A list of datasets for skin image analysis, from the 'Visual Diagnosis of Dermatological Disorders: Human and Machine Performance' paper.
Various
This is a list of several datasets; check the links on the website for individual licenses.
2019
NLP
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It contains more than 341M triples. Each triple from the corpus is composed of rich meta-data: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), provenance sentence (along with its dependency parse, sentence order relative to the article), original (golden) links contained in the Wikipedia articles, space / time, etc.
CC-BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
Self-driving
The dataset features 2D semantic segmentation, 3D point clouds, 3D bounding boxes, and vehicle bus data. Dataset includes more than 40,000 frames with semantic segmentation image and point cloud labels, of which more than 12,000 frames also have annotations for 3D bounding boxes. In addition, we provide unlabelled sensor data (approx. 390,000 frames) for sequences with several loops, recorded in three cities. A2D2 is around 2.3 TB in total.
CC BY-ND 4.0
Attribution No Derivatives 4.0 International (CC BY ND 4.0) - You are free to: Share - copy and redistribute, Under the following terms: Attribution - you must give appropriate credit, NoDerivatives - you may not redistribute the modified material.
2019
BigEarthNet is a new large-scale Sentinel-2 benchmark archive, consisting of 590,326 Sentinel-2 image patches. To construct BigEarthNet, 125 Sentinel-2 tiles acquired between June 2017 and May 2018 over 10 European countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland) were initially selected. All the tiles were atmospherically corrected by the Sentinel-2 Level 2A product generation and formatting tool (sen2cor). Then, they were divided into 590,326 non-overlapping image patches. Each image patch was annotated with multiple land-cover classes (i.e., multi-labels) provided by the CORINE Land Cover database of the year 2018.
CDLA Permissive
The CDLA-Permissive agreement is similar to permissive open source licenses in that the publisher of data allows anyone to use, modify and do what they want with the data with no obligations to share any of their changes or modifications.
2019
Facebook, Microsoft, Amazon Web Services, and the Partnership on AI have created the Deepfake Detection Challenge to encourage research into deepfake detection. Dataset consists of around 5000 videos, both original and manipulated. To build the dataset, the researchers crowdsourced videos from people while "ensuring a variability in gender, skin tone and age".
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
The WiderPerson dataset is a pedestrian detection benchmark dataset in the wild, of which images are selected from a wide range of scenarios, no longer limited to the traffic scenario. We choose 13,382 images and label about 400K annotations with various kinds of occlusions. We randomly select 8000/1000/4382 images as training, validation and testing subsets.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
CV
3D60 is a collective dataset generated in the context of various 360 vision research works. It comprises multi-modal (i.e. color, depth and normal) omnidirectional stereo renders (i.e. horizontal and vertical) of scenes from realistic and synthetic large-scale 3D datasets (Matterport3D, Stanford2D3D, SunCG). Contains 224,406 spherical panoramas.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split" which is the main evaluation split, and "Question token split".
Not found
License information not found
2019
The Oxford Radar RobotCar Dataset is a radar extension to The Oxford RobotCar Dataset. We provide data from a Navtech CTS350-X Millimetre-Wave FMCW radar and Dual Velodyne HDL-32E LIDARs with optimised ground truth radar odometry for 280 km of driving around Oxford, UK (in addition to all sensors in the original Oxford RobotCar Dataset).
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
Total-Text consists of 1,555 images with more than three different text orientations, including horizontal, multi-oriented, and curved, which makes it one of a kind.
BSD
BSD 3-Clause "New" or "Revised" License - A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the project or its contributors to promote derived products without written consent.
2019
NLP
ArT
ArT is a combination of Total-Text, SCUT-CTW1500 and Baidu Curved Scene Text, which were collected with the motive of introducing the arbitrary-shaped text problem to the scene text community. There is a total of 10,166 images in the ArT dataset. The ArT dataset was collected with text shape diversity in mind, hence all existing text shapes (i.e. horizontal, multi-oriented, and curved) are well represented in the dataset, which makes it a unique dataset.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
The DeepFake Forensics (Celeb-DF) dataset contains real and DeepFake synthesized videos with visual quality on par with those circulated online. The Celeb-DF dataset includes 408 original videos collected from YouTube with subjects of different ages, ethnic groups and genders, and 795 DeepFake videos synthesized from these real videos.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e. 10 different conditions) with 12 object classes (similar to PASCAL VOC) annotated on both image class level and local object bounding boxes.
BSD
BSD 3-Clause "New" or "Revised" License - A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the project or its contributors to promote derived products without written consent.
2019
Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation.
Not found
License information not found
2019
CV
The Smartphone Image Denoising Dataset (SIDD) consists of ~30,000 noisy images from 10 scenes under different lighting conditions, captured using five representative smartphone cameras, along with their generated ground truth images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Self-driving
A new dataset recorded in Brno, Czech Republic. It offers data from four WUXGA cameras, two 3D LiDARs, an inertial measurement unit, an infrared camera and, notably, a differential RTK GNSS receiver with centimetre accuracy which, to the best knowledge of the authors, is not available from any other public dataset so far. In addition, all the data are precisely timestamped with sub-millisecond precision to allow a wider range of applications. At the time of publishing of the paper, it contains recordings of more than 350 km of rides in varying environments.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2019
Dataset of Human Eye Fixation over Crowd Videos. CrowdFix includes 434 videos with diverse crowd scenes, containing a total of 37,493 frames and 1,249 seconds. The diverse content refers to different crowd activities under three distinct categories - Sparse, Dense Free Flowing and Dense Congested. All videos are at 720p resolution and 30 Hz frame rate.
Not found
License information not found
2019
Self-driving
The INTERACTION dataset contains naturalistic motions of various traffic participants in a variety of highly interactive driving scenarios. Using drones and traffic cameras, trajectories were captured in several countries, including the US, Germany and China.
Research and commercial
Research and commercial licenses available.
2019
DIODE (Dense Indoor and Outdoor DEpth) is a dataset that contains diverse high-resolution color images with accurate, dense, wide-range depth measurements. It is the first public dataset to include RGBD images of indoor and outdoor scenes obtained with one sensor suite.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2019
100,000 Faces Generated by AI. We have built an original machine learning dataset, and used StyleGAN (an amazing resource by NVIDIA) to construct a realistic set of 100,000 faces. Our dataset has been built by taking 29,000+ photos of 69 different models over the last 2 years in our studio.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Objects365 is a brand new dataset, designed to spur object detection research with a focus on diverse objects in the wild: 365 categories, 600k images, 10 million bounding boxes.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2019
FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures. The data has been sourced from 977 YouTube videos, and all videos contain a trackable, mostly frontal face without occlusions, which enables automated tampering methods to generate realistic forgeries. As we provide binary masks, the data can be used for image and video classification as well as segmentation. In addition, we provide 1000 Deepfakes models to generate and augment new data.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
We introduce a large-scale dataset called TabFact (website: https://tabfact.github.io/), which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables. Their relations are classified as ENTAILED or REFUTED.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2019
CURE-TSD: Challenging Unreal and Real Environments for Traffic Sign Detection. The video sequences in the CURE-TSD dataset are grouped into two classes: real data and unreal data. Real data correspond to processed versions of sequences acquired from the real world. Unreal data correspond to synthesized sequences generated in a virtual environment. There are 49 real sequences and 49 unreal sequences that do not include any specific challenge. We have 34 training videos and 15 test videos in both real and unreal sequences that are challenge-free. There are 300 frames in each video sequence. There are 49 challenge-free real video sequences processed with 12 different types of effects and 5 different challenge levels. Moreover, there are 49 synthesized video sequences processed with 11 different types of effects and 5 different challenge levels. In total, there are 5,733 video sequences, which include around 1.72 million frames.
Not found
License information not found
2019
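The CURE-TSD totals quoted above follow from the per-class counts in the description. A minimal sketch of that arithmetic is below; the variable names are illustrative and not part of the dataset's tooling.

```python
# Counts taken from the CURE-TSD description above; names are illustrative.
challenge_free_real = 49      # real sequences without any challenge
challenge_free_unreal = 49    # synthesized sequences without any challenge
challenge_levels = 5
real_effect_types = 12        # effect types applied to real sequences
unreal_effect_types = 11      # effect types applied to synthesized sequences
frames_per_sequence = 300

# Each challenge-free sequence is processed once per (effect type, level) pair.
real_challenge = challenge_free_real * real_effect_types * challenge_levels
unreal_challenge = challenge_free_unreal * unreal_effect_types * challenge_levels

total_sequences = (
    challenge_free_real + challenge_free_unreal + real_challenge + unreal_challenge
)
total_frames = total_sequences * frames_per_sequence

print(total_sequences)  # 5733
print(total_frames)     # 1719900, i.e. around 1.72 million
```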
The Urban Modelling Group at University College Dublin (UCD) scanned a major area of Dublin city centre (around 5.6 km^2, including partially covered areas) with an ALS device carried by helicopter in 2015. However, the main area of focus was around 2 km^2, which contains the densest LiDAR point cloud and imagery dataset. The flight altitude was mostly around 300 m and the total journey was performed in 41 flight path strips. The dataset is made up of over 260 million laser scanning points labelled into 100,000 objects.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Self-driving
A*3D dataset is a step forward to make autonomous driving safer for pedestrians and the public in the real world. 230K human-labeled 3D object annotations in 39,179 LiDAR point cloud frames and corresponding frontal-facing RGB images. Captured at different times (day, night) and weathers (sun, cloud, rain).
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
A dataset consisting of 502 dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.
CC-BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
QMUL-OpenLogo contains 27,083 images from 352 logo classes, built by aggregating and refining 7 existing datasets and establishing an open logo detection evaluation protocol.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
The dataset consists of 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
CC-BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
Self-driving
The Waymo Open Dataset is comprised of high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. We are releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology. The Waymo Open Dataset currently contains lidar and camera data from 1,000 segments of 20s each, collected at 10Hz (200,000 frames) in diverse geographies and conditions; labels for 4 object classes (Vehicles, Pedestrians, Cyclists, Signs); 12M 3D bounding box labels with tracking IDs on lidar data; 1.2M 2D bounding box labels with tracking IDs on camera data...
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
Self-driving
A comprehensive, large-scale dataset featuring the raw camera and LiDAR sensor inputs as perceived by a fleet of multiple, high-end, autonomous vehicles in a bounded geographic area. This dataset also includes high quality, human-labelled 3D bounding boxes of traffic agents and an underlying HD spatial semantic map. Contains over 55,000 human-labeled 3D annotated frames; data from 7 cameras and up to 3 lidars; a drivable surface map; and an underlying HD spatial semantic map. A semantic map provides context to reason about the presence and motion of the agents in the scenes. The provided map has over 4000 lane segments (2000 road segment lanes and about 2000 junction lanes), 197 pedestrian crosswalks, 60 stop signs, 54 parking zones, 8 speed bumps, 11 speed humps.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
NLP
Open WebText – an open source effort to reproduce OpenAI’s WebText dataset. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. The dataset was created by extracting all Reddit post URLs from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were then distributed to several machines in parallel for download, and all web pages were extracted using the newspaper Python package. Documents were hashed into sets of 5-grams and all documents that had a similarity threshold of greater than 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38GB of text data (40GB using SI units) from 8,013,769 documents.
Various
Dataset packaging is licensed under CC-0 but contains content that can have a different license; check the dataset download for more details.
2019
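The near-duplicate and length filters described for Open WebText can be sketched roughly as follows. This is a minimal illustration only: it assumes whitespace tokenization, Jaccard similarity over hashed 5-gram sets, and exact pairwise comparison. The description above does not state which similarity measure or tokenizer was used, and a corpus of this size would need an approximate method rather than pairwise comparison. All function names are hypothetical.

```python
def five_gram_hashes(text, n=5):
    """Hash every run of n consecutive whitespace tokens (assumed tokenizer)."""
    tokens = text.split()
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_documents(docs, threshold=0.5, min_tokens=128):
    """Drop near-duplicates first, then drop documents that are too short."""
    kept, kept_hashes = [], []
    for doc in docs:
        shingles = five_gram_hashes(doc)
        # Assumption: a document is a duplicate if it is more than
        # `threshold` similar to any document already kept.
        if any(jaccard(shingles, seen) > threshold for seen in kept_hashes):
            continue
        kept.append(doc)
        kept_hashes.append(shingles)
    return [doc for doc in kept if len(doc.split()) >= min_tokens]
```

For example, given two identical 200-token documents, one unrelated 200-token document and one two-word document, `filter_documents` keeps the first copy and the unrelated document.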
CV
LVIS is a new dataset for long tail object instance segmentation. 1000+ Categories: found by data-driven object discovery in 164k images. More than 2.2 million high quality instance segmentation masks.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2019
QA
CODAH is an adversarially-constructed evaluation dataset with 2.8k questions for testing common sense. CODAH forms a challenging extension to the SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video.
Not found
License information not found
2019
Taco is an open image dataset of waste in the wild. It contains photos of litter taken under diverse environments, from tropical beaches to London streets. These images are manually labeled and segmented according to a hierarchical taxonomy to train and evaluate object detection algorithms.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2019
A diverse street-level imagery dataset with bounding box annotations for detecting and classifying traffic signs around the world. 100,000 high-resolution images from all over the world with bounding box annotations of over 300 classes of traffic signs. The fully annotated set of the Mapillary Traffic Sign Dataset (MTSD) includes a total of 52,453 images with 257,543 traffic sign bounding boxes. The additional, partially annotated dataset contains 47,547 images with more than 80,000 signs that are automatically labeled with correspondence information from 3D reconstruction.
Research and commercial
Research and commercial licenses available.
2019
Self-driving
Argoverse is a research collection with three distinct types of data. The first is a dataset with sensor data from 113 scenes observed by our fleet, with 3D tracking annotations on all objects. The second is a dataset of 300,000-plus scenarios observed by our fleet, wherein each scenario contains motion trajectories of all observed objects. The third is a set of HD maps of several neighborhoods in Pittsburgh and Miami, to add rich context for all of the data mentioned above.
CC-BY-NC-SA 4.0
Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
The dataset contains rigorously annotated and validated videos, questions and answers, as well as annotations for the complexity level of each question and answer. Social-IQ brings novel challenges to the field of artificial intelligence which sparks future research in social intelligence modeling, visual reasoning, and multimodal question answering. 1,250 videos, 7,500 questions, 33,000 correct answers, 22,500 incorrect answers.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
QA
DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.
CC-BY-SA 4.0
Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions under the same license.
2019
SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Full citation list of the datasets contained: The CommitmentBank: Investigating projection in naturally occurring discourse; Choice of plausible alternatives: An evaluation of commonsense causal reasoning; Looking beyond the surface: A challenge set for reading comprehension over multiple sentences; The PASCAL recognising textual entailment challenge; The second PASCAL recognising textual entailment challenge; The third PASCAL recognizing textual entailment challenge; The Fifth PASCAL Recognizing Textual Entailment Challenge; WiC: The Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations; The Winograd schema challenge.
Various
The dataset contains data from several sources; check the links on the website for individual licenses.
2019
Human Activity Knowledge Engine (HAKE) aims to promote human activity/action understanding. As a large-scale knowledge base, HAKE is built upon existing activity datasets, and supplies human instance action labels and corresponding body part level atomic action labels (Part States). The dataset contains 104K+ images, 154 activity classes, and 677K+ human instances.
Not found
License information not found
2019
CV
PedX is a large-scale multi-modal collection of pedestrians at complex urban intersections. The dataset provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. The data was captured using two pairs of stereo cameras and four Velodyne LiDAR sensors.
MIT
MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work.
2019
CV
The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, planar segmentation as well as semantic class and instance segmentation.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
A large-scale vehicle ReID dataset in the wild (VERI-Wild) is captured from a large CCTV surveillance system consisting of 174 cameras across one month (30 × 24h) under unconstrained scenarios. The cameras are distributed in a large urban district of more than 200 km^2. After data cleaning and annotation, 416,314 vehicle images of 40,671 identities are collected.
Not found
License information not found
2019
The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above ground. A high resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
This is the second version of the Google Landmarks dataset, which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2019
A large dataset of almost two million annotated vehicles for training and evaluating object detection methods. 200,000 images. 1,990,000 annotated vehicles. 5 Megapixel resolution.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
The Unsupervised Llamas dataset was annotated by creating high definition maps for automated driving including lane markers based on Lidar. The automated vehicle can be localized against these maps and the lane markers are projected into the camera frame. The 3D projection is optimized by minimizing the difference between already detected markers in the image and projected ones. Further improvements can likely be achieved by using better detectors, optimizing difference metrics, and adding some temporal consistency. Over 100,000 annotated images. Annotations of over 100 meters. Resolution of 1276 x 717 pixels.
Non-commercial
Can only be used for research and educational purposes. Commercial use is prohibited.
2019
CV
Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations. Open Images V5 features segmentation masks for 2.8 million object instances in 350 categories. Unlike bounding-boxes, which only identify regions in which an object is located, segmentation masks mark the outline of objects, characterizing their spatial extent to a much higher level of detail.
CC BY 4.0
Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit.
2019
You can find more datasets at the UCI machine learning repository, Quantum stat NLP database and Kaggle datasets.
© 2020 Nikola Plesa
hello@datasetlist.com