Machine learning datasets

A list of machine learning datasets from across the web.

Use this form to add new datasets to the list.

Subscribe to get updates when new datasets and tools are released.

Show all datasets

Name	Year	Description	License
CV Mapillary Metropolis Dataset A multitask benchmarking framework comprising complementary data modalities at a city-scale size, registered across different representations, and enriched with human and machine generated annotations. 27,745 high-resolution 360° images with human-curated annotations, 3D point clouds from: aerial and street-level LIDAR, Structure-from-Motion and Multiview-Stereo reconstructions, geo-anchored based on high-precision, survey-grade ground control points. Full aerial image cover with 7.5 cm/px resolution. Manually labeled 2D / 3D object annotations for up to 39 semantic categories.	2021	A multitask benchmarking framework comprising complementary data modalities at a city-scale size, registered across different representations, and enriched with human and machine generated annotations. 27,745 high-resolution 360° images with human-curated annotations, 3D point clouds from: aerial and street-level LIDAR, Structure-from-Motion and Multiview-Stereo reconstructions, geo-anchored based on high-precision, survey-grade ground control points. Full aerial image cover with 7.5 cm/px resolution. Manually labeled 2D / 3D object annotations for up to 39 semantic categories.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Google Open Buildings A dataset of building footprints to support social good applications. The dataset contains 516M building detections, across an area of 19.4M km2 (64% of the African continent).	2021	A dataset of building footprints to support social good applications. The dataset contains 516M building detections, across an area of 19.4M km2 (64% of the African continent).	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
CV Matterport Habitat Facebook AI and Matterport have collaborated on the release of the largest-ever 3D dataset of indoor spaces made up of accurately-scaled residential and commercial spaces. The dataset consists of 3D Meshes and Textures of 1,000 Matterport spaces.	2021	Facebook AI and Matterport have collaborated on the release of the largest-ever 3D dataset of indoor spaces made up of accurately-scaled residential and commercial spaces. The dataset consists of 3D Meshes and Textures of 1,000 Matterport spaces.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Unsplash Dataset The Unsplash Dataset is created by 250,000+ contributing photographers and billions of searches across thousands of applications, uses, and contexts. Lite version has 25.000 images, Full version has 3.000.000+ images.	2021	The Unsplash Dataset is created by 250,000+ contributing photographers and billions of searches across thousands of applications, uses, and contexts. Lite version has 25.000 images, Full version has 3.000.000+ images.	Various Two versions: Lite: commercial and noncommercial use, Full: for noncommercial use
CV BuildingNet A large-scale dataset of 3D building models, contains 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models.	2021	A large-scale dataset of 3D building models, contains 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Apple Hypersim Dataset A photorealistic synthetic dataset for holistic indoor scene understanding. 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.	2021	A photorealistic synthetic dataset for holistic indoor scene understanding. 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.	CC-BY-SA 3.0 Attribution-ShareAlike 3.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit, ShareAlike - if you make changes, you must distribute your contributions.
CV PASS Dataset An ImageNet replacement for self-supervised pretraining without humans. PASS contains 1.4 million distinct images.	2021	An ImageNet replacement for self-supervised pretraining without humans. PASS contains 1.4 million distinct images.	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
CV Amazon Berkeley Objects (ABO) A dataset of Amazon products with metadata, catalog images, and 3D models. 147,702 products and 398,212 unique catalog images in high resolution.	2021	A dataset of Amazon products with metadata, catalog images, and 3D models. 147,702 products and 398,212 unique catalog images in high resolution.	CC BY-NC 4.0 Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes.
CV URSA Dataset Unlimited Road-scene Synthetic Annotation (URSA) Dataset, a synthetic dataset containing upwards of 1,000,000 images.	2021	Unlimited Road-scene Synthetic Annotation (URSA) Dataset, a synthetic dataset containing upwards of 1,000,000 images.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV EDFace-Celeb-1M https://github.com/HDCVLab/EDFace-Celeb-1M	2021	https://github.com/HDCVLab/EDFace-Celeb-1M	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Facebook Casual Conversations Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of age, genders, apparent skin tones and ambient lighting conditions. Casual Conversations is composed of over 45,000 videos (3,011 participants) and intended to be used for assessing the performance of already trained models.	2021	Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of age, genders, apparent skin tones and ambient lighting conditions. Casual Conversations is composed of over 45,000 videos (3,011 participants) and intended to be used for assessing the performance of already trained models.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
NLP IBM CodeNet A large dataset aimed at teaching AI to code, it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.	2021	A large dataset aimed at teaching AI to code, it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.	Apache Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
CV Mapillary Vistas 2.0 The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. 25,000 images pixel-accurately labeled into 152 object categories, 100 of those instance-specific.	2021	The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. 25,000 images pixel-accurately labeled into 152 object categories, 100 of those instance-specific.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
Audio Spotify Podcast Dataset The podcast dataset contains about 100k podcasts filtered to contain only documents which the creator tags as being in the English language, as well as by a language filter applied to the creator-provided title and description.	2021	The podcast dataset contains about 100k podcasts filtered to contain only documents which the creator tags as being in the English language, as well as by a language filter applied to the creator-provided title and description.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
Self-driving Waymo Open Motion Dataset With object trajectories and corresponding 3D maps for over 100,000 segments, each 20 seconds long and mined for interesting interactions, our new motion dataset contains more than 570 hours of unique data.	2021	With object trajectories and corresponding 3D maps for over 100,000 segments, each 20 seconds long and mined for interesting interactions, our new motion dataset contains more than 570 hours of unique data.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Facebook TextOCR TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.	2021	TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
Audio Timers and Such v1.0 Contains spoken English commands for setting timers, setting alarms, unit conversions, and simple math. The dataset contains around ~2,200 spoken audio commands from 95 speakers, representing 2.5 hours of continuous audio.	2021	Contains spoken English commands for setting timers, setting alarms, unit conversions, and simple math. The dataset contains around ~2,200 spoken audio commands from 95 speakers, representing 2.5 hours of continuous audio.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
QA CaseHOLD CaseHOLD contains 53,000 multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, which could be cited.	2021	CaseHOLD contains 53,000 multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, which could be cited.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
NLP Contract Understanding Dataset Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contact review in connection with a corporate transaction, including mergers & acquisitions, etc.	2021	Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contact review in connection with a corporate transaction, including mergers & acquisitions, etc.	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
CV WebFace260M WebFace260M is a new million-scale face benchmark, which is constructed for the research community towards closing the data gap behind the industry.	2021	WebFace260M is a new million-scale face benchmark, which is constructed for the research community towards closing the data gap behind the industry.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
NLP Danish Gigaword A billion-word corpus of Danish text, freely distributed with attribution.	2021	A billion-word corpus of Danish text, freely distributed with attribution.	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
Self-driving ONCE Dataset The ONCE dataset is a large-scale autonomous driving dataset with 2D&3D object annotations. Includes 1 Million LiDAR frames, 7 Million camera images.	2021	The ONCE dataset is a large-scale autonomous driving dataset with 2D&3D object annotations. Includes 1 Million LiDAR frames, 7 Million camera images.	CC-BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions.
CV ACDC Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods on adverse visual conditions. It comprises a large set of 4006 images which are evenly distributed between fog, nighttime, rain, and snow.	2021	Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods on adverse visual conditions. It comprises a large set of 4006 images which are evenly distributed between fog, nighttime, rain, and snow.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV SkyCam A Dataset of Sky Images and their Irradiance values. SkyCam dataset is a collection of sky images from a variety of locations with diverse topological characteristics (Swiss Jura, Plateau and Pre-Alps regions), from both single and stereo camera settings coupled with a high-accuracy pyranometers. The dataset was collected with a high frequency with a data sample every 10 seconds.	2021	A Dataset of Sky Images and their Irradiance values. SkyCam dataset is a collection of sky images from a variety of locations with diverse topological characteristics (Swiss Jura, Plateau and Pre-Alps regions), from both single and stereo camera settings coupled with a high-accuracy pyranometers. The dataset was collected with a high frequency with a data sample every 10 seconds.	CC-BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions.
CV LandCover.ai A dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.	2021	A dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Deep Fakes Dataset A dataset of “in the wild” portrait videos. The videos are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect-ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations; totaling up to 142 videos, 32 minutes, and 17 GBs.	2021	A dataset of “in the wild” portrait videos. The videos are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect-ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations; totaling up to 142 videos, 32 minutes, and 17 GBs.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
Self-driving 4Seasons A novel dataset covering seasonal and challenging perceptual conditions for autonomous driving.	2020	A novel dataset covering seasonal and challenging perceptual conditions for autonomous driving.	CC-BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions.
CV Canadian Building Footprints This dataset contains 11,842,186 computer generated building footprints in all Canadian provinces and territories.	2020	This dataset contains 11,842,186 computer generated building footprints in all Canadian provinces and territories.	ODbl You are free: to share, create and adapt as long as you attribute, share-alike and keep open.
Medical MedMNIST MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28 * 28 images, which requires no background knowledge.	2020	MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28 * 28 images, which requires no background knowledge.	Various The dataset contains data from several sources, check the links on the website for individual licenses
CV Cube++ Cube++ is a novel dataset collected for computational color constancy. It has 4890 raw 18-megapixel images, each containing a SpyderCube color target in their scenes, manually labelled categories, and ground truth illumination chromaticities.	2020	Cube++ is a novel dataset collected for computational color constancy. It has 4890 raw 18-megapixel images, each containing a SpyderCube color target in their scenes, manually labelled categories, and ground truth illumination chromaticities.	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
CV SYSU-30k Large-scale Person Re-ID Dataset. SYSU-30k contains 29,606,918 images.	2020	Large-scale Person Re-ID Dataset. SYSU-30k contains 29,606,918 images.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
CV Smithsonian Open Access Smithsonian Open Access, where you can download, share, and reuse millions of the Smithsonian’s images—right now, without asking. With new platforms and tools, you have easier access to more than 3 million 2D and 3D digital items.	2020	Smithsonian Open Access, where you can download, share, and reuse millions of the Smithsonian’s images—right now, without asking. With new platforms and tools, you have easier access to more than 3 million 2D and 3D digital items.	CC-0 CC-0 - No Copyright
CV Objectron The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment. Includes 15000 annotated videos and 4M annotated images.	2020	The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment. Includes 15000 annotated videos and 4M annotated images.	C-UDA-1.0 Computational Use of Data Agreement (C-UDA): - data that is assembled from lawfully accessed, publicly available sources to be used for computational analysis.
Medical MedICaT MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Consists of: 217,060 figures from 131,410 open access papers, 7507 subcaption and subfigure annotations for 2069 compound figures, Inline references for ~25K figures in the ROCO dataset.	2020	MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Consists of: 217,060 figures from 131,410 open access papers, 7507 subcaption and subfigure annotations for 2069 compound figures, Inline references for ~25K figures in the ROCO dataset.	CC-BY-NC-ND 4.0 Attribution-NonCommercial-NoDerivs International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes, NoDerivs - if you make changes, you may not distribute the modified material.
NLP CLUE benchmark CLUE: A Chinese Language Understanding Evaluation Benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.	2020	CLUE: A Chinese Language Understanding Evaluation Benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.	Various The dataset contains data from several sources, check the links on the website for individual licenses
CV Ruralscapes Dataset Ruralscapes Dataset for Semantic Segmentation in UAV Videos. Ruralscapes is a dataset with 20 high quality (4K) videos portraying rural areas.	2020	Ruralscapes Dataset for Semantic Segmentation in UAV Videos. Ruralscapes is a dataset with 20 high quality (4K) videos portraying rural areas.	CC-BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions.
CV Fashionpedia Fashionpedia is a dataset which consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.	2020	Fashionpedia is a dataset which consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.	CC BY 4.0 Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercialy, Under the following terms: Attribution - you must give approprate credit.
NLP Social Bias Inference Corpus Social Bias Inference Corpus (SBIC contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.	2020	Social Bias Inference Corpus (SBIC contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
Medical BrixIA COVID19 severity score assessment project and database. 4703 CXR of COVID19 patients.	2020	COVID19 severity score assessment project and database. 4703 CXR of COVID19 patients.	Non-commercial Can only be used for research and educational purposes. Commercial use is prohibited.
Medical MaskedFace-Net MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) based on the dataset Flickr-Faces-HQ (FFHQ).	2020	MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) based on the dataset Flickr-Faces-HQ (FFHQ).	CC-BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give approprate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions.