Public Datasets
Dataset Finder
-
Roboflow - Computer Vision Datasets: Roboflow hosts free public computer vision datasets in many popular formats (including CreateML JSON, COCO JSON, Pascal VOC XML, YOLO v3, and TensorFlow TFRecords).
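Formats like COCO JSON are simple to work with directly. A minimal sketch of pulling bounding boxes out of a COCO-style annotation dict (the field names below are the standard COCO keys; the image, box, and category values are invented for illustration):

```python
import json

# Toy annotation data in (simplified) COCO JSON layout. "images",
# "annotations", "bbox", and "category_id" are the standard COCO keys;
# the values here are made up.
coco = json.loads("""
{
  "images": [{"id": 1, "file_name": "cat.jpg", "width": 640, "height": 480}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 3, "bbox": [100, 120, 50, 80]}
  ],
  "categories": [{"id": 3, "name": "cat"}]
}
""")

def boxes_for_image(coco_dict, image_id):
    """Return (category_name, [x, y, width, height]) pairs for one image."""
    names = {c["id"]: c["name"] for c in coco_dict["categories"]}
    return [
        (names[a["category_id"]], a["bbox"])
        for a in coco_dict["annotations"]
        if a["image_id"] == image_id
    ]

print(boxes_for_image(coco, 1))  # [('cat', [100, 120, 50, 80])]
```

The same lookup pattern works on a real `annotations.json` loaded with `json.load`.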
-
Machine Learning datasets: A list of the biggest machine learning datasets from across the web.
-
Kaggle: A data science site that contains a variety of interesting, externally contributed datasets. You can find all kinds of niche datasets in its master list, from ramen ratings to basketball data and even Seattle pet licenses.
-
UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. Although the datasets are user-contributed, and thus have varying levels of cleanliness, the vast majority are clean. You can download data directly from the UCI Machine Learning Repository, without registration.
-
Google Dataset Search: Thanks to Google’s support for Schema.org dataset markup, the metadata for datasets is now recognized by Google’s knowledge graph. The service is currently in beta.
-
Google Public Datasets: Public Datasets on Google Cloud Platform make it easy for users to access and analyze data in the cloud. These datasets are freely hosted and accessible using a variety of data warehouse and analytics software, from open source Apache Spark to cutting-edge Google technologies like Google BigQuery and Google Cloud Dataflow. From structured genomic or encyclopedic data to unstructured climate data, Public Datasets provide a playground for those new to big data and data analysis and a powerful repository for skilled researchers. You can also integrate them with your application to add valuable insights for your users. Whatever your use case, these datasets are freely available on GCP. This page also leads to some special subsets, such as Google BigQuery Public Datasets (https://cloud.google.com/bigquery/public-data/; the first terabyte is free but charges apply after that), Google Genomics Public Datasets, and Geo Imagery Datasets.
-
Microsoft Research Open Data: A collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences. Download or copy directly to a cloud-based Data Science Virtual Machine for a seamless development experience. Categories include Biology, Engineering, Healthcare, Mathematics, Social Science, Computer Science, Environmental Science, Information Science and Physics. MS Research Open Data doesn’t search the entire web, but rather makes available 53 previously proprietary datasets, all in the realm of deep learning, both text/speech and image.
-
Open Data on AWS: The Registry of Open Data on AWS makes it easy to find datasets made publicly available through AWS services. Browse available data and learn how to register your own datasets at: https://registry.opendata.aws
-
Academic Torrents: A distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.
-
GitHub Awesome Public Datasets: A list of high-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses. Most of the datasets listed are free, though some are not. 565 datasets.
-
Figure Eight: This commercial provider of human-in-the-loop data currently offers only eight datasets. The reason for their inclusion here is unique: Figure Eight has built its reputation on providing accurate data, especially by enhancing the accuracy of its clients’ data.
-
Skymind: Skymind is a commercial platform to rapidly prototype, deploy, maintain, and retrain machine learning models. They offer 101 datasets from a variety of sources that cover Natural-Image, Geospatial, Facial, Video, Text, Question answering, Sentiment, Recommendation and ranking systems, Networks and Graphs, Speech Datasets, Symbolic Music, Health & Biology, and Government & statistical data sets.
-
Yahoo Webscope Program: The Yahoo Webscope Program is a reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists. All datasets have been reviewed to conform to Yahoo’s data protection standards, including strict controls on privacy. We have a number of datasets that we are excited to share with you.
- Stanford Large Network Dataset Collection:
- Social networks : online social networks, edges represent interactions between people
- Networks with ground-truth communities : ground-truth network communities in social and information networks
- Communication networks : email communication networks with edges representing communication
- Citation networks : nodes represent papers, edges represent citations
- Collaboration networks : nodes represent scientists, edges represent collaborations (co-authoring a paper)
- Web graphs : nodes represent webpages and edges are hyperlinks
- Amazon networks : nodes represent products and edges link commonly co-purchased products
- Internet networks : nodes represent computers and edges communication
- Road networks : nodes represent intersections and edges roads connecting the intersections
- Autonomous systems : graphs of the internet
- Signed networks : networks with positive and negative edges (friend/foe, trust/distrust)
- Location-based online social networks : social networks with geographic check-ins
- Wikipedia networks, articles, and metadata : talk, editing, voting, and article data from Wikipedia
- Temporal networks : networks where edges have timestamps
- Twitter and Memetracker : memetracker phrases, links and 467 million Tweets
- Online communities : data from online communities such as Reddit and Flickr
- Online reviews : data from online review systems such as BeerAdvocate and Amazon
- User actions : actions of users on social platforms.
- Face-to-face communication networks : networks of face-to-face (non-online) interactions
- Graph classification datasets : disjoint graphs from different classes
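Most SNAP files are plain whitespace-separated edge lists with `#` comment headers, so loading one needs no special tooling. A minimal sketch, with a few invented edges standing in for a downloaded file:

```python
from collections import defaultdict

# SNAP graphs ship as edge lists: one "src dst" pair per line, with
# comment lines starting with '#'. The lines below are invented.
edge_list = """\
# Directed graph: toy.txt
# FromNodeId ToNodeId
0 1
0 2
1 2
2 0
"""

def load_edges(text):
    """Parse a SNAP-style edge list into an adjacency dict of sets."""
    adj = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        src, dst = map(int, line.split())
        adj[src].add(dst)
    return adj

adj = load_edges(edge_list)
print(sorted(adj[0]))                      # neighbors of node 0: [1, 2]
print(sum(len(v) for v in adj.values()))   # total edges: 4
```

For the larger collections (hundreds of millions of edges) the same loop works line-by-line over a gzip stream instead of an in-memory string.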
-
Group Lens: GroupLens Research has collected and made available several datasets: MovieLens, WikiLens, Book-Crossing, Jester, EachMovie, HetRec2011, Serendipity 2018, and Personality 2018.
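GroupLens datasets such as MovieLens ship as plain CSV, so per-movie statistics take only a few lines of standard-library Python. A sketch assuming the `userId,movieId,rating,timestamp` header used by the ml-latest releases; the rows themselves are invented:

```python
import csv
import io
from collections import defaultdict

# Invented rows in the MovieLens ratings.csv layout.
raw = """userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
2,31,4.5,835355493
"""

# Group ratings by movie, then compute the mean per movie.
ratings = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    ratings[row["movieId"]].append(float(row["rating"]))

mean = {movie: sum(r) / len(r) for movie, r in ratings.items()}
print(mean["31"])  # (2.5 + 4.5) / 2 = 3.5
```

On the real file, replace `io.StringIO(raw)` with `open("ratings.csv")`.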
-
Edinburgh Data Share: Edinburgh DataShare is a digital repository of research data produced at the University of Edinburgh, hosted by Information Services. Edinburgh University researchers who have produced research data associated with an existing or forthcoming publication, or which has potential use for other researchers, are invited to upload their dataset for sharing and safekeeping. A persistent identifier and suggested citation will be provided.
-
Dataturks: A data annotation platform: image bounding boxes, document annotation, and NLP/text annotations. Provides human-in-the-loop training data for machine learning.
-
Visualdata.io: Discover Computer Vision Datasets
-
KEEL Repository for classification, regression and time series: Provides machine learning researchers with a set of benchmarks to analyze the behavior of learning methods. Concretely, it is possible to find benchmarks already formatted in KEEL format for classification (such as standard, multi-instance or imbalanced data), semi-supervised classification, regression, time series and unsupervised learning. A set of low-quality data benchmarks is also maintained in the repository.
-
Outlier Detection DataSets (ODDS): A large collection of outlier detection datasets with ground truth (where available).
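Because ODDS pairs feature values with binary outlier labels, even a trivial detector can be scored against the ground truth. A minimal sketch with invented data, flagging points more than 1.5 standard deviations from the mean (the loose cutoff is deliberate: extreme outliers inflate the standard deviation, so a strict 3-sigma rule can miss them):

```python
import statistics

# Invented one-dimensional data with ground-truth labels (1 = outlier).
values = [10.1, 9.8, 10.0, 10.2, 9.9, 25.0, 10.1, -4.0]
labels = [0, 0, 0, 0, 0, 1, 0, 1]

# Flag points far from the mean in units of the sample standard deviation.
mu = statistics.mean(values)
sigma = statistics.stdev(values)
predicted = [1 if abs(v - mu) / sigma > 1.5 else 0 for v in values]

# Score against the ground truth.
tp = sum(1 for p, y in zip(predicted, labels) if p and y)
precision = tp / max(sum(predicted), 1)
recall = tp / max(sum(labels), 1)
print(predicted, precision, recall)
```

Real benchmarks would use a stronger detector (isolation forests, LOF, etc.), but the evaluation loop is the same.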
-
UEA & UCR Time Series Classification Repository: This website is an ongoing project to develop a comprehensive repository for research into time series classification. If you use the results or code, please cite the paper “Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large and Eamonn Keogh, The Great Time Series Classification Bake Off: a Review and Experimental Evaluation of Recent Algorithmic Advances, Data Mining and Knowledge Discovery, 31(3), 2017”. We are in the process of updating all the results for the new dataset.
Format: various
Default task: various -
Network Repository: The first interactive data and network data repository with real-time visual analytics. It is not only the first interactive repository, but also the largest network repository, with thousands of donations in 30+ domains (from biological to social network data). This large, comprehensive collection of network graph data is useful both for making significant research findings and as benchmark network datasets for a wide variety of applications and domains (e.g., network science, bioinformatics, machine learning, data mining, physics, and social science), and includes relational, attributed, heterogeneous, streaming, spatial, and time series network data as well as non-relational machine learning data. All graph datasets are easily downloaded in a standard, consistent format. A multi-level interactive graph analytics engine lets users visualize the structure of the network data, macro-level graph statistics, and important micro-level properties of the nodes and edges. Check out GraphVis: the interactive visual network mining and machine learning tool.
Format: various
Default task: various -
Deep Learning Datasets: Collated list of image and video datasets.
Format: various
Default task: various -
PhysioNet: This page displays an alphabetical list of all the databases on PhysioNet. To search content on PhysioNet, visit the search page. Enter the search terms, add a filter for resource type if needed, and select how you would like the results to be ordered (for example, by relevance, by date, or by title).
Each project is made available under one of the following access policies:
Open Access: Accessible by all users, with minimal restrictions on reuse.
Restricted Access: Accessible by registered users who sign a Data Use Agreement.
Credentialed Access: Accessible by registered users who complete the credentialing process and sign a Data Use Agreement.
Format: Various
Default task: Various -
Bifrost Visual Datasets: Collated list of image datasets.
Format: various
Default task: various -
Awesome list of datasets in 100+ categories: With an estimated 44 zettabytes of data in existence in our digital world today and approximately 2.5 quintillion bytes of new data generated daily, there is a lot of data out there you could tap into for your data science projects. It’s pretty hard to curate through such a massive universe of data, but this collection is a great start. Here, you can find data from cancer genomes to UFO reports, as well as years of air quality data to 200,000 jokes. Dive into this ocean of data to explore as you learn how to apply data science techniques or leverage your expertise to discover something new.
Format: various
Default task: various - Payititi (帕依提提):
Open datasets
Format: various
Default task:
Image Datasets
-
CVonline: Image Databases: This is a collated list of image and video databases that people have found useful for computer vision research and algorithm evaluation.
-
Cove Computer Vision Exchange: COVE is an online repository for computer vision datasets sponsored by the Computer Vision Foundation. It is intended to aid the computer vision research community and serve as a centralized reference for all datasets in the field. If you are a researcher with a dataset not currently in COVE, please help make this site as comprehensive a resource as possible for the community and add it to the database!
-
Diversity in Faces Dataset: Diversity in Faces (DiF) is a large and diverse dataset that seeks to advance the study of fairness and accuracy in facial recognition technology. The first of its kind available to the global research community, DiF provides a dataset of annotations of 1 million human facial images.
-
ADE20K: Semantic Understanding of Scenes through ADE20K Dataset. Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso and Antonio Torralba. International Journal on Computer Vision (IJCV).
-
MURA (healthcare): MURA (musculoskeletal radiographs) is a large dataset of bone X-rays. Algorithms are tasked with determining whether an X-ray study is normal or abnormal.
-
Labelme: A large dataset of annotated images.
-
ImageNet: The de facto image dataset for new algorithms. It is organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds to thousands of images.
-
Surrey Audio-Visual Expressed Emotion (SAVEE) Database: The Surrey Audio-Visual Expressed Emotion (SAVEE) database was recorded as a prerequisite for the development of an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically balanced for each emotion. The data were recorded in a visual media lab with high-quality audio-visual equipment, then processed and labeled. To check the quality of performance, the recordings were evaluated by 10 subjects under audio, visual, and audio-visual conditions. Classification systems were built using standard features and classifiers for each modality, and speaker-independent recognition rates of 61%, 65%, and 84% were achieved, respectively.
Facial Recognition
-
FERET (facial recognition technology): 11338 images of 1199 individuals in different positions and at different times.
-
CMU Pose, Illumination, and Expression (PIE): 41,368 color images of 68 people in 13 different poses. Images labeled with expressions. (Pay for shipping)
-
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. Files labelled with expression. Perceptual validation ratings provided by 319 raters.
-
SCFace: Color images of faces at various angles. Location of facial features extracted. Coordinates of features given.
-
YouTube Faces DB: Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames. Identity of those appearing in videos and descriptors.
-
300 videos in-the-Wild: The 300 Videos in the Wild (300-VW) dataset contains videos for facial landmarks tracking. Specifically, this dataset includes 114 lengthy videos (approx. 1 min each) with 68 markup landmark points annotated densely.
-
Grammatical Facial Expressions Dataset: Grammatical Facial Expressions from Brazilian Sign Language.
-
CMU Face Images Data Set: Images of faces. Each person is photographed multiple times to capture different expressions.
-
Yale Face Database: The Yale Face Database (size 6.4MB) contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one per different facial expression or configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink.
-
Cohn-Kanade AU-Coded Expression Database: Large database of images with labels for expressions.
-
FaceScrub: Images of public figures scrubbed from image searching.
-
Skin Segmentation Data Set: Randomly sampled color values from face images.
-
Bosphorus: The Bosphorus Database is intended for research on 3D and 2D human face processing tasks including expression recognition, facial action unit detection, facial action unit intensity estimation, face recognition under adverse conditions, deformable face modeling, and 3D face reconstruction. There are 105 subjects and 4666 faces in the database.
-
UOY 3D-Face: The UoY 3D face dataset is a set of 3D images of the human face and consists of around 5000 3D images of approximately 350 people (15 models each). The data collection was planned and implemented by Tom Heseltine during his PhD in 3D Face Recognition at the Department of Computer Science, University of York.
-
Biometrics Ideal Test: Biometrics Ideal Test (or BIT for short) is a website for biometric database sharing and algorithm evaluation. Our mission is to facilitate biometrics research and development by providing quality public services to biometric researchers. You are welcome to register an account in BIT so that you can download the publicly available iris, face, fingerprint, palmprint, multi-spectral palm and handwriting datasets.
-
BU-3DFE: neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted.
-
Face Recognition Grand Challenge Dataset: Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data.
-
3D-RMA: Up to 100 subjects, expressions mostly neutral. Several poses as well.
-
Specs on Faces: A collection of 42,592 images for 112 persons (66 males and 46 females) who wear glasses under different illumination conditions.
Format: Images
Default task: Gender classification - face detection - eyeglasses detection - emotion recognition - facial landmark detection -
UCSD Anomaly Detection Dataset: The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd density in the walkways was variable, ranging from sparse to very crowded.
Format: Images
Default task: Anomaly Detection -
Anomalous Behavior Data Set: This website provides a data set for anomalous behaviour detection in video. The data set contains 8 image sequences that depict a wide range of challenging scenarios, including: illumination effects, scene clutter, variable target appearance, rapid motion and camera jitter. All sequences are available with manually constructed ground truth that identifies anomalous behaviour relative to a training portion of the video. Also provided is software for groundtruth construction and subsequent evaluation.
Format: Videos, Images
Default task: Anomaly Detection -
IMDB-WIKI: IMDB and Wikipedia face images with gender and age labels.
Action Recognition
-
Human Motion DataBase (HMDB51): 51 action categories, each containing at least 101 clips, extracted from a range of sources.
-
TV Human Interaction Dataset: Consists of 300 video clips collected from over 20 different TV shows and containing 4 interactions: hand shakes, high fives, hugs and kisses, as well as clips that don’t contain any of the interactions.
-
UT Interaction: People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip.
-
UT Kinect: 10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting.
-
Berkeley Multimodal Human Action Database (MHAD): Recordings of a single person performing 12 actions.
-
UCF101 – Action Recognition Data Set: Self-described as “a dataset of 101 human action classes from videos in the wild.” The dataset is large, with over 27 hours of video.
-
THUMOS Dataset: Large video dataset for action classification.
-
Activitynet: A Large-Scale Video Benchmark for Human Activity Understanding
-
MSP-AVATAR: The MSP-Avatar corpus is a motion capture database which explores the role of discourse functions in non-verbal human interactions. This database comprises three sessions of recordings of spontaneous dyadic interactions between six actors. The scenarios are designed to elicit different types of discourse-related gestures in the actors.
-
LILiR Twotalk Corpus: The LILiR Twotalk corpus comprises four two-person (dyadic) conversations recorded with minimal constraints on participant behavior. The four 12-minute conversations were recorded with two PAL progressive-scan cameras, one microphone, and eight subjects. Annotation was performed by multiple annotators from various cultures on 527 clips extracted from the longer videos. The participants were only instructed to be seated and to talk.
-
MEXAction2: Video dataset for action localization and spotting
-
STAIR Actions Videos: A large-scale video dataset of everyday human actions. January 30, 2019: STAIR Actions v1.1 is released! STAIR Actions is a video dataset consisting of 100 everyday human action categories. Each category contains around 900 to 1800 trimmed video clips, and each clip lasts 5 to 6 seconds. Clips are taken from YouTube videos or made by crowdsource workers.
Format:
Default task:
Ref:Tutorial -
A2D Action Recognition: Can humans fly? Emphatically no. Can cars eat? Again, absolutely not. Yet these absurd inferences result from the current disregard for particular types of actors in action understanding. There is no work we know of on simultaneously inferring actors and actions in video, not to mention a dataset to experiment with. A2D hence marks the first effort in the computer vision community to jointly consider various types of actors undergoing various actions. To be exact, we consider seven actor classes (adult, baby, ball, bird, car, cat, and dog) and eight action classes (climb, crawl, eat, fly, jump, roll, run, and walk), not including the no-action class, which we also consider. A2D has 3782 videos with at least 99 instances per valid actor-action tuple, and videos are labeled with both pixel-level actors and actions for sampled frames. The A2D dataset serves as a novel large-scale testbed for various vision problems: video-level single- and multiple-label actor-action recognition, instance-level object segmentation/co-segmentation, and pixel-level actor-action semantic segmentation, to name a few.
Format:
Default task:
Ref:Tutorial -
KTH Action Recognition: The video database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). The database currently contains 2391 sequences. All sequences were taken over homogeneous backgrounds with a static camera at a 25fps frame rate. The sequences were downsampled to a spatial resolution of 160x120 pixels and have an average length of four seconds.
Format:
Default task:
Ref:Tutorial
Object Detection and Recognition
-
Visual Genome: Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.
-
DAVIS: Densely Annotated VIdeo Segmentation: 150 video sequences containing 10459 frames with a total of 376 objects annotated.
-
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects: 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object.
-
Berkeley 3-D Object Dataset: A quality depth sensor, the Microsoft Kinect, is now in millions of homes. Yet robust household object detection is still not a reality. To get there, we are collecting a massive, crowd-sourced, and challenging 3-D object dataset.
-
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500): 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300.
-
Microsoft Common Objects in Context (COCO): Complex everyday scenes of common objects in their natural context.
-
SUN Database: Very large scene and object recognition database.
-
Open Images: A large set of images listed as having a CC BY 2.0 license, with image-level labels and bounding boxes spanning thousands of classes. 15,851,536 boxes on 600 categories
2,785,498 instance segmentations on 350 categories
3,284,280 relationship annotations on 1,466 relationships
675,155 localized narratives
59,919,574 image-level labels on 19,957 categories
Extension - 478,000 crowdsourced images with 6,000+ categories -
TV News Channel Commercial Detection Dataset: TV commercials and news broadcasts.
-
Caltech 101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc’Aurelio Ranzato. The size of each image is roughly 300 x 200 pixels.
-
Caltech-256: Large dataset of images for object classification.
-
SIFT10M Data Set: SIFT features of the Caltech-256 dataset.
-
Cityscapes Dataset: The Cityscapes Dataset focuses on semantic understanding of urban street scenes.
-
PASCAL VOC Dataset: Large number of images for classification tasks.
-
CIFAR-10 CIFAR-100 Dataset: Many small, low-resolution, images of 10 classes of objects.
-
CINIC-10: CINIC-10 is an augmented extension of CIFAR-10. It contains the images from CIFAR-10 (60,000 images, 32x32 RGB pixels) and a selection of ImageNet database images (210,000 images downsampled to 32x32). It was compiled as a ‘bridge’ between CIFAR-10 and ImageNet, for benchmarking machine learning applications. It is split into three equal subsets - train, validation, and test - each of which contains 90,000 images.
-
Fashion MNIST: A MNIST-like fashion product database
-
notMNIST dataset: Glyphs extracted from some publicly available fonts to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts.
-
The German Traffic Sign Detection Benchmark: Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries.
-
KITTI Vision Benchmark Suite: Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners.
-
Linnaeus 5 dataset: Images of 5 classes of objects.
-
FieldSAFE: Multi-modal dataset for obstacle detection in agriculture including stereo camera, thermal camera, web camera, 360-degree camera, lidar, radar, and precise localization.
-
11K Hands: 11,076 hand images (1600 x 1200 pixels) of 190 subjects, of varying ages between 18 – 75 years old, for gender recognition and biometric identification.
-
CORe50: Specifically designed for Continuous/Lifelong Learning and Object Recognition, is a collection of more than 500 videos (30fps) of 50 domestic objects belonging to 10 different categories.
Handwriting and character recognition
-
Artificial Characters Dataset: Artificially generated data describing the structure of 10 capital English letters.
-
Letter Dataset: Upper case printed letters.
-
Character Trajectories Dataset: Labeled samples of pen tip trajectories for people writing simple characters.
-
Chars74K Dataset: Character recognition in natural images of symbols used in both English and Kannada.
-
UJI Pen Characters Dataset: Isolated handwritten characters
-
Gisette Dataset: Handwriting samples of the often-confused digits ‘4’ and ‘9’.
-
MNIST database: The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
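The MNIST files use the simple big-endian IDX format: a header holding a magic number and one 32-bit size per dimension, followed by raw unsigned bytes. That means they can be read with the standard library alone. A sketch parsing two tiny invented 2x2 "images" packed in that layout:

```python
import struct

# Two invented 2x2 images packed in the IDX image layout: a big-endian
# header (magic 2051, count, rows, cols) followed by raw pixel bytes.
pixels = bytes([0, 255, 128, 64,   10, 20, 30, 40])
idx_bytes = struct.pack(">IIII", 2051, 2, 2, 2) + pixels

def read_idx_images(buf):
    """Parse an IDX image buffer into a list of row-major pixel grids."""
    magic, n, rows, cols = struct.unpack_from(">IIII", buf, 0)
    assert magic == 2051, "not an IDX image file"
    images, offset = [], 16  # pixel data starts after the 16-byte header
    for _ in range(n):
        img = [list(buf[offset + r * cols : offset + (r + 1) * cols])
               for r in range(rows)]
        images.append(img)
        offset += rows * cols
    return images

imgs = read_idx_images(idx_bytes)
print(imgs[0])  # [[0, 255], [128, 64]]
```

The real `train-images-idx3-ubyte` file parses the same way after gunzipping; label files use magic 2049 and a single dimension.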
-
Optical Recognition of Handwritten Digits Dataset: Normalized bitmaps of handwritten data.
-
Pen-Based Recognition of Handwritten Digits Dataset: Handwritten digits on electronic pen-tablet.
-
Semeion Handwritten Digit Dataset: 1593 handwritten digits from around 80 persons were scanned and stretched into a rectangular 16x16 box in a grayscale of 256 values. Then each pixel of each image was scaled to a boolean (1/0) value using a fixed threshold.
Each person wrote all the digits from 0 to 9 on paper, twice. The commitment was to write the digit the first time in the normal way (trying to write each digit accurately) and the second time in a fast way (with no accuracy). -
HASYv2: HASY contains 32px x 32px images of 369 symbol classes. In total, HASY contains over 150,000 instances of handwritten symbols.
-
Noisy Handwritten Bangla Dataset: Includes a Handwritten Numeral Dataset (10 classes) and a Basic Character Dataset (50 classes); each dataset has three types of noise: white Gaussian, motion blur, and reduced contrast.
Aerial images
-
Inria Aerial Image Labeling Dataset: The Inria Aerial Image Labeling benchmark addresses a core topic in remote sensing: the automatic pixelwise labeling of aerial imagery.
-
Aerial Image Segmentation Dataset: 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0 m.
-
KIT AIS Data Set: Multiple labeled training and evaluation datasets of aerial images of crowds (Vehicles and people).
-
Wilt Dataset: This data set contains some training and testing data from a remote sensing study by Johnson et al. (2013) that involved detecting diseased trees in Quickbird imagery. There are few training samples for the ‘diseased trees’ class (74) and many for ‘other land cover’ class (4265).
-
Forest type mapping Dataset: Satellite imagery of forests in Japan.
Format: text
Default task: Classification -
Overhead Imagery Research Dataset: Overhead Imagery Research Data Set (OIRDS) - an annotated data library & tools to aid in the development of computer vision algorithms.
Format: Images, text
Default task: Classification -
SpaceNet: SpaceNet is a corpus of commercial satellite imagery and labeled training data.
Format: Images
Default task: Classification, Object Identification -
UC Merced Land Use Dataset: These images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the US.
Format: Image
Default task: Classification -
SAT-4 and SAT-6 Airborne Dataset: Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. SAT-4 has four broad land cover classes: barren land, trees, grassland, and a class consisting of all land cover other than the first three. SAT-6 has six broad land cover classes: barren land, trees, grassland, roads, buildings, and water bodies.
Format: Image
Default task: Classification -
Satellite imagery: Free data, mostly from NASA and ESA, can be found in specialized catalogs, where users search based on the area of interest, required resolution, or capture date. The most popular satellites with free data in the visible spectrum are Landsat-8 and Sentinel-2.
Format: Image
Default task: Classification -
Awesome Satellite Imagery Datasets: List of aerial and satellite imagery datasets with annotations for computer vision and deep learning. Newest datasets at the top of each category (Instance segmentation, object detection, semantic segmentation, scene classification, other).
Format: Image
Default task: various -
Road Segmentation in Satellite Imagery: Goal — To segment road lines in satellite imagery. Application — Helps in urban planning and monitoring roadways. Details — 1K+ images with associated instance masks for detecting different road regions
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector -
Traversable region segmentation in Synthetically generated lunar imagery: Goal — To segment out rocks and find traversable region in lunar imagery. Application — Essential element in autonomous rovers’ path planning. Details — 10K+ images with associated instance masks for detecting different rocks and flat ground
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector -
Cars and Swimming Pools Detection in Satellite Imagery: Goal — To detect vehicles and pools in satellite imagery. Application — This forms a crucial part in property tax estimation. Details — 3.5K+ images with 5K+ annotations labels on cars and pools
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using cornernet-lite pipeline -
Roads and Residential area segmentation in Aerial Imagery: Goal — To segment road and residential areas in satellite imagery. Application — This forms a crucial part in property tax estimation. Details — 100 very high resolution images with segmentation masks
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector -
Water Body Segmentation in satellite imagery: Goal — To segment water bodies in satellite imagery. Application — Very important to understand how water bodies change and evolve over time. Details — 100 very high resolution images with segmentation masks
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector -
DeepGlobe Land Cover Classification Challenge: Automatic categorization and segmentation of land cover is of great importance for sustainable development, autonomous agriculture, and urban planning. The challenge is defined as a multi-class segmentation task to detect areas of urban, agriculture, rangeland, forest, water, barren, and unknown land. Evaluation is based on the accuracy of the class labels.
Format:
Default task:
Ref:Guide -
Oil Tanks Dataset: Goal — To detect tanks in satellite imagery. Application — To keep track of oil tanks. Details — 10K+ images with 10K+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom classifier using retinanet pipeline
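Many of the detection sets above distribute their labels in COCO JSON, one of the formats listed at the top of this page. As a hedged sketch of how such a file is typically read — the file contents, image name, and categories below are invented for illustration:

```python
import json
from collections import Counter

# Minimal, made-up COCO-style payload; real annotation files use the same
# top-level keys (images / annotations / categories) with more fields.
coco = json.loads("""
{
  "images": [{"id": 1, "file_name": "tile_001.jpg", "width": 512, "height": 512}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [48, 96, 32, 32]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [200, 150, 64, 40]}
  ],
  "categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "pool"}]
}
""")

# Map numeric category ids to names, then count annotations per class.
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(dict(counts))  # per-class annotation counts
```

The same three-table layout (images, annotations, categories joined by ids) carries over to the full-size files, so per-class statistics like the annotation counts quoted in these entries reduce to a dictionary join.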
Thermal images
-
Lincoln Centre for Autonomous Systems: This dataset was recorded for evaluating thermal-based physiological monitoring algorithms that measure respiration and heart rate. It contains thermal images of human faces acquired at the Lincoln Centre for Autonomous Systems (L-CAS) at the University of Lincoln, UK. Data were recorded into separate rosbag files, one per person. The thermal camera recorded each person for two minutes at 27 Hz. Participants were asked to stay still for the first minute, then move their head up and down, forward and back, and turn right and left, holding each action for 10 seconds.
Format: ROSBAG
Default task: various -
Tufts Face Database Thermal Cropped: The Tufts Face Database is a comprehensive, large-scale face dataset (over 10,000 images; 74 females and 38 males from more than 15 countries, with ages ranging from 4 to 70 years) spanning six image modalities: visible, near-infrared, thermal, computerized sketch, recorded video, and 3D images. This subset contains the cropped thermal images; the other modalities are made available through separate links.
Format: Image
Default task: various -
FREE FLIR Thermal Dataset for Algorithm Training: The FLIR starter thermal dataset enables developers to start training convolutional neural networks (CNN), empowering the automotive community to create the next generation of safer and more efficient ADAS and driverless vehicle systems using cost-effective thermal cameras from FLIR.
Format: Image
Default task: various -
Visible-Infrared Database: This database was developed by SMT/COPPE/Poli/UFRJ and IME-Instituto Militar de Engenharia within the CAPES/Pró-Defesa Program, in a partnership with IPqM-Instituto de Pesquisa da Marinha. The infrared and visible sequences are synchronized and registered.
Format: Image
Default task: image fusion -
OTCBVS Benchmark Dataset Collection: A publicly available benchmark dataset for testing and evaluating novel and state-of-the-art computer vision algorithms, created in response to requests from researchers and students for a benchmark of non-visible (e.g., infrared) images and videos. The benchmark contains videos and images recorded in and beyond the visible spectrum and is available free to all researchers in the international computer vision community. It also allows IEEE and SPIE vision conference and workshop participants to explore the benefits of the non-visible spectrum in real-world applications and to contribute to the OTCBVS workshop series. The effort was initiated by Dr. Riad I. Hammoud in 2004, hosted at Ohio State University and managed by Dr. James W. Davis until 2013, and is currently managed by Dr. Guoliang Fan at Oklahoma State University.
Format: Image, Thermal Default task: Various. -
FMTV - Laval Face Motion and Time-Lapse Video Database: The ULFMT database was gathered at the MiViM research chair of Canada at Université Laval during the PhD study of Dr. Reza Shoja Ghiass, in the laboratory of Prof. Hakim Bendada and Prof. Xavier Maldague. It was recorded with an Indigo Phoenix thermal camera between 2010 and 2014.
Format: Thermal Default task: Various -
VAIS: VAIS contains simultaneously acquired, unregistered thermal and visible images of ships acquired from piers. It is suitable for multi-modal object classification research.
Format: Thermal Default task: Various -
Iran Thermography: Specialized reference of thermography.
Format: Thermal Default task: Various -
Carl Database: Visible and thermal images were acquired with a TESTO 880-3 thermographic camera, equipped with an uncooled detector with a spectral sensitivity range of 8 to 14 μm and a germanium optical lens (approximate cost: 8,000 EUR). For the NIR images, a customized Logitech QuickCam Messenger E2500 was used, fitted with a silicon-based CMOS image sensor sensitive to the entire visible spectrum and the lower half of the NIR (up to roughly 1,000 nm), at a cost of about 30 EUR. The camera's default optical filter was replaced by a pair of Kodak daylight filters (Wratten 87 and 87C, which have similar spectral responses) interspersed between the optics and the sensor. In addition, a purpose-built printed circuit board (PCB) with 16 infrared LEDs (IREDs) emitting from 820 to 1,000 nm provided the required illumination.
Format: Thermal Default task: Various -
Person Detection using Infrared Images: Goal — To detect people in infrared imagery. Application — Autonomous vehicles are equipped with infrared cameras to detect objects in adverse conditions. Details — 30 video sequences with 1K+ annotations.
Format: Thermal
Default task: Various
Ref: How to utilize the dataset and build a custom detector using Mx-Rcnn pipeline
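Thermal cameras like those behind the datasets above typically emit 14- or 16-bit radiometric counts, while most CNN pipelines expect 8-bit input. A minimal per-frame min-max rescale, sketched here with plain lists and not tied to any particular dataset's file format:

```python
def normalize_frame(frame, out_max=255):
    """Rescale a 2D list of raw thermal counts to 0..out_max integers."""
    flat = [v for row in frame for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on a flat frame
    return [[round((v - lo) * out_max / span) for v in row] for row in frame]

raw = [[7000, 7100], [7200, 7400]]  # hypothetical raw counts
print(normalize_frame(raw))  # [[0, 64], [128, 255]]
```

Per-frame normalization like this discards absolute temperature, so for tasks that need radiometric values (e.g. the physiological monitoring sets above) a fixed global range is usually preferred instead.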
Wildlife related datasets
-
Tiger Detection Dataset (Subsampled from OpenImages): Goal — To detect tigers in natural and drone images. Application — To monitor endangered species. Details — 2K+ images with 4k+ annotations.
Format:
Default task: Ref:How to utilize the dataset and build a custom detector using cornernet-lite pipeline -
Monkey Detection Dataset:
Format:
Default task:
Ref:Tutorial -
Zebras and Giraffes Detection Dataset: Goal — To detect zebra and giraffe species in natural and drone images. Application — To monitor endangered species. Details — 5K+ images with 5k+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using efficientdet-d3 pipeline -
Caltech Cameratrap Dataset: Goal — To detect animals in trap camera types images. Application — To monitor endangered species. Details — 10K+ images with 8k+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using retinanet pipeline
Another tutorial -
Elephant Detection Dataset (Subsampled from COCO dataset): Goal — To detect elephant species in natural and drone images. Application — To monitor endangered species. Details — 5K+ images with 5k+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using mmdet-maskrcnn
Underwater datasets
-
Detecting Sea Turtles in the wild: Goal — To detect sea turtles in underwater images. Application — To monitor endangered species. Details — 5K+ images with 5k+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using efficientdet -
Underwater trash detection Dataset: Goal — To detect marine trash. Application — To monitor and control the marine waste problem. Details — 2K+ images with 5k+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using efficientdet -
Trash segmentation dataset:
Format:
Default task: Segmentation
Ref:Tutorial -
SUIM underwater object detection dataset: Goal — To segment underwater objects. Application — Path planning for autonomous underwater vehicles, track divers and monitor marine species. Details — 1.5K+ images with 1.5k+ annotation masks.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector -
Brackish underwater fish recognition dataset: Goal — To detect marine species in underwater imagery. Application — To monitor marine species. Details — 89 videos annotated for fish, crabs, shrimp, jellyfish, and starfish.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using mmdet — faster rcnn pipeline
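The detector tutorials referenced throughout these entries are normally evaluated with intersection-over-union (IoU) between predicted and ground-truth boxes. A minimal sketch, using (x1, y1, x2, y2) corner coordinates:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

A detection is usually counted as a true positive when its IoU with a ground-truth box exceeds a threshold such as 0.5; metrics like mAP then aggregate over classes and thresholds.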
Document analysis datasets
-
Document Layout Detection Dataset: Goal — To detect document layout for further analysis. Application — Essential for segmenting images into different parts so that rule-based NLP and text recognition can be applied. Details — 5K+ images with 10k+ annotations with labels such as paragraphs, images, headers.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using mx-rcnn -
IIIT-AR-13K: A New Dataset for Graphical Object Detection in Documents: We introduce a new dataset for graphical object detection in business documents, more specifically annual reports. This dataset, IIIT-AR-13K, is created by manually annotating the bounding boxes of graphical or page objects in publicly available annual reports. This dataset contains a total of 13K annotated page images with objects in five different popular categories — table, figure, natural image, logo, and signature. This is the largest manually annotated dataset for graphical object detection. Annual reports created in multiple languages for several years from various companies bring high diversity into this dataset. We benchmark IIIT-AR-13K dataset with two state of the art graphical object detection techniques using Faster R-CNN [18] and Mask R-CNN [11] and establish high baselines for further research. Our dataset is highly effective as training data for developing practical solutions for graphical object detection in both business documents and technical articles. By training with IIIT-AR-13K, we demonstrate the feasibility of a single solution that can report superior performance compared to the equivalent ones trained with a much larger amount of data, for table detection. We hope that our dataset helps in advancing the research for detecting various types of graphical objects in business documents.
Format:
Default task:
Ref:Tutorial -
Total-Text Dataset: Goal — To localize text in natural scenes. Application — Essential base component to recognize using OCR. Details — 1.5K+ images with 5K+ polygonal annotations
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Text-Snake pipeline
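Scene-text datasets such as Total-Text store annotations as polygon vertex lists rather than rectangles. The covered area of such a polygon can be computed with the shoelace formula; this is a generic sketch, not Total-Text's exact file layout:

```python
def polygon_area(points):
    """Shoelace area of a simple polygon given as [(x, y), ...] vertices."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

quad = [(0, 0), (4, 0), (4, 3), (0, 3)]  # axis-aligned 4x3 rectangle
print(polygon_area(quad))  # 12.0
```

Filtering out degenerate or tiny annotation polygons by area is a common preprocessing step before training a polygon-based text detector.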
Other images
-
Quantum simulations of an electron in a two-dimensional potential well: Labeled images of raw input to a simulation of 2D quantum mechanics.
Format: Image
Default task: Regression -
MPII Cooking Activities Dataset: Videos and images of various cooking activities.
Format: Labeled video, images, text
Default task: Classification -
MPII Emo Dataset: Emotion Recognition from Embedded Bodily Expressions and Speech during Dyadic Interactions.
Format: Labeled video, images, text
Default task: Classification -
FAMOS Dataset: 5,000 unique microstructures; all samples have been acquired three times with two different cameras.
Format: Labeled video, images, text
Default task: Classification -
PharmaPack Dataset: 1,000 unique classes with 54 images per class
Format: Images and .mat files
Default task: Fine-grain classification -
Stanford Dogs Dataset: Images of 120 breeds of dogs from around the world.
Format: Images, text
Default task: Fine-grain classification -
The Oxford-IIIT Pet Dataset: 37-category pet dataset with roughly 200 images per class. The images have large variations in scale, pose, and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel-level trimap segmentation.
Format: Images, text
Default task: Classification, object detection -
Corel Image Features Data Set: Database of images with features extracted.
Format: text
Default task: Classification, object detection -
Online Video Characteristics and Transcoding Time Dataset: Transcoding times for a variety of videos and their properties.
Format: text
Default task: Regression -
Microsoft Sequential Image Narrative Dataset (SIND): Dataset for sequential vision-to-language.
Format: Images, text
Default task: Visual storytelling -
Caltech-UCSD Birds-200-2011 Dataset: Large dataset of images of birds.
Format: Images, text
Default task: Classification. -
YouTube-8M: Large and diverse labeled video dataset.
Format: Video, text
Default task: Video classification. -
YFCC100M: This YFCC100M dataset contains a list of photos and videos on Yahoo! Flickr, which are licensed under one of the Creative Commons copyright licenses.
Format: Video, Image, Text
Default task: Video and Image classification. -
Discrete LIRIS-ACCEDE: Short videos annotated for valence and arousal.
Format: Video
Default task: Video emotion elicitation detection. -
Continuous LIRIS-ACCEDE: Long videos annotated for valence and arousal while also collecting Galvanic Skin Response.
Format: Video
Default task: Video emotion elicitation detection. -
MediaEval LIRIS-ACCEDE: Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films.
Format: Video
Default task: Video emotion elicitation detection. -
Leeds Sports Pose: Articulated human pose annotations in 2000 natural sports images from Flickr.
Format: Images plus .mat file labels
Default task: Human pose estimation. -
Leeds Sports Pose Extended Training: Articulated human pose annotations in 10,000 natural sports images from Flickr.
Format: Images plus .mat file labels
Default task: Human pose estimation. -
Leeds Sports Pose Extended Training: 6 different real multiple choice-based exams (735 answer sheets and 33,540 answer boxes) to evaluate computer vision techniques and systems developed for multiple choice test assessment systems.
Format: Images and .mat file labels Default task: Development of multiple choice test assessment systems. -
Surveillance Videos: Real surveillance videos covering a long surveillance period (7 days, 24 hours per day).
Format: Videos Default task: Data compression. -
Can We See Photosynthesis?: 32 videos for eight live and eight dead leaves recorded under both DC and AC lighting conditions.
Format: Videos Default task: Liveness detection of plants. -
Malaria Datasets: Hosts a repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity (https://peerj.com/articles/4568/).
Format: Image Default task: Classification. -
The Cancer Imaging Archive : The image data in The Cancer Imaging Archive (TCIA) is organized into purpose-built collections. A collection typically includes studies from several subjects (patients). In some collections, there may be only one study per subject. In other collections, subjects may have been followed over time, in which case there will be multiple studies per subject. The subjects typically have a disease and/or particular anatomical site (lung, brain, etc.) in common.
Format: Image Default task: Various. -
CMU Panoptic Dataset: CMU Panoptic Studio dataset is shared only for research purposes, and this cannot be used for any commercial purposes. The dataset or its modified version cannot be redistributed without permission from dataset organizers.
Format: Image Default task: Various. -
The Quick, Draw! Dataset: The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located. You can browse the recognized drawings on quickdraw.withgoogle.com/data.
Format: bin, ndjson Default task: drawing sequence. -
Pix2Pix Datasets: Datasets for Pix2Pix GAN
Format:
Default task: Pix2Pix GAN. -
Human Foot Keypoint Dataset: Existing human pose datasets contain limited body part types. The MPII dataset annotates ankles, knees, hips, shoulders, elbows, wrists, necks, torsos, and head tops, while COCO also includes some facial keypoints. For both of these datasets, foot annotations are limited to ankle position only. However, graphics applications such as avatar retargeting or 3D human shape reconstruction require foot keypoints such as big toe and heel. Without foot information, these approaches suffer from problems such as the candy wrapper effect, floor penetration, and foot skate. To address these issues, a small subset of foot instances out of the COCO dataset is labeled using the Clickworker platform. It is split up with 14K annotations from the COCO training set and 545 from the validation set. A total of 6 foot keypoints are labeled. We consider the 3D coordinate of the foot keypoints rather than the surface position. For instance, for the exact toe positions, we label the area between the connection of the nail and skin, and also take depth into consideration by labeling the center of the toe rather than the surface.
Format:
Default task: Keypoint estimation. -
Dataset for Affective States in E-Environments: The difference between real and virtual worlds is shrinking at an astounding pace. With more and more users working on computers to perform a myriad of tasks from online learning to shopping, interaction with such systems is an integral part of life. In such cases, recognizing a user’s engagement level with the system (s)he is interacting with can change the way the system interacts back with the user. This will lead not only to better engagement with the system but also pave the way for better human-computer interaction. Hence, recognizing user engagement can play a crucial role in several contemporary vision applications including advertising, healthcare, autonomous vehicles, and e-learning. However, the lack of any publicly available dataset to recognize user engagement severely limits the development of methodologies that can address this problem. To facilitate this, we introduce DAiSEE, the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely - very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. We have also established benchmark results on this dataset using state-of-the-art video classification methods that are available today. We believe that DAiSEE will provide the research community with challenges in feature extraction, context-based inference, and development of suitable machine learning methods for related tasks, thus providing a springboard for further research.
Format:
Default task: Video classification. -
EPIC-KITCHENS-100: The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers’ homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel ‘Pause-and-Talk’ narration interface.
Format:
Default task: Various -
cvimagery: Open-source, community-driven AI image datasets; registration is free and grants immediate repository access.
Format:
Default task: -
Winegrape Detection Dataset: Goal — To detect grape clusters in vineyards. Application — To monitor growth and analyze yield. Details — 300 images with 4,400 bounding boxes over 5 classes of grapes.
Format:
Default task: Object detection -
Global Wheat Detection Dataset: The data is images of wheat fields, with bounding boxes for each identified wheat head. Not all images include wheat heads/bounding boxes. The images were recorded in many locations around the world.
Format:
Default task: -
Object Detection in Low Lighting Conditions: In order to facilitate new object detection and image enhancement research, particularly in low-light environments, we introduce the Exclusively Dark (ExDark) dataset (CVIU 2019). ExDark is a collection of 7,363 low-light images, from very low-light environments to twilight (i.e. 10 different conditions), with 12 object classes (similar to PASCAL VOC) annotated at both the image class level and with local object bounding boxes.
Format:
Default task: -
LARA Traffic Lights Detection Dataset: Goal — To detect traffic lights and classify them as red, green, or yellow. Application — Rule-setting for ADAS and self-driving car systems at road network junctions. Details — 11K frames with 20K+ annotations over three classes of traffic lights.
Format:
Default task:
Ref: How to utilize the dataset and build a custom detector using Mmdet-Faster-Rcnn-fpn50 pipeline -
Pothole Detection Dataset: Goal — To detect potholes from on-road imagery. Application — Detecting road terrain and potholes enables smooth driving. Details — 700 images with 3K+ annotations on potholes.
Format:
Default task:
Ref: How to utilize the dataset and build a custom detector using M-Rcnn pipeline -
Nexet Vehicle Detection Dataset: Goal — To detect vehicles in on-road imagery. Application — Detecting vehicles is a prime component of autonomous driving. Details — 7000 images with 15K+ annotations on 6 types of vehicles.
Format:
Default task:
Ref: How to utilize the dataset and build a custom detector using Tensorflow Object Detection API -
BDD100K ADAS Dataset: Goal — To detect on-road objects. Application — Detecting vehicles, traffic signs, and people is a prime component of autonomous driving. Details — 100K images with 250K+ annotations on 10 types of objects.
Format:
Default task: -
Linkopings Traffic Signs Dataset: Goal — To detect traffic signs in images. Application — Detecting traffic signs is the first step towards understanding traffic rules. Details — 3K images with 5K+ annotations on 40+ types of traffic signs.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Mmdet — Cascade Mask Rcnn -
Billboard Detection (Subsampling OpenImages Dataset) Dataset: Goal — To detect billboards in images. Application — Detecting billboards forms a crucial part in auto-analyzing marketing campaigns across the city. Details — 2K images with 5K+ annotations on billboards.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Retinanet -
DeepFashion2 Fashion element Detection Dataset: DeepFashion2 is a comprehensive fashion dataset. It contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers. It has a total of 801K clothing items, where each item in an image is labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks, and per-pixel mask. There are also 873K Commercial-Consumer clothes pairs. The dataset is split into a training set (391K images), a validation set (34K images), and a test set (67K images).
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using CornerNet-Lite Pipeline -
Taobao Commodity Dataset: TCD contains 800 commodity images (dresses, jeans, T-shirts, shoes, and hats) from shops on the Taobao website. Ground truth masks were obtained by inviting Taobao sellers to annotate their commodities, i.e., masking the salient objects they want to show in their exhibition. The images cover all kinds of commodities, with and without human models, and thus have complex backgrounds and scenes with highly complex foregrounds. Pixel-accurate ground truth masks are given.
Format:
Default task: -
Qmul-OpenLogo Logo Detection Dataset: Goal — To detect different logos in natural images. Application — Analyzing the frequency of logo appearances in videos and natural scenes is crucial in marketing. Details — 16K training images with logos from all kinds of brands — food, vehicles, restaurant chains, delivery services, airlines, etc.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using mx-rcnn pipeline -
Football Detection Dataset (Subsampling from OpenImages Dataset): Goal — To detect a football across frames in videos. Application — Detecting football positions is crucial in auto-analyzing situations such as offsides. Details — Around 3K training images.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using yolo-v3 pipeline -
Playing Card Type Detection: Goal — To detect playing cards in natural images and classify the card type. Application — A possible application is analyzing winning odds in different card games. Details — 500+ images over 52 card class types.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using mx-rcnn pipeline -
Soccer Player Detection in Thermal Imagery: Goal — To localize and track players using thermal imagery. Application — Tracking players in the game is a crucial part of generating analytics. Details — 3K+ images with 5K+ annotations.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using mmdet faster-rcnn pipeline -
MIO-TCD Vehicle Detection in CCTV Traffic Cams: Goal — To detect vehicles in CCTV traffic cameras. Application — Detecting vehicles in CCTV traffic cams forms a crucial part of security surveillance applications. Details — 113K images with 200K+ annotations on 5+ types of vehicles.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Mmdet — Retinanet pipeline -
WIDER Person Detection Dataset: Goal — To detect people in CCTV and natural scene images and videos. Application — CCTV-based people detection forms the core of security and surveillance applications. Details — 10K+ images with 20K+ annotations on detecting pedestrians.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Cornernet-Lite pipeline -
Protective Gear — Helmet and Vest Detection: Goal — To detect helmets and vests on people. Application — This forms an integral part of security compliance monitoring. Details — 1.5K+ images with 2K+ annotations on detecting people, helmets, and vests.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Mmdet — Cascade RPN -
Anomaly Detection in Videos: Goal — To classify videos as per the actions being carried out in them. Application — Detecting anomalies in real time helps in stopping crime. Details — 1K+ videos corresponding to 10 anomaly classes.
Format:
Default task:
Ref:How to utilize the dataset and build a custom classifier using mmaction-tsn50 pipeline -
TACO Trash Detection Dataset: Goal — To localize and segment all kinds of garbage in images. Application — Critical component in autonomous bots trying to tackle trash problem in public places. Details — 10K images with 15K+ annotations over 20+ different classes trash objects
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Retinanet pipeline -
Indoor Scene General Object Detection Dataset: Goal — To localize and detect indoor objects in images. Application — Autotag images in real-estate and rental websites with amenities. Details — 3K+ images with 5K+ annotations over 10+ different classes indoor objects such as electronic-appliances, bed, curtains, chairs, etc
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Retinanet pipeline -
EgoHands Hand Segmentation Dataset: Goal — To segment hands in natural scenes. Application — First step towards understanding gestures, with applications in human computer interaction, sign language recognition. Details — 4.8K+ images with corresponding hand masks.
Format:
Default task:
Ref:How to utilize the dataset and build a custom detector using Retinanet pipeline -
UCF Action recognition dataset: Goal — To classify videos as per actions being carried out in videos. Application — Tagging videos is important in storing and retrieving large number of videos. Details — 1K+ videos corresponding to 101 action type classes.
Format:
Default task:
Ref:How to utilize the dataset and build a custom classifier using mmaction-tsn50 pipeline
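Several annotation formats recur throughout this list (YOLO, Pascal VOC, COCO). YOLO stores boxes as normalized center/size values, while Pascal VOC uses pixel corners; converting between them is a small utility worth having. A sketch, with made-up numbers:

```python
def yolo_to_voc(box, img_w, img_h):
    """Convert a YOLO (cx, cy, w, h) box, all normalized to [0, 1],
    into Pascal VOC pixel corners (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return round(xmin), round(ymin), round(xmax), round(ymax)

# A centered box covering a quarter of the width and half the height:
print(yolo_to_voc((0.5, 0.5, 0.25, 0.5), 640, 480))  # (240, 120, 400, 360)
```

COCO's `bbox` field uses a third convention, (xmin, ymin, width, height) in pixels, so check which of the three a given download actually uses before training.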
Text data
Reviews
-
Amazon reviews: Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.
Format: Text
Default task: Classification, sentiment analysis. -
Car Evaluation Data Set: Car properties and their overall acceptability.
Format: Text
Default task: Classification. -
YouTube Comedy Slam Preference Dataset: User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.
Format: Text
Default task: Classification. -
Skytrax User Reviews Dataset: A scraped dataset created from all user reviews found on Skytrax (www.airlinequality.com). It is unknown under which license Skytrax published these reviews. However, the reviews are accessible by anyone with a browser and the robots.txt on their website did not specifically prohibit the scraping of them.
Format: Text
Default task: Classification, regression. -
Teaching Assistant Evaluation Dataset: The data consist of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories (“low”, “medium”, and “high”) to form the class variable.
Format: Text
Default task: Classification. -
Inside AirBnB Dataset: Sourced from publicly available information from the Airbnb site. The data has been analyzed, cleansed and aggregated where appropriate to facilitate public discussion.
Format: Text, zipped
Default task: Classification, topic modeling
News Articles
-
NYSK Dataset: English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.
Format: XML, text
Default task: Sentiment analysis, topic extraction. -
The Reuters Corpus Volumes 1 & 2: Large corpus of Reuters news stories in multiple languages.
Format: XML, text
Default task: Classification, clustering, summarization. -
Thomson Reuters Text Research Collection (TRC2): The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14 or 2,871,075,221 bytes, and was initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (that contains results of a large blog crawl carried out at the University of Glasgow). TRC2 is distributed via web download.
Format: XML, text
Default task: Classification, clustering, summarization. -
Saudi Newspapers Corpus: 31,030 Arabic newspaper articles.
Format: json
Default task: Summarization, clustering. -
RE3D (Relationship and Entity Extraction Evaluation Dataset): Entity and Relation marked data from various news and government sources. Sponsored by Dstl.
Format: json
Default task: Classification, Entity and Relation recognition -
ABC Australia News Corpus: Entire news corpus of ABC Australia from 2003 to 2017.
Format: CSV
Default task: Clustering, Events, Sentiment -
Examiner Pseudo-News Corpus: Clickbait, spam, crowd-sourced headlines from 2010 to 2015.
Format: CSV
Default task: Clustering, Events, Sentiment -
Worldwide News - Aggregate of 20K Feeds: One week snapshot of all online headlines in 20+ languages.
Format: CSV
Default task: Clustering, Events, Language Detection -
The Irish Times IRS: 12 Years of Events From Ireland.
Format: CSV
Default task: NLP, Computational Linguistics, Events
Messages
-
Enron Email Dataset: This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
Format: Text
Default task: Network analysis, sentiment analysis -
Ling-Spam Dataset: A dataset that contains spam messages and legitimate messages from the Linguist list.
Format: Text
Default task: Classification -
PU datasets: A collection of encrypted datasets that contain spam messages and ham messages from real users.
Format: Text
Default task: Classification -
SMS Spam Collection Dataset: The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English messages, each tagged as ham (legitimate) or spam.
Format: Text
Default task: Classification -
Twenty Newsgroups Dataset: The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.
Format: Text
Default task: Natural language processing -
Spambase Dataset: Spam emails.
Format: Text
Default task: Spam detection, classification
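Datasets like the SMS Spam Collection and Spambase above are the usual starting points for binary spam classification. As a minimal, standard-library-only sketch (the four messages below are invented stand-ins, not rows from either corpus), an add-one-smoothed Naive Bayes classifier looks like this:

```python
import math
from collections import Counter

# Toy stand-in for the label<TAB>message format of the SMS Spam
# Collection; the real corpus has 5,574 labeled messages.
SAMPLE = [
    ("spam", "win a free prize now claim your cash"),
    ("spam", "free cash prize call now"),
    ("ham", "are we still meeting for lunch today"),
    ("ham", "see you at the office later today"),
]

def train(examples):
    """Return per-class token counts and per-class message counts."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in examples:
        counts[label].update(text.split())
        totals[label] += 1
    return counts, totals

def predict(counts, totals, text):
    """Pick the class with the highest add-one-smoothed log-probability."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label, ctr in counts.items():
        n = sum(ctr.values())
        score = math.log(totals[label] / sum(totals.values()))
        for w in text.split():
            score += math.log((ctr[w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

counts, totals = train(SAMPLE)
print(predict(counts, totals, "claim your free prize"))      # → spam
print(predict(counts, totals, "lunch at the office today"))  # → ham
```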
Twitter and tweets
-
MovieTweetings: Movie rating dataset based on public and well-structured tweets.
Format: Text
Default task: Classification, regression -
Twitter100k: Pairs of images and tweets.
Format: Text and Images
Default task: Cross-media retrieval -
Sentiment140: Tweet data from 2009 including original text, time stamp, user and sentiment.
Format: Tweets, comma-separated values
Default task: Sentiment analysis -
ASU Twitter Dataset: Twitter network data, not actual tweets. Shows connections between a large number of users.
Format: Text
Default task: Clustering, graph analysis -
SNAP Social Circles: Twitter Database: This dataset consists of ‘circles’ (or ‘lists’) from Twitter. Twitter data was crawled from public sources. The dataset includes node features (profiles), circles, and ego networks.
Format: Text
Default task: Clustering, graph analysis -
Twitter Dataset for Arabic Sentiment Analysis: Arabic tweets.
Format: Text
Default task: Classification -
Buzz in Social Media Dataset: Data from Twitter and Tom’s Hardware. This dataset focuses on specific buzz topics being discussed on those sites.
Format: Text
Default task: Regression, Classification -
Paraphrase and Semantic Similarity in Twitter (PIT): This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled.
Format: Text
Default task: Regression, Classification -
Geoparse Twitter benchmark dataset: This dataset contains tweets during different news events in different countries. Manually labeled location mentions.
Format: Tweets, JSON
Default task: Classification, Information Extraction
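Several of the tweet datasets above, Sentiment140 in particular, ship as plain CSV. A small parsing sketch, assuming Sentiment140's commonly documented six-column layout (polarity, tweet id, date, query flag, user, text) and using two invented rows:

```python
import csv
import io

# Two invented rows imitating the Sentiment140 CSV layout; treat the
# exact column order as an assumption about the corpus.
RAW = io.StringIO(
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","userA","this is awful"\n'
    '"4","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","userB","loving this so much"\n'
)

LABELS = {"0": "negative", "4": "positive"}  # Sentiment140 encodes polarity as 0/4
rows = [
    {"user": user, "text": text, "sentiment": LABELS[polarity]}
    for polarity, _id, _date, _flag, user, text in csv.reader(RAW)
]
print(rows[0]["sentiment"], rows[1]["sentiment"])  # → negative positive
```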
Dialogues
-
NPS Chat Corpus: Posts from age-specific online chat rooms.
Format: XML
Default task: NLP, programming, linguistics -
Twitter Triple Corpus: A-B-A triples extracted from Twitter.
Format: Text
Default task: NLP -
UseNet Corpus: UseNet forum postings.
Format: Text
Default task: -
NUS SMS Corpus: SMS messages collected between two users, with timing analysis.
Format: Text
Default task: NLP -
Reddit All Comments Corpus: This is an archive of Reddit comments from October 2007 until May 2015 (complete months). This reflects 14 months of work and a lot of API calls. The dataset includes nearly every publicly available Reddit comment; approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues.
Format: JSON
Default task: NLP, Research -
Ubuntu Dialogue Corpus v2.0: Dialogues extracted from Ubuntu chat stream on IRC.
Format: CSV
Default Task: Dialogue Systems Research -
Coached Conversational Preference Elicitation: A dataset consisting of 502 dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an ‘assistant’, while the other plays the role of a ‘user’.
Format: JSON
Default Task: Text annotation -
Taskmaster-1 dataset: The full Taskmaster-1 dialog dataset has a total of 13,215 dialogs: 7,708 written and 5,507 spoken. A full description of the data is provided in readme.txt. To get a basic idea of the dialog content see sample.json. The annotation schema is viewable in ontology.json.
Format: JSON
Default Task: Text annotation
Question Answering
-
SQuAD:
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to each question is a segment of text (span) from the corresponding reading passage, or the question may be unanswerable. The dataset was presented by researchers at Stanford University, and SQuAD 2.0 contains more than 100,000 questions.
Format:
Default task: -
Natural Questions (NQ):
Natural Questions (NQ) is a new, large-scale corpus for training and evaluating open-domain question answering systems. Presented by Google, this dataset is the first to replicate the end-to-end process in which people find answers to questions. It contains 300,000 naturally occurring questions, along with human-annotated answers from Wikipedia pages, to be used in training QA systems. Furthermore, researchers added 16,000 examples where answers (to the same questions) are provided by 5 different annotators, which is useful for evaluating the performance of learned QA systems.
Format:
Default task: -
Question Answering in Context:
Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information-seeking dialog. Each instance is an interactive dialogue between two crowd workers: a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and a teacher who answers the questions by providing short excerpts (spans) from the text. It contains 14K information-seeking QA dialogs comprising 100K QA pairs in total.
Format:
Default task: -
Conversational Question Answering (CoQA):
Conversational Question Answering (CoQA), pronounced "coca", is a large-scale dataset for building conversational question answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. The dataset contains 127,000+ questions with answers collected from 8,000+ conversations.
Format:
Default task: -
HOTPOTQA:
HOTPOTQA is a dataset of 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or schemas; (3) sentence-level supporting facts are provided, allowing QA systems to reason with strong supervision and explain their predictions; and (4) a new type of factoid comparison question tests QA systems’ ability to extract relevant facts and perform the necessary comparison.
Format:
Default task: -
ELI5:
ELI5 (Explain Like I’m Five) is a long-form question answering dataset. It is a large-scale, high-quality dataset released together with supporting web documents and two pre-trained models. Created by Facebook, it comprises 270K threads of diverse, open-ended questions that require multi-sentence answers.
Format:
Default task: -
ShARC:
Shaping Answers with Rules through Conversations (ShARC) is a QA dataset which requires logical reasoning, elements of entailment/NLI and natural language generation. The dataset consists of 32k task instances based on real-world rules and crowd-generated questions and scenarios.
Format:
Default task: -
MS MARCO:
MS MARCO, or Human Generated MAchine Reading COmprehension Dataset, is a large-scale dataset created by Microsoft AI & Research. It consists of 1,010,916 anonymized questions sampled from Bing’s search query logs, each with a human-generated answer, plus 182,669 completely human-rewritten answers. The dataset is intended for non-commercial research purposes, to promote advancement in artificial intelligence and related areas.
Format:
Default task: -
TWEETQA:
TWEETQA is a social media-focused question answering dataset. This dataset is created by the researchers at IBM and the University of California and can be viewed as the first large-scale dataset for QA over social media data. The dataset now includes 10,898 articles, 17,794 tweets, and 13,757 crowdsourced question-answer pairs.
Format:
Default task: -
NEWSQA:
NewsQA is a challenging machine comprehension dataset of human-generated question-answer pairs: crowd-workers supplied questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. The dataset contains 119,633 natural language questions posed on 12,744 articles.
Format:
Default task: -
Textbook Question Answering:
The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks. This includes 26,260 questions, including 12,567 that have an accompanying diagram.
Format:
Default task:
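Most of the extractive QA datasets above (SQuAD, NewsQA, QuAC) represent answers as character spans into a context passage. A sketch of span recovery on a hand-made miniature record imitating the SQuAD v2.0 fields (not drawn from the real dataset, which nests paragraphs under articles):

```python
import json

# Miniature record imitating SQuAD v2.0 fields: answers carry the
# answer text plus its character offset into the context.
record = json.loads("""
{
  "context": "SQuAD was presented by researchers at Stanford University.",
  "qas": [
    {"question": "Who presented SQuAD?",
     "is_impossible": false,
     "answers": [{"text": "researchers at Stanford University",
                  "answer_start": 23}]}
  ]
}
""")

for qa in record["qas"]:
    if qa["is_impossible"]:
        continue  # SQuAD 2.0 also contains unanswerable questions
    ans = qa["answers"][0]
    start = ans["answer_start"]
    span = record["context"][start:start + len(ans["text"])]
    assert span == ans["text"]  # the offset must reproduce the answer text
    print(qa["question"], "->", span)
```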
Other text
-
Web of Science Datasets: Hierarchical Datasets for Text Classification.
Format: Text
Default task: Classification, Categorization -
Legal Case Reports: Federal Court of Australia cases from 2006 to 2009.
Format: Text
Default task: Summarization, citation analysis -
Blogger Authorship Corpus: Blog entries of 19,320 people from blogger.com
Format: Text
Default task: Sentiment analysis, summarization, classification -
Social Structure of Facebook Networks: Large dataset of the social structure of Facebook.
Format: Text
Default task: Network analysis, clustering -
Dataset for the Machine Comprehension of Text: Stories and associated questions for testing comprehension of text.
Format: Text
Default task: Natural language processing, machine comprehension -
DEXTER Dataset: Task given is to determine, from features given, which articles are about corporate acquisitions.
Format: Text
Default task: Classification -
The Penn Treebank Project: Naturally occurring text annotated for linguistic structure.
Format: Text
Default task: Natural language processing, summarization -
Google Books N-grams: N-grams from a very large corpus of books.
Format: Text (2.2 TB)
Default task: Classification, clustering, regression -
Personae Corpus: Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.
Format: Text
Default task: Classification, regression -
Stack Exchange Data Dump: This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.
Format: Text
Default task: Classification, regression -
TyDi QA: This repository contains information about TyDi QA, code for evaluating results on the dataset, implementations of baseline systems for the dataset, and some advice for working with the dataset.
Format: Text
Default task: -
KILT Benchmarking: KILT is a resource for training, evaluating and analyzing NLP models on Knowledge Intensive Language Tasks.
Format: Text
Default task: -
GoEmotions: A Dataset for Fine-Grained Emotion Classification (https://github.com/google-research/google-research/tree/master/goemotions): a human-annotated dataset of 58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories. As the largest fully annotated English-language fine-grained emotion dataset to date, GoEmotions was designed with both psychology and data applicability in mind. In contrast to the basic six emotions, which include only one positive emotion (joy), its taxonomy includes 12 positive, 11 negative, and 4 ambiguous emotion categories plus 1 “neutral”, making it widely suitable for conversation-understanding tasks that require subtle differentiation between emotion expressions.
Format: Text
Default task: Classification
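As an example of working with a multi-label corpus such as GoEmotions, the sketch below parses a few invented rows in a tab-separated layout of text, comma-separated label ids, and comment id; treat both the layout and the tiny label map as assumptions, not the official release schema.

```python
import csv
import io
from collections import Counter

# A tiny, assumed subset of an emotion id -> name map; the real
# GoEmotions taxonomy has 27 emotions plus "neutral".
ID2LABEL = {0: "admiration", 14: "gratitude", 25: "sadness", 27: "neutral"}

# Invented rows: text <TAB> comma-separated label ids <TAB> comment id.
RAW = io.StringIO(
    "Thanks, this really helped!\t0,14\tabc123\n"
    "I lost my keys again\t25\tdef456\n"
    "It is Tuesday\t27\tghi789\n"
)

label_counts = Counter()
for text, ids, _comment_id in csv.reader(RAW, delimiter="\t"):
    for i in ids.split(","):  # one comment can carry several labels
        label_counts[ID2LABEL[int(i)]] += 1

print(label_counts.most_common())
```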
Medical Datasets
-
Medical Data for Machine Learning: This is a curated list of medical data for machine learning.
-
Breast Cancer Wisconsin (Diagnostic) Data Set: Features computed from digitized images of fine-needle aspirates of breast masses, used to predict whether a mass is benign or malignant. Format: Structured Data Default task: Classification, anomaly detection
-
Clinical Skin Disease Images: SD-198 is an image dataset for automatic recognition and diagnosis of skin diseases. It contains 198 skin diseases and 6,584 clinical images, which are collected by digital cameras and mobile phones. Format: Image Data Default task: Classification, anomaly detection
-
Skin Cancer MNIST: HAM10000: a large collection of multi-source dermatoscopic images of pigmented lesions. Format: Image Data Default task: Classification, anomaly detection
-
The National Library of Medicine Data Distribution: The NLM Data Distribution Program is the preferred access point for bulk downloading of the datasets listed below. Downloading and use of these datasets is completely free of charge. Format: Various Default task: Various
-
PhysioNet: This page displays an alphabetical list of all the databases on PhysioNet. To search content on PhysioNet, visit the search page. Enter the search terms, add a filter for resource type if needed, and select how you would like the results to be ordered (for example, by relevance, by date, or by title).
Each project is made available under one of the following access policies:
Open Access: Accessible by all users, with minimal restrictions on reuse.
Restricted Access: Accessible by registered users who sign a Data Use Agreement.
Credentialed Access: Accessible by registered users who complete the credentialing process and sign a Data Use Agreement.
Format: Various
Default task: Various -
WHO Life Expectancy: Another good one for experimenting with your EDA skills.
-
Ultrasound Brachial Plexus (BP) Nerve Segmentation Dataset: Goal — To segment certain nerve types in ultrasound images. Application — This helps improve pain management through the use of indwelling catheters that block or mitigate pain at the source. Details — 11K+ images with associated instance masks for detecting nerves. Format: Default task: Ref: How to utilize the dataset and build a custom detector
-
PanNuke Cancer Instance Segmentation in Cells: Goal — To segment different cell types in the slide image. Application — Auto-analyzing presence of cancerous and dead cells in terabytes of data. Details — 3K+ images with associated instance masks for detecting different cell types
Format: Default task: Ref: How to utilize the dataset and build a custom detector
Audio Datasets
-
Common Voice: Common Voice is a massive global database of donated voices that lets anyone quickly and easily train voice-enabled apps in potentially every language.
-
(Singapore) National Speech Corpus: First announced in November 2017, the first version of the National Speech Corpus (NSC) is now available for download. It contains 2,000 hours of locally accented audio and corresponding text transcriptions. There are more than 40,000 unique words within the text transcriptions comprising local words such as “Tanjong Pagar”, “ice kachang”, or “nasi lemak”. The data is made available via the Singapore Open Data Licence. Automatic speech recognition engines use multiple corpus collections (collectively called corpora) to accurately train themselves to interpret spoken words and transcribe them. The NSC thus enables global technology providers to provide speech-related applications such as voice assistants, for use here. The NSC will be continually updated.
-
The VoxCeleb1 Dataset: VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
-
The VoxCeleb2 Dataset: VoxCeleb2 contains over 1 million utterances for 6,112 celebrities, extracted from videos uploaded to YouTube. The development set of VoxCeleb2 has no overlap with the identities in the VoxCeleb1 or SITW datasets.
-
LibriSpeech ASR corpus: LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.
-
LibriTTS corpus: LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
-
CSTR VCTK Corpus: This CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker’s accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise the contextual and phonetic coverage. The Rainbow Passage and elicitation paragraph are the same for all speakers.
-
The M-AILABS Speech Dataset: The M-AILABS Speech Dataset is the first large dataset provided free of charge and freely usable as training data for speech recognition and speech synthesis.
-
MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection. Default task: Anomaly Detection
-
RAVDESS Dataset: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18. Default task: Emotion Classification
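Audio corpora such as RAVDESS distribute speech as 16-bit, 48 kHz WAV files. The standard-library sketch below writes and then re-reads a synthetic one-second tone with those same PCM parameters, entirely in memory (the tone is generated, not a RAVDESS recording):

```python
import io
import math
import struct
import wave

# Build a one-second 440 Hz mono tone at 16-bit / 48 kHz, the same
# PCM parameters as RAVDESS audio-only files, and write it in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 2 bytes = 16-bit samples
    w.setframerate(48000)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / 48000)))
        for t in range(48000)
    )
    w.writeframes(frames)

# Re-read the file and recover its parameters, as one would when
# loading a corpus file for feature extraction.
buf.seek(0)
with wave.open(buf, "rb") as w:
    duration = w.getnframes() / w.getframerate()
    print(w.getframerate(), w.getsampwidth() * 8, duration)  # → 48000 16 1.0
```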
IoT Datasets
-
CGIAR dataset: High-resolution climate datasets for a variety of fields, including agriculture
-
Educational Process Mining: Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator
-
Commercial Building Energy Dataset: Energy related data set from a commercial building where data is sampled more than once a minute.
-
Individual household electric power consumption: One-minute sampling rate over a period of almost 4 years
-
AMPds dataset: AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring.
-
UK Domestic Appliance-Level: Power demand from five houses. In each house, both the whole-house mains power demand and the power demand of individual appliances are recorded.
-
PhysioBank databases (Healthcare): Archive of over 80 physiological datasets.
-
Saarbruecken Voice Database (Healthcare): A collection of voice recordings from more than 2000 persons for pathological voice detection.
-
T-LESS (Industry): An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects
-
CityPulse Dataset Collection (Smart City): Road Traffic Data, Pollution Data, Weather, Parking
-
Open Data Institute – node Trento (Smart City): Weather, Air quality, Electricity, Telecommunication
-
Malaga datasets (Smart City): A broad range of categories such as energy, ITS, weather, Industry, Sport, etc.
-
Gas sensors for home activity monitoring (Smart Home): Recordings of 8 gas sensors under three conditions including background, wine and banana presentations.
-
CASAS datasets for activities of daily living (Smart Home): Several public datasets related to Activities of Daily Living (ADL) performance in a two-story home, an apartment, and an office setting.
-
ARAS Human Activity Dataset (Smart Home): Human activity recognition datasets collected from two real houses with multiple residents during two months.
-
MERLSense Data (Smart Home): Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records.
-
SportVU (Sport): Video of basketball and soccer games captured from 6 cameras.
-
RealDisp (Sport): Includes a wide range of physical activities (warm up, cool down and fitness exercises).
-
Taxi Service Trajectory (Transportation): Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal.
-
GeoLife GPS Trajectories (Transportation): GPS trajectories represented as sequences of time-stamped points
-
T-Drive trajectory data (Transportation): Contains one-week trajectories of 10,357 taxis
-
Chicago Bus Traces data (Transportation): Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds.
-
Uber trip data (Transportation): About 20 million Uber pickups in New York City during 12 months
-
Traffic Sign Recognition (Transportation): Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules.
-
DDD17 (Transportation): End-To-End DAVIS Driving Dataset.
-
High Storage System Data for Energy Optimization: The SmartFactory in Lemgo hosts a demonstrator of a high storage system, built and used in previous research projects such as IMPROVE. Its focus is on data-driven energy optimization; it is also used to perform anomaly detection using timed automata.
-
Pump sensor data for predictive maintenance: Sensor readings from a water pump supplying a small area far from a big town; the system failed seven times last year, causing serious problems for many residents. The maintenance team could see no pattern in the data before the failures, so they were unsure where to focus their attention.
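Many of the IoT streams above (energy meters, pump sensors) are univariate time series where a simple trailing-window z-score already makes a useful anomaly-detection baseline. A sketch on synthetic readings, not taken from any listed dataset:

```python
import statistics

def zscore_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates strongly from the trailing window."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.pstdev(hist)
        # Skip flat windows (sigma == 0) to avoid division by zero.
        if sigma and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Synthetic sensor stream with one obvious spike at index 6.
readings = [10.1, 10.0, 9.9, 10.2, 10.0, 10.1, 25.0, 10.0, 9.8]
print(zscore_anomalies(readings))  # → [6]
```

Real deployments would use a longer window and robust statistics (median, MAD), but the structure is the same.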
Recommender Datasets
Book
-
Book Crossing: The BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August/September 2004) from the Book-Crossing community
-
Goodreads Books: Detailed information about books across numerous columns for building a book recommender engine. This is my personal favourite for getting the hang of actually attempting the recommendation task.
Dating
- Dating Agency: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as dumped on April 4, 2006.
E-commerce
- Amazon: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014
- Retailrocket recommender system dataset: The dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.csv) and a file which describes the category tree (category_tree.csv). The data has been collected from a real-world ecommerce website.
Music
- Amazon Music: This digital music dataset contains reviews and metadata from Amazon.
- Yahoo Music: This dataset represents a snapshot of the Yahoo! Music community’s preferences for various musical artists.
- LastFM (Implicit): This dataset contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system.
- Million Song Dataset: The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.
Movies
- MovieLens: GroupLens Research has collected and made available rating datasets from their movie web site
- Yahoo Movies: This dataset contains movie ratings collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Movies services.
- CiaoDVD: CiaoDVD is a dataset crawled from the entire category of DVDs on the dvd.ciao.co.uk website in December 2013
- FilmTrust: FilmTrust is a small dataset crawled from the entire FilmTrust website in June 2011
-
Netflix: This is the official data set used in the Netflix Prize competition.
-
Netflix Data: A collection of movie and TV show details up to 2019; also a great one for practical exposure to a real-world application.
- Cornell University Movie-Review Data: Movie-review data for use in sentiment-analysis experiments
- Douban Dataset: This anonymized Douban dataset contains 129,490 unique users and 58,541 unique movie items. The total number of movie ratings is 16,830,839. The social friend network contains a total of 1,692,952 claimed social relationships.
-
Epinions Dataset: Epinions is a website where people can review products. Users can register for free and start writing subjective reviews about many different types of items (software, music, television shows, hardware, office appliances, …). A peculiar characteristic of Epinions is that users are paid according to how useful a review is found (Income Share program).
- Popular Movies from IMDb: A classic crowd-sourced movie information database for starting out, in which you need to predict which movie to recommend.
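Rating datasets like MovieLens, Netflix, and Douban are typically attacked with collaborative filtering. A minimal item-item cosine-similarity sketch on an invented toy rating matrix (the users, titles, and ratings below are made up for illustration):

```python
import math

# Tiny user -> {item: rating} matrix in the MovieLens spirit.
ratings = {
    "alice": {"Heat": 5, "Alien": 4, "Up": 1},
    "bob":   {"Heat": 4, "Alien": 5},
    "carol": {"Up": 5, "Coco": 4},
}

def item_vector(item):
    """Represent an item as the sparse vector of ratings it received."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity between two sparse vectors keyed by user."""
    shared = set(a) & set(b)
    num = sum(a[u] * b[u] for u in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

heat, alien, up = item_vector("Heat"), item_vector("Alien"), item_vector("Up")
# Items co-rated highly by the same users score higher than unrelated ones.
print(cosine(heat, alien), cosine(heat, up))
```

A recommender would then suggest, for each user, the unseen items most similar to what they rated highly.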
Games
- Steam Video Games: This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are ‘purchase’ and ‘play’. The value indicates the degree to which the behavior was performed - in the case of ‘purchase’ the value is always 1, and in the case of ‘play’ the value represents the number of hours the user has played the game.
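The Steam columns described above (user-id, game-title, behavior-name, value) make implicit-feedback aggregation straightforward. A sketch on invented rows in that layout:

```python
import csv
import io
from collections import defaultdict

# Invented rows imitating the described Steam columns:
# user-id, game-title, behavior-name, value
# ('purchase' value is always 1; 'play' value is hours played).
RAW = io.StringIO(
    "151603712,The Elder Scrolls V Skyrim,purchase,1\n"
    "151603712,The Elder Scrolls V Skyrim,play,273\n"
    "151603712,Fallout 4,purchase,1\n"
    "151603712,Fallout 4,play,87\n"
    "87907200,Fallout 4,play,12\n"
)

# Total hours played per title, a simple implicit-feedback popularity signal.
hours = defaultdict(float)
for _user, game, behavior, value in csv.reader(RAW):
    if behavior == "play":
        hours[game] += float(value)

print(sorted(hours.items(), key=lambda kv: -kv[1]))
```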
Jokes
- Jester: This Joke dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users
Food
- Chicago Entree: This dataset contains a record of user interactions with the Entree Chicago restaurant recommendation system.
Anime
- Anime Recommendations Database: This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.
Scholarly Paper
Healthcare
- Medicare.gov: This site provides direct access to the official data from the Centers for Medicare & Medicaid Services (CMS) that are used on the Medicare.gov Compare Websites and Directories. The goal of the site is to make these CMS data readily available in open, accessible, and machine-readable formats. Includes Hospital Compare DataSet, Nursing Home compare datasets and more.
Others
- Subreddit Recommender: This is one of my recent favourites, and with this dataset, you need to take into account each user’s comments in subreddits and then predict some new subreddits to recommend to them. If you’re sick of all the repetitive movie datasets, I would say to try this one for sure!
Anomaly Data
-
Numenta Anomaly Benchmark (NAB): Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.
-
KDD Cup 1999 Data: This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections (intrusions or attacks) and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. -
Outlier Detection DataSets (ODDS): a large collection of outlier detection datasets with ground truth (if available).
-
Credit Card Fraud Detection: The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
-
Outlier Detection Data Sets: Collected data sets for outlier detection (with a mirror), together with a study of the performance of many algorithms and parameter settings on these data sets (using ELKI).
-
Unsupervised Anomaly Detection Benchmark: These datasets can be used for benchmarking unsupervised anomaly detection algorithms (for example “Local Outlier Factor” LOF). The datasets have been obtained from multiple sources and are mainly based on datasets originally used for supervised machine learning. By publishing these modifications, a comparison of different algorithms is now possible for unsupervised anomaly detection.
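As a rough illustration of the unsupervised scoring these benchmarks evaluate, here is a simplified k-nearest-neighbour distance score on toy 2-D points. It is a much cruder relative of LOF (no reachability or local density ratios), shown only to convey the idea; the points are made up:

```python
import math

# Toy unsupervised outlier scoring: mean distance to the k nearest
# neighbours. Points far from every cluster get a high score.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]  # (8, 8) is isolated

def knn_score(p, data, k=3):
    # Distances to all other points, smallest first.
    dists = sorted(math.dist(p, q) for q in data if q is not p)
    return sum(dists[:k]) / k

scores = {p: knn_score(p, points) for p in points}
outlier = max(scores, key=scores.get)
print(outlier)  # (8, 8)
```

Real LOF additionally normalises each point's density by its neighbours' densities, so it also finds outliers relative to clusters of different densities.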
-
DDoS Evaluation Dataset (CICDDoS2019): Distributed Denial of Service (DDoS) attack is a menace to network security that aims at exhausting the target networks with malicious traffic. Although many statistical methods have been designed for DDoS attack detection, designing a real-time detector with low computational overhead is still one of the main concerns. On the other hand, the evaluation of new detection algorithms and techniques heavily relies on the existence of well-designed datasets.
-
Anomaly detection in 4G cellular networks: The dataset is split into training (~80%) and test (~20%) subsets provided as two separate CSV files. The training set, ML-MATT-CompetitionQT2021_train.csv, contains 36,904 samples, each having 13 features and a label. Note that there may be erroneous samples and outliers. The test set, ML-MATT-CompetitionQT2021_test.csv, contains 9,158 samples following the same structure as the training set but without the labels. You upload predictions for the test set, and Kaggle compares them with the ground-truth labels to compute a score.
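Because the test labels above are withheld until Kaggle scores an upload, any local evaluation needs a validation split carved out of the training file. A minimal sketch of that split; the rows here are synthetic stand-ins for the real 13-feature CSV rows, not actual data:

```python
import random

# Stand-in for the 36,904 labeled training rows: 13 features plus a 0/1 label.
random.seed(0)
rows = [([random.random() for _ in range(13)], random.randint(0, 1))
        for _ in range(36_904)]

# Shuffle, then hold out 20% for local validation.
random.shuffle(rows)
cut = int(0.8 * len(rows))
train, valid = rows[:cut], rows[cut:]
print(len(train), len(valid))
```

A model is then fit on `train`, tuned against `valid`, and only afterwards used to produce the test-set predictions for upload.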
Text Classification
-
BBC Datasets: Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format.
Format: Text (term-document matrices)
Default task: Classification, clustering. -
Movie Lens Dataset: The data set was collected over various periods of time, depending on the size of the set. Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
Format: text
Default task: Text classification, Regression, clustering. -
OPIN-RANK REVIEW Dataset: This dataset contains full reviews for cars and hotels collected from TripAdvisor (~259,000 reviews) and Edmunds (~42,230 reviews).
Format: text
Default task: classification, Sentiment analysis, clustering. -
Cyber-Trolls Dataset: Dataset used to classify tweets as aggressive or not, to help fight trolls. The dataset has 20001 items, all of which have been manually labeled into 2 categories: 1 (Cyber-Aggressive) and 0 (Non-Cyber-Aggressive). This is a human-labeled dataset.
Format: Text
Default Task: Text classification -
Chat Messages By Category Dataset: A text classification dataset with 8 classes such as Alcohol & Drugs, Profanity & Obscenity, Sex, Religion, etc. The dataset has 20001 items, of which 68 have been manually labeled.
Format: Text
Default Task: Text classification -
SPAMBASE Dataset: The Spambase data set includes 4601 observations corresponding to email messages, 1813 of which are spam. From the original email messages, 58 different attributes were computed. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or collect a very wide variety of non-spam.
Format: Text
Default task: Spam detection, classification -
Sentiment140 Dataset: Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter. Use cases: brand management (e.g. Windows 10), polling (e.g. Obama), planning a purchase (e.g. Kindle).
Format: Text
Default Task: Sentiment analysis -
Distress Classification Dataset: A text classification dataset of news headlines/articles labeled by whether they are distressed or not. The dataset has 1983 items, all of which have been manually labeled as distress or not-distress.
Format: Text
Default Task: Text classification -
Blog Authorship Dataset: The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. For each age group, there is an equal number of male and female bloggers.
Format: Text
Default Task: Sentiment analysis, summarization, classification -
Musk Dataset: This dataset describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. Because bonds can rotate, a single molecule can adopt many different shapes. This many-to-one relationship between feature vectors and molecules is called the “multiple instance problem”. When learning a classifier for this data, the classifier should classify a molecule as “musk” if ANY of its conformations is classified as a musk. A molecule should be classified as “non-musk” if NONE of its conformations is classified as a musk.
Format: Text
Default Task: Text Classification -
Commentary Dataset: Match commentary classified as humor, praise, stats, teasing, etc. The dataset has 1408 items, of which 1287 have been manually labeled into 23 categories such as injury, audience, feeling, communication, and teasing.
Format: Text
Default Task: Text Classification -
Emotion Classification Dataset: The dataset consists of text labeled with different sentiments. It has 269 items, all of which have been manually labeled into 7 categories: happy, sad, excited, angry, scared, tender, and others.
Format: Text
Default Task: Text Classification -
NSDUH Dataset: The National Survey on Drug Use and Health (NSDUH) series, formerly titled National Household Survey on Drug Abuse, is a major source of statistical information on the use of illicit drugs, alcohol, and tobacco and on mental health issues among members of the U.S. population. There are 55,268 instances in the dataset.
Format: Text
Default Task: Text classification, regression -
Zoo Dataset: A simple database containing 17 Boolean-valued attributes. Animals are classed into 7 categories and features are given for each.
Format: Text
Default Task: Text classification -
URL Dataset: This dataset supports constructing a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, it enables techniques that classify URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly to evolving URLs over time.
Format: Text
Default Task: Text classification
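The “any conformation” rule described for the Musk dataset above amounts to max-aggregation over per-instance scores: a bag (molecule) is positive if its highest-scoring instance (conformation) is. A minimal sketch; the molecule names and probabilities are made up, standing in for the output of some per-conformation classifier:

```python
# Multiple-instance rule from the Musk dataset: a molecule (bag) is
# "musk" if ANY of its conformations (instances) is classified musk,
# and "non-musk" only if NONE of them is.
bags = {
    "mol_a": [0.1, 0.2, 0.9],  # one musk-like conformation -> musk
    "mol_b": [0.2, 0.3, 0.4],  # no musk-like conformation  -> non-musk
}

def bag_label(instance_probs, threshold=0.5):
    # Max over instance scores implements the ANY/NONE rule.
    return "musk" if max(instance_probs) >= threshold else "non-musk"

labels = {name: bag_label(probs) for name, probs in bags.items()}
print(labels)  # {'mol_a': 'musk', 'mol_b': 'non-musk'}
```

This many-to-one mapping between feature vectors and labels is what the literature calls the multiple-instance problem.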
Machine Translation
Plant disease
-
CrowdAI’s PlantVillage Disease Classification Challenge: 38 classes of crop-disease pairs. To learn more about the background of the dataset, please refer to the following paper: http://arxiv.org/abs/1511.08060.
Format: Image
Default task: classification. -
Catalog.data.gov: 4 datasets.
Format: Various
Default task: various. -
Plant Image Analysis: Contains more than 28 datasets covering various plant species.
Format: images
Default task: various. -
Images of maize: This repository contains images of maize (corn) leaves that have been annotated to mark lesions caused by Northern Leaf Blight (NLB), a common and devastating disease of maize. In total, there are 18,222 images, all taken in the field, and 105,735 annotations by one of two human experts. This is the largest publicly available collection of classified images of any single plant disease.
Format: images
Default task: various.
Multivariate data
Financial
-
World Bank Open Data: Free and open access to global development data from the World Bank, covering hundreds of indicators (population, GDP, trade, poverty, and more) across countries and time.
Format: Various
Default task: Various. -
IMF Data: International Monetary Fund’s collection of open data for things like debt rates, commodity pricing, international markets, and foreign exchange reserves.
Format: Various
Default task: Various. -
Dow Jones Index: Weekly data of stocks from the first and second quarters of 2011.
Format: Comma separated values
Default task: Classification, regression, Time series. -
Statlog (Australian Credit Approval): Credit card applications either accepted or rejected and attributes about the application.
Format: Comma separated values
Default task: Classification. -
eBay auction data: Auction data from various eBay.com objects over various length auctions.
Format: Text
Default task: Regression, classification. -
Statlog (German Credit Data): Binary credit classification into “good” or “bad” with many features.
Format: Text
Default task: classification. Imbalanced classification. -
Bank Marketing Dataset: Data from a large marketing campaign carried out by a large bank.
Format: Text
Default task: classification. -
Istanbul Stock Exchange Dataset: Several stock indexes tracked for almost two years.
Format: Text
Default task: Classification, regression. -
Default of Credit Card Clients: Credit default data for Taiwanese creditors.
Format: Text
Default task: Classification, regression. -
Lending Club Loan Data: These files contain complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the “present” contains complete loan data for all loans issued through the previous completed calendar quarter.
Format: Text CSV
Default task: Classification, regression. -
Tesla dataset: A stock price dataset for all the Tesla fans, and for those who enjoy dabbling in the intricacies of the financial industry.
Format: Text CSV
Default task: Regression.
Demand and Sales forecasting
-
Grupo Bimbo Inventory Demand: Forecast the demand for a product for a given week at a particular store. The dataset consists of 9 weeks of sales transactions in Mexico. Every week, delivery trucks deliver products to the vendors. Each transaction consists of sales and returns; returns are the products that are unsold and expired. The demand for a product in a certain week is defined as this week’s sales minus next week’s returns.
Format: Zipped text
Default task: Demand forecasting. -
Online Product Sales: Predict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign.
Format: Zipped text
Default task: Sales/Demand forecasting. -
Historical Sales and Active Inventory: Used to determine which products to continue selling and which to remove from inventory. The file contains BOTH historical sales data AND active inventory, which can be discerned via the column titled “File Type”.
Format: Zipped text
Default task: Sales/Demand forecasting. -
Forecasts for Product Demand: The dataset contains historical product demand for a manufacturing company with footprints globally. The company provides thousands of products within dozens of product categories.
Format: Zipped text
Default task: Sales/Demand forecasting. -
Store Item Demand Forecasting Challenge: This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset. You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores. What’s the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost? This is a great competition to explore different models and improve your skills in forecasting.
Format: Zipped text
Default task: Sales/Demand forecasting.
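Grupo Bimbo’s target above (demand in week t = sales in week t minus returns in week t+1) is straightforward to compute once sales and returns are keyed by week. A sketch with made-up numbers; clamping at zero is an assumption here, on the grounds that demand cannot be negative:

```python
# Hypothetical per-week tallies for one product at one store.
sales   = {3: 120, 4: 150}  # units sold, by week
returns = {4: 20, 5: 10}    # unsold/expired units returned, by week

def demand(week):
    # Demand in week t = sales in week t minus returns in week t+1,
    # clamped at zero (assumption: demand is non-negative).
    return max(sales[week] - returns[week + 1], 0)

print(demand(3), demand(4))  # 100 140
```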
Other Multivariate
-
Home Theatre Info dataset: Containing movie metadata information on over 250,000 DVDs offered in North America.
Format: Zipped text
Default task: Association discovery. -
Abalone Data Set : Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.
Format: text
Default task: Classification, regression. -
Real estate valuation data set: The market historical data set of real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan. The “real estate valuation” task is a regression problem. The data set was randomly split into a training set (2/3 of samples) and a testing set (1/3 of samples).
Format: xls
Default task: regression. -
Residential Building Data Set: The data set includes construction costs, sale prices, project variables, and economic variables corresponding to single-family residential apartments in Tehran, Iran.
Format: xls
Default task: regression. -
One-class classifier Dataset: Some datasets are natively labeled normal/anomaly; others were modified from UCI datasets.
Format:
Default task: -
Example data sets for ELKI: We are collecting a few example data sets, along with descriptions, to try out ELKI. Many of the data sets are artificial test cases used in internal unit testing; they are not well suited for benchmarking due to various biases, but are mostly meant for use in teaching. Often they work near-perfectly for one algorithm while another fails badly, and so are used to explain the strengths and weaknesses of different approaches. They are not meant to even resemble real data.
Format: XML
Default task: Anomaly
Time Series
-
Hard Drive Failure Rates: Alternate: https://www.kaggle.com/backblaze/hard-drive-test-data
Each day, Backblaze takes a snapshot of each operational hard drive that includes basic hard drive information (e.g., capacity, failure) and S.M.A.R.T. statistics reported by each drive.
Format: Various
Default task: Time Series, Predictive Maintenance. -
Databanks International Cross National Time Series Data Archive (Paid): More than 200 years of annual data from 1815 onward, covering over 200 countries and 196 variables used by academia, government, finance, and media.
Format: Excel
Default task: Time Series. -
Big Dataset in Predictive Maintenance: The data set has around 2 million records with 172 columns simulated for 1900 machines collected over 4 years. Each machine includes a device which stores data such as warnings, problems and errors generated by the machine over time. Each record has a Device ID and time stamp for each day and aggregated features for that day such as total number of a certain type of warning received in a day. Four categorical columns were also included to demonstrate generic handling of categorical variables. The goal is to predict if a machine will fail in the next 7 days. The last column of the data set indicates if a failure occurred and reported on that day (https://github.com/Azure/PySpark-Predictive-Maintenance).
Format: csv
Default task: Predictive Maintenance. -
National Aeronautics and Space Administration: The Prognostics Data Repository is a collection of data sets that have been donated by various universities, agencies, or companies. The data repository focuses exclusively on prognostic data sets, i.e., data sets that can be used for development of prognostic algorithms. Mostly these are time series of data from some nominal state to a failed state. The collection of data in this repository is an ongoing process.
Format: various
Default task: Predictive Maintenance. -
ECG5000: https://archive.physionet.org/cgi-bin/atm/ATM The original dataset for “ECG5000” is a 20-hour long ECG downloaded from Physionet: the BIDMC Congestive Heart Failure Database (chfdb), record “chf07”. It was originally published in “Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23)”. The data was pre-processed in two steps: (1) extract each heartbeat, (2) make each heartbeat equal length using interpolation. This dataset was originally used in the paper “A general framework for never-ending learning from time series streams”, DAMI 29(6). After that, 5,000 heartbeats were randomly selected. The patient has severe congestive heart failure, and the class values were obtained by automated annotation.
Format: Time series
Default task: anomaly detection -
UEA & UCR Time Series Classification Repository: This website is an ongoing project to develop a comprehensive repository for research into time series classification. If you use the results or code, please cite the paper “Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large and Eamonn Keogh, The Great Time Series Classification Bake Off: a Review and Experimental Evaluation of Recent Algorithmic Advances, Data Mining and Knowledge Discovery, 31(3), 2017”. Paper Link, Bibtex Link. We are in the process of updating all the results for the new dataset.
Format: various
Default task: various -
NASA Acoustics and Vibration Database: —.
Format: various
Default task: various -
E-Commerce Sales: For predicting sales/transactions for a store; the classic time series forecasting job.
-
Minimum Daily Temperatures: This dataset describes the minimum daily temperatures over 10 years in the city of Melbourne, Australia.
-
Microsoft Stock: Another stock dataset to experiment with; this one asks you to predict Microsoft’s stock prices based on five to six years of historical data.
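A common first baseline for series like the daily temperatures above, and for the seasonality questions raised by the store-item challenge, is the seasonal-naive forecast: predict whatever was observed exactly one season earlier. A sketch on a made-up series with weekly seasonality:

```python
# Seasonal-naive baseline: forecast[t] = series[t - period].
# Toy two-week series with a period of 7 (weekly seasonality).
series = [10, 12, 11, 13, 12, 9, 8,
          11, 13, 12, 14, 13, 10, 9]
period = 7

# Forecast the second week from the first, then score with MAE.
forecast = [series[i - period] for i in range(period, len(series))]
errors = [abs(f, ) if False else abs(f - y) for f, y in zip(forecast, series[period:])]
mae = sum(errors) / len(errors)
print(round(mae, 2))  # 1.0
```

Any model worth keeping (ARIMA, gradient boosting, deep learning) should at least beat this baseline on the held-out period.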
Graphs
- Open Graph Benchmark:
The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader, which is fully compatible with PyTorch Geometric and DGL. Model performance can be evaluated using the OGB Evaluator in a unified manner.
Format:
Default task:
Document Analysis
-
Marmot Dataset: In total, 2000 pages in PDF format were collected and the corresponding ground-truths were extracted utilizing our semi-automatic ground-truthing tool “Marmot”. The dataset is composed of Chinese and English pages at the proportion of about 1:1.
Format:
Default task: -
DL-Hard: Deep Learning Hard (DL-HARD) is a new annotated dataset building upon standard deep learning benchmark evaluation datasets. It builds on TREC Deep Learning (DL) questions extensively annotated with query intent categories, answer types, wikified entities, topic categories, and result type metadata from a leading web search engine. Based on this data, we introduce a framework for identifying challenging questions. DL-HARD contains 50 queries from the official 2019/2020 evaluation benchmark, half of which are newly and independently assessed. We perform experiments using the official submitted runs to DL on DL-HARD and find substantial differences in metrics and the ranking of participating systems. Overall, DL-HARD is a new resource that promotes research on neural ranking methods by focusing on challenging and complex queries.
Format:
Default task:
General Classifications
-
Campus Recruitment: Determine if a student gets placed in a company based on various features like their education, grades, and so on.
Format:
Default task: -
Australian Fatal Road Accident 1989–2021: This is a fairly new dataset; you need to classify the crash type from the various features available about the crash, such as the time and day of the crash, the speed of the vehicle, etc.
Format:
Default task: -
Heart Disease UCI: To predict the presence of heart disease in a patient based on a set of 76 different physiological attributes of an individual.
Format:
Default task:
- CelebFaces Attributes (CelebA) Dataset:
A popular dataset of over 200k celebrity images for applying computer vision concepts such as facial recognition.
Format:
Default task:
General Regression
- Red Wine Quality:
A dataset to predict the quality of wines using attributes such as fixed acidity, chlorides, citrus content, and so on. This is a fun dataset I’d recommend experimenting with if you’re already familiar with a bit of regression and have practised on dataset 1 above.
Format:
Default task:
eSports Datasets
- CS:GO Competitive Matchmaking Data: Video games are a rich area for data extraction due to their digital nature. Notable examples such as the complex EVE Online economy, the World of Warcraft corrupted-blood incident, and even Grand Theft Auto self-driving cars tell us that fiction is closer to reality than we think. Data scientists can gain insight into the logic and decision-making that players face when put in hypothetical and virtual scenarios.
In this Kaggle dataset, I provide just over 1400 competitive matchmaking matches from Valve’s game Counter-Strike: Global Offensive (CS:GO). The data was extracted from competitive matchmaking replays submitted to csgo-stats. I intend for this dataset to be purely exploratory; however, users are free to create whatever predictive models they see fit.
Format:
Default task:
-
FIFA 2021 Complete Player Dataset: The data set contains players’ ratings, ages, nationalities, the positions they play, and their potential for in-game growth. The data for a few players and their clubs might not be very accurate, as the transfer window is still open and changes might be made at later stages.
Format:
Default task: -
the OpenDota API: About a year and a half ago, we exported our first “data dump” of all the parsed data we’d collected since OpenDota started operating, which consisted of over 3.5 million matches ranging from January 2015 to December 2015.
After a series of adventures, we’re happy to announce that we’re finally ready to release a second data dump of over a billion matches, this time with information ranging from March 2011 to March 2016!
Format:
Default task:
Synthetic Datasets
- SyntheticFur Dataset: Collecting and generating high quality fur images is an expensive and difficult process that requires content specialists to generate. By releasing this unique dataset with high quality lighting simulation via ray tracing, this can save time for researchers seeking to advance studies of fur rendering and simulation, without having to recreate this laborious process.
The dataset was used for neural rendering research at Google that takes advantage of rasterized image buffers and converts them into high quality raytraced fur renders. We believe that this dataset can contribute to the computer graphics and machine learning community to develop more advanced techniques with fur rendering.
Format:
Default task: