Data Used in Machine Learning Research

Artificial Data

Real-World Data

The datasets below were obtained from the UCI Machine Learning Repository or StatLog. Each was converted to a standard format (attributes followed by outputs), and incomplete examples were removed. When a dataset comes with both training and test examples, they are combined in one file, with the training examples followed by the test examples. Note that the attributes are not scaled. Users who scale the attributes should do so after the dataset has been split into training and test parts.
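The scaling advice above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a description of the actual files: it assumes a single output column after the attributes, uses a tiny made-up array in place of a real dataset, and picks standardization (zero mean, unit variance) as one possible scaling. The function names are hypothetical. The scaling parameters are computed from the training part only and then reused on the test part.

```python
import numpy as np

def split_attributes_output(data):
    """Split rows laid out as attributes followed by one output column (assumed)."""
    data = np.asarray(data, dtype=float)
    return data[:, :-1], data[:, -1]

def fit_scaler(X_train):
    """Compute per-attribute mean and standard deviation on the training part."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant attributes unscaled instead of dividing by zero
    return mean, std

def apply_scaler(X, mean, std):
    """Scale any part of the data with parameters fitted on the training part."""
    return (X - mean) / std

# Made-up stand-in data: two attributes followed by a binary output.
data = np.array([
    [1.0, 10.0, 0],
    [2.0, 20.0, 1],
    [3.0, 30.0, 0],
    [4.0, 40.0, 1],
    [5.0, 50.0, 1],
])

X, y = split_attributes_output(data)

# Split first: here the first four rows are training examples, the last is a test example.
X_train, X_test = X[:4], X[4:]
y_train, y_test = y[:4], y[4:]

# Scale second, using training statistics for both parts.
mean, std = fit_scaler(X_train)
X_train_s = apply_scaler(X_train, mean, std)
X_test_s = apply_scaler(X_test, mean, std)
```

Fitting the scaler before the split would let test examples influence the scaling parameters, which is why the split comes first in this sketch.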

Binary Datasets

Dataset        D   N     Orig.    Comments
australian*    14  690   StatLog  see also UCI
breast-cancer  9   683   UCI      aka wisconsin, ID# removed, 16 incomplete examples removed
cleveland*     13  297   UCI      converted from the 5-class dataset, 6 incomplete examples removed
diabetes       8   768   UCI      aka pima-indians
german         24  1000  StatLog  see also UCI
heart*         13  270   StatLog  see also UCI, a subset of cleveland
ionosphere     34  351   UCI
sonar          60  208   UCI
votes84        16  435   UCI
wdbc           30  569   UCI      ID# removed
D: # of attributes; N: # of training examples

* These datasets contain some categorical attributes.

Multiclass Datasets

Dataset    K   D    N      NT     Orig.    Comments
dna        3   180  2000   1186   StatLog  i.e., UCI/splice with 4 ambiguous examples removed
glass      6   9    214    -      UCI      class 4 is absent
iris       3   4    150    -      UCI
letter     26  16   16000  4000   UCI      StatLog has different order and training/test split
pendigits  10  16   7494   3498   UCI
satimage   6   36   4435   2000   StatLog  see also UCI, class 6 is absent
segment    7   18   2310   -      StatLog  see also UCI
shuttle    7   9    43500  14500  StatLog  see also UCI
vehicle    4   18   846    -      StatLog  see also UCI
vowel      11  10   528    462    UCI
wine       3   13   178    -      UCI
K: # of classes; D: # of attributes; N: # of training examples; NT: # of test examples
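
Because a combined file lists the training examples first, the N and NT values are enough to recover the original split. The following is a minimal sketch in Python with NumPy. The helper name `split_train_test` is hypothetical, a single output column is assumed, and the array is a random stand-in with the dimensions listed for dna (N = 2000, NT = 1186, D = 180), not the real data.

```python
import numpy as np

def split_train_test(data, n_train, n_test):
    """Return (train, test) from rows ordered as training examples, then test examples."""
    data = np.asarray(data)
    if data.shape[0] != n_train + n_test:
        raise ValueError(
            f"expected {n_train + n_test} rows, found {data.shape[0]}"
        )
    return data[:n_train], data[n_train:]

# Random stand-in with the listed dna dimensions: 180 attributes plus one
# assumed output column, 2000 training rows followed by 1186 test rows.
rng = np.random.default_rng(0)
dna_like = rng.integers(0, 2, size=(2000 + 1186, 180 + 1))

train, test = split_train_test(dna_like, n_train=2000, n_test=1186)
```

The row-count check guards against applying the N and NT of one dataset to the file of another.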