|
Mushroom dataset
Task: classification
Number of instances: 8124
Number of attributes: 21 (categorical)
Type of attribute to be predicted: categorical with 2 classes
Download the data: Mushroom Dataset
The following description is drawn from the UCI Machine Learning Repository: this data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
Sources: The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf.
Model with 1 variable
The simplest model uses a single explanatory variable, odor:
* If (odor is a) then (Class is rather edible)
* If (odor is not n) then (Class is rather poisonous)
* If (odor is l) then (Class is rather edible)
This model correctly classifies 8004 of the 8124 instances in the sample (98.5%).
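As a minimal sketch, the one-variable model can be read as a crisp rule: odors a, l and n point to edible, and any other odor points to poisonous. This reading is an assumption, because the rules are weighted ("rather") and the way overlapping rules are combined is not spelled out here. The function names, the dict-based record layout and the class codes e and p are also illustrative choices.

```python
# Assumed crisp reading of the one-variable model (not the original tool's code).
EDIBLE_ODORS = {"a", "l", "n"}  # odor codes taken from the rules


def classify_by_odor(record):
    """Return 'e' (edible) or 'p' (poisonous) from the odor code alone."""
    return "e" if record["odor"] in EDIBLE_ODORS else "p"


def accuracy(records, classifier, label_key="class"):
    """Fraction of records whose predicted class equals the stored label."""
    records = list(records)
    hits = sum(1 for r in records if classifier(r) == r[label_key])
    return hits / len(records)


# Hypothetical records for illustration only, not rows from the dataset.
sample = [
    {"class": "e", "odor": "a"},
    {"class": "e", "odor": "n"},
    {"class": "p", "odor": "f"},
    {"class": "p", "odor": "n"},  # a poisonous mushroom with odor n is missed
]

print([classify_by_odor(r) for r in sample])
print(accuracy(sample, classify_by_odor))
```

Running `accuracy` over the full data file with a classifier like this is one way to check a reported count, provided each row is first converted to a dict keyed by attribute name.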
Model with 2 variables
Accuracy improves when a second variable, spore-print-color, is added to the model:
* If (odor is not l) then (Class is rather poisonous)
* If (odor is a) then (Class is rather edible)
* If (odor is n) and (spore-print-color is not r) then (Class is rather edible)
This model correctly classifies 8076 of the 8124 instances in the sample (99.4%).
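The two-variable model could be sketched as a crisp decision in the following way. The sketch assumes that odors a and l mean edible, that odor n means edible unless spore-print-color is r, and that everything else is poisonous. How the weighted ("rather") rules are actually combined is not stated, and the function name and record layout are hypothetical.

```python
def classify_two_variables(record):
    """Assumed crisp reading of the two-variable model; returns 'e' or 'p'."""
    odor = record["odor"]
    if odor in ("a", "l"):
        return "e"
    # Odor n counts as edible only when the spore print color is not r.
    if odor == "n" and record["spore-print-color"] != "r":
        return "e"
    return "p"


# Hypothetical records: same odor, different spore print color.
print(classify_two_variables({"odor": "n", "spore-print-color": "w"}))
print(classify_two_variables({"odor": "n", "spore-print-color": "r"}))
```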
Model with 3 variables
The third variable that improves the accuracy of the model is stalk-surface-below-ring. The rules closely resemble those of the 2-variable model; only the third rule gains a condition.
* If (odor is not l) then (Class is rather poisonous)
* If (odor is a) then (Class is rather edible)
* If (odor is n) and (stalk-surface-below-ring is not y) and (spore-print-color is not r) then (Class is rather edible)
This model correctly classifies 8100 of the 8124 instances in the sample (99.7%).
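A crisp sketch of the three-variable model shows where the extra condition bites: a record with odor n is no longer treated as edible when stalk-surface-below-ring is y. As before, turning the weighted ("rather") rules into a hard decision is an assumption, and the function name and record layout are hypothetical.

```python
def classify_three_variables(record):
    """Assumed crisp reading of the three-variable model; returns 'e' or 'p'."""
    odor = record["odor"]
    if odor in ("a", "l"):
        return "e"
    if (odor == "n"
            and record["stalk-surface-below-ring"] != "y"
            and record["spore-print-color"] != "r"):
        return "e"
    return "p"


# Hypothetical records that differ only in stalk-surface-below-ring.
smooth = {"odor": "n", "stalk-surface-below-ring": "s", "spore-print-color": "w"}
scaly = {"odor": "n", "stalk-surface-below-ring": "y", "spore-print-color": "w"}
print(classify_three_variables(smooth))
print(classify_three_variables(scaly))
```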
Model with a full classification (5 variables)
Classifying all 8124 instances of the dataset correctly (100%) requires 5 variables:
* If (bruises is not f) and (odor is not l) and (gill-size is n) then (Class is rather poisonous)
* If (odor is a) then (Class is rather edible)
* If (odor is n) and (stalk-surface-below-ring is not y) and (spore-print-color is not r) then (Class is rather edible)
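Several of these rules can hold for the same mushroom, and the way their "rather" conclusions are weighed against each other is not described here. The sketch below therefore does not produce a final class. It only encodes the three conditions as written and reports which conclusions fire for a record. The `RULES` list, the `fired_conclusions` helper and the dict-based record layout are hypothetical names chosen for illustration.

```python
# Each rule pairs a condition on a record with the conclusion it supports.
# The conditions follow the three rules as written; the data layout is assumed.
RULES = [
    (lambda r: r["bruises"] != "f" and r["odor"] != "l" and r["gill-size"] == "n",
     "poisonous"),
    (lambda r: r["odor"] == "a",
     "edible"),
    (lambda r: (r["odor"] == "n"
                and r["stalk-surface-below-ring"] != "y"
                and r["spore-print-color"] != "r"),
     "edible"),
]


def fired_conclusions(record):
    """List the conclusions of all rules whose conditions hold for the record."""
    return [conclusion for condition, conclusion in RULES if condition(record)]


# Hypothetical record for illustration only, not a row from the dataset.
record = {
    "bruises": "t",
    "odor": "n",
    "gill-size": "n",
    "stalk-surface-below-ring": "s",
    "spore-print-color": "w",
}
print(fired_conclusions(record))
```

For this made-up record both the first and the third rule fire, which is exactly the case where a combination scheme for the weighted conclusions is needed.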
|