SciKit Learn Decision Tree Classifier

DecisionTreeClassifier is a scikit-learn class that performs multiclass classification on a dataset. It takes two arrays as input: an array X, dense or sparse, of shape [n_samples, n_features] holding the training samples, and an array Y of integer values, of shape [n_samples], holding the class labels of the training samples. It handles both binary and multiclass classification. The SciKit Learn Decision Tree Classifier is used in the following way.

  • The Decision Tree Classifier model is fitted to the training data and can then predict the class of samples.
  • Alternatively, the probability of each class can be predicted; it is the fraction of training samples of that class in a leaf.
  • A tree can be constructed using, for example, the Iris dataset.
  • Once the classifier is trained, the tree can be plotted with the plot_tree function.
  • The tree can also be exported in Graphviz format using the export_graphviz exporter.
  • If the user is employing the conda package manager, the Graphviz binaries and the Python package can be installed with conda install python-graphviz.
  • Alternatively, the Graphviz binaries can be downloaded from the Graphviz home page.
  • In that case, the Python wrapper is installed from PyPI with pip install graphviz.
  • The export_graphviz exporter supports a variety of aesthetic options, including coloring nodes by their class and using explicit variable and class names if desired.
  • Alternatively, the tree can also be exported in text format with the export_text function. This method is more compact and does not require installing external libraries.
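The steps above can be sketched as follows. This is a minimal example, assuming scikit-learn is installed; the variable names (clf, predicted, probabilities, report) and the choice of random_state=0 are illustrative assumptions, not part of the original text. Plotting with plot_tree is left out here because it additionally requires matplotlib.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the Iris dataset: X has shape [n_samples, n_features],
# y has shape [n_samples] and holds integer class labels.
iris = load_iris()
X, y = iris.data, iris.target

# Fit the classifier. random_state is fixed only so that repeated
# runs build the same tree (an illustrative choice).
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the class of the first sample.
predicted = clf.predict(X[:1])

# Predict class probabilities: the fraction of training samples
# of each class in the leaf that the sample falls into.
probabilities = clf.predict_proba(X[:1])

# Export the fitted tree as plain text; no external libraries needed.
report = export_text(clf, feature_names=list(iris.feature_names))

print(predicted)
print(probabilities)
print(report)
```

Because export_text returns an ordinary string, the result can be printed, logged, or written to a file without Graphviz being installed.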

The nodes of the tree partition the data according to one of two measures: Entropy and Gini Impurity.

Entropy: When entropy is the measure, a node partitions the data on the feature that offers the maximum information gain. Information gain indicates how informative a given attribute of the feature vectors is. It is calculated as

Information Gain = entropy(parent) - [weighted average entropy(children)], where entropy is a common measure of the impurity of the target classes in a node.
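The calculation can be illustrated with a short plain-Python sketch. The helper names entropy and information_gain are hypothetical and are not scikit-learn functions; the sketch assumes base-2 logarithms and weights each child by its share of the parent's samples.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# A parent node with two evenly mixed classes, split into two pure children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 0], [1, 1, 1, 1]

gain = information_gain(parent, [left, right])
print(gain)
```

In this example the parent is evenly mixed and both children are pure, so the split removes all of the impurity and the gain equals the parent's entropy of 1 bit.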

Gini Impurity: Gini impurity is another measure, which is computationally faster because it does not require logarithmic functions. It serves as a criterion for minimizing the probability of misclassification.
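As a rough illustration, Gini impurity can be computed as one minus the sum of the squared class proportions, which involves no logarithms. The function name gini_impurity below is hypothetical and is not part of scikit-learn.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# A pure node has no impurity; an evenly mixed two-class node has the most.
pure = gini_impurity([1, 1, 1, 1])
mixed = gini_impurity([0, 0, 1, 1])

print(pure, mixed)
```

In scikit-learn itself, the measure is selected through the criterion parameter of DecisionTreeClassifier, which accepts "gini" or "entropy".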

In practice, the choice between the two measures rarely makes much of a difference. Thus we can see that Python's popular SciKit Learn library supports decision trees for both regression and classification tasks. The algorithm itself is fairly simple, and implementing decision tree classification is even easier with SciKit Learn.