Discriminative Classification

Decision Trees and Random Forests




In this notebook, we implement two discriminative classification models, Decision Tree and Random Forest, using Python and the scikit-learn library. We start with decision trees, since a random forest is essentially an ensemble of multiple decision trees and thus builds on them.

A decision tree can be seen as a graph structure in which each node corresponds to one decision, except for the leaf nodes, which correspond to a class (or a class distribution). We illustrate this with the following example. Assume we have a set of red and blue marbles and we want to build a machine that sorts these marbles by colour. The machine has one marble input and two outputs (one per class / colour). We further introduce a decision node that checks the colour of each marble and forwards it to the left or the right, depending on the outcome of the check. Our sorting machine is illustrated in the next figure:

simpletree.png

For simplicity, we switch to a lighter way of illustrating decision trees. Depicted as a graph, our decision tree looks like this:

tree1.png

Note that we have two types of nodes: decision nodes and leaf nodes (shaded). Leaf nodes indicate that any marble reaching that position is assigned the stored class. To make the task a little more complex, we add a third class: green marbles (G). In order to predict this class, we modify the tree:

tree2.png

The tree now has a maximum depth of two, since two checks are required to identify a green or blue marble. Note that the ordering and type of the checks are not unique: first checking whether a marble is green and, if not, checking for red would also be valid.

In general, a decision tree consists of an arbitrary number of decisions, classifying objects with arbitrary attributes. In the above example we only checked for a certain colour, but in general any operation with a boolean outcome (True or False) could serve as a decision node.
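To make this concrete, the three-class marble tree above can be written as plain nested conditionals. The following sketch assumes marbles are represented by their colour name as a string:

def classify_marble(colour):
    # First decision node: is the marble red?
    if colour == 'red':
        return 'R'
    else:
        # Second decision node: is the marble green?
        if colour == 'green':
            return 'G'
        else:
            return 'B'   # leaf: only blue marbles remain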

The training procedure is not covered in detail here, but in general a set of training data is used to determine decision criteria that separate the classes well. Usually, multiple random criteria are evaluated for each node and the one that yields the highest information gain is kept.

Preparation

In order to implement the methods, we import the required Python modules:

import numpy as np                                   # Used for numerical computations
import matplotlib.pyplot as plt                      # Plotting library  
from sklearn.ensemble import RandomForestClassifier

# This is to set the size of the plots in the notebook
plt.rcParams['figure.figsize'] = [6, 6]    

Creating a Toy Dataset

Next, we create a toy dataset. It contains samples drawn from 3 normal distributions, where each distribution represents one class.

num_c = 3 # Number of clusters
colours = [[255, 170, 170], [170, 255, 170], [170, 170, 255]]

# Generate the samples (3 clusters); we set a fixed seed to make the results reproducible
np.random.seed(0)
c1_samples = np.clip([(20, 20) + np.random.randn(2)*15 for i in range(100)], 0, 100)
c2_samples = np.clip([(70, 30) + np.random.randn(2)*15 for i in range(100)], 0, 100)
c3_samples = np.clip([(30, 70) + np.random.randn(2)*15 for i in range(100)], 0, 100)

# Plot the samples, colored by class
plt.scatter(*zip(*c1_samples), c='red', marker='o', label='Samples of class 1')
plt.scatter(*zip(*c2_samples), c='green', marker='o', label='Samples of class 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='o', label='Samples of class 3')
plt.legend()
plt.show()

[Figure: scatter plot of the three sample classes]

We stack all data samples into one matrix and append the class index as an additional column. Each row of the resulting matrix contains the x and y coordinates and the class index of one sample.

c1_array = np.hstack((np.array(c1_samples), 1 * np.ones((len(c1_samples), 1))))
c2_array = np.hstack((np.array(c2_samples), 2 * np.ones((len(c2_samples), 1))))
c3_array = np.hstack((np.array(c3_samples), 3 * np.ones((len(c3_samples), 1))))

all_samples = np.vstack((c1_array, c2_array, c3_array))
print('Shape of stacked sample matrix:', all_samples.shape)
Shape of stacked sample matrix: (300, 3)

Decision Tree

Now that we have a data set, we continue with the implementation of a decision tree in Python. We implement everything from scratch, using only numpy to work with numerical arrays. We start by modelling the decision criteria. For simplicity, each decision compares either the x or the y coordinate to a threshold, e.g. x > 50 or y < 25. We can therefore model any criterion as a tuple of three attributes:

  • greater: boolean flag indicating the operator (True means >, False means <)
  • threshold: the threshold value to compare with
  • feature: the feature index (in this case either 0 for x or 1 for y)

Using this model, we implement the following two functions. The first creates a random criterion (used during training), the second evaluates a criterion on a feature vector and returns the boolean result. Note that for simplicity the range of the random threshold is already adjusted to the toy data set.

def get_random_criterion():
    greater = np.random.random() > 0.5
    threshold = np.random.random() * 100   # here we use knowledge about the data set!
    feature = np.random.randint(0, 2)      # here we use knowledge about the feature dimensionality!
    return [greater, threshold, feature]

def eval_criterion(criterion, vector):
    g, t, f = criterion
    return (vector[f] < t) ^ g   # XOR with g inverts the '<' comparison when the operator is '>'

We further need two helper functions: one computes the (negative) entropy of a data set, the other the information gain of a criterion with respect to a set of samples. The information gain is defined as

$\Delta E = \dfrac{N_1}{N_1+N_2}\cdot \sum\limits_k P_1(L^k)\cdot \log_2 [P_1(L^k)] + \dfrac{N_2}{N_1+N_2}\cdot \sum\limits_k P_2(L^k)\cdot \log_2 [P_2(L^k)]$

where the indices 1 and 2 denote the two subsets produced by the criterion, $N_1$ and $N_2$ are their sizes and $P_i(L^k)$ is the relative frequency of class $L^k$ in subset $i$. Each sum is the negative entropy of the corresponding subset histogram. The (constant) entropy of the parent set is omitted here, since it does not affect which criterion maximizes the gain.

def get_entropy(data):
    # Returns the *negative* entropy of the class histogram,
    # so larger values (closer to zero) indicate purer sets
    class_counts = np.bincount(data[:,2].astype(int))
    P = class_counts / np.sum(class_counts)
    P = P[P > 0]
    return np.sum(P * np.log2(P))

def get_inf_gain(criterion, data):
    outcomes = np.array([eval_criterion(criterion, vector) for vector in data])
    set1 = data[outcomes]
    set2 = data[np.logical_not(outcomes)]
    N1, N2 = len(set1), len(set2)
    if N1 == 0 or N2 == 0:   # degenerate split: all samples fall on one side
        return -np.inf
    return N1/(N1+N2)*get_entropy(set1) + N2/(N1+N2)*get_entropy(set2)
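As a quick, illustrative check, we can score a hand-picked criterion: x > 50 should separate class 2 (centred at x = 70) from the other two classes reasonably well.

# Hand-picked criterion "x > 50": greater=True, threshold=50, feature=0 (x)
crit = [True, 50.0, 0]
print(eval_criterion(crit, np.array([70.0, 30.0])))   # True, since 70 > 50
print(get_inf_gain(crit, all_samples))                # closer to 0 means purer subsets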

Next, we implement the nodes.

class Node():
    n_id = 1
    def __init__(self, leaf):
        self.is_leaf = leaf
        self.criterion = None
        self.successor_yes = None
        self.successor_no = None
        self.label = None
        self.id = Node.n_id
        Node.n_id += 1
        self.is_trained = False
        self.train_samples = []
        
    def set_successor(self, node, yes_branch):
        if yes_branch: self.successor_yes = node
        else: self.successor_no = node
        
    def predict(self, vector):
        if eval_criterion(self.criterion, vector):
            return self.successor_yes
        else:
            return self.successor_no
    
    def train(self):
        # Become a leaf if scheduled as one, or if all samples share the same class
        if self.is_leaf or all(self.train_samples[:,2]==self.train_samples[0,2]):
            self.is_leaf = True
            self.label = np.argmax(np.bincount(self.train_samples[:,2].astype(int)))
            self.truncate()
        else:
            # Evaluate 1000 random criteria and keep the one with the highest gain
            criteria = [get_random_criterion() for i in range(1000)]
            scores = np.array([get_inf_gain(criterion, self.train_samples) for criterion in criteria])
            self.criterion = criteria[np.argmax(scores)]
        self.is_trained = True
        
    def truncate(self):
        # Mark all successors as trained; they will never receive samples
        if self.successor_yes:
            self.successor_yes.truncate()
        if self.successor_no:
            self.successor_no.truncate()
        self.is_trained = True
        
    def __repr__(self):
        if self.is_leaf:
            return('N{} (Leaf-Node)'.format(self.id))
        else:
            return('N{} (yes -> N{} / no -> N{})'.format(self.id,self.successor_yes.id,self.successor_no.id))

Finally, we combine everything in the class DecisionTree.

class DecisionTree():
    def __init__(self, depth=3):
        # Build a complete binary tree; the nodes of the last layer are leaf nodes
        self.root_node = Node(leaf=False)
        self.all_nodes = [self.root_node]
        self.layers = [[self.root_node]]
        for nl in range(1, depth):
            layer = []
            num_nodes = 2**nl
            last_layer = (nl == (depth-1))
            for n in range(num_nodes):
                node = Node(last_layer)
                # Attach to the parent: even n as 'yes' branch, odd n as 'no' branch
                self.layers[nl-1][n//2].set_successor(node, not n%2)
                layer.append(node)
                self.all_nodes.append(node)
            self.layers.append(layer)
        
    def predict(self, vector):
        node = self.root_node
        while not node.is_leaf:
            node = node.predict(vector)
        return int(node.label)
    
    def train(self, data, verbose=False):
        while not all([node.is_trained for node in self.all_nodes]):
            # Forward training samples
            for node in self.all_nodes:
                node.train_samples = []
            for vector in data:
                node = self.root_node
                while True:
                    if node.is_trained:
                        if node.is_leaf: break
                        node = node.predict(vector)
                    else:
                        node.train_samples.append(vector)
                        break
            # Train all nodes that are not trained and have samples
            for node in self.all_nodes:
                if (not node.is_trained) and len(node.train_samples):
                    node.train_samples = np.array(node.train_samples)
                    if verbose: print('training node', node)
                    node.train()
DT = DecisionTree()
for i, layer in enumerate(DT.layers):
    print('Layer {}: {}'.format(i, layer))
Layer 0: [N1 (yes -> N2 / no -> N3)]
Layer 1: [N2 (yes -> N4 / no -> N5), N3 (yes -> N6 / no -> N7)]
Layer 2: [N4 (Leaf-Node), N5 (Leaf-Node), N6 (Leaf-Node), N7 (Leaf-Node)]
DT.train(all_samples, verbose=True)
print('training done')
training node N1 (yes -> N2 / no -> N3)
training node N2 (yes -> N4 / no -> N5)
training node N3 (yes -> N6 / no -> N7)
training node N4 (Leaf-Node)
training node N5 (Leaf-Node)
training node N6 (Leaf-Node)
training node N7 (Leaf-Node)
training done
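
As a quick sanity check, we can predict the three cluster centres used to generate the data; they should come out as classes 1, 2 and 3, respectively.

# Sanity check: classify the three cluster centres used for data generation
for centre in [(20, 20), (70, 30), (30, 70)]:
    print(centre, '-> class', DT.predict(np.array(centre, dtype=float)))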

Evaluation

To visualize the decision boundaries, we predict the class of every point in the feature space.

def predict_feature_space_dt(model):
    label_map = np.zeros((100, 100, 3), dtype=np.ubyte)
    for x in range(100):
        for y in range(100):
            label = model.predict(np.array((x, y), dtype=float))
            label_map[y, x] = colours[label-1]
    return label_map
label_map_dt = predict_feature_space_dt(DT)

plt.imshow(label_map_dt)
plt.scatter(*zip(*c1_samples), c='red', marker='o', label='Samples of class 1')
plt.scatter(*zip(*c2_samples), c='green', marker='o', label='Samples of class 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='o', label='Samples of class 3')
plt.title('DT (3 layer)'); plt.legend(); plt.show()

[Figure: decision boundaries of the depth-3 tree with the training samples]

To understand the influence of the depth, we train a second tree with a depth of 7.

DT7 = DecisionTree(depth=7)
DT7.train(all_samples)

And plot the decision boundaries.

label_map_dt7 = predict_feature_space_dt(DT7)

plt.imshow(label_map_dt7)
plt.scatter(*zip(*c1_samples), c='red', marker='o', label='Samples of class 1')
plt.scatter(*zip(*c2_samples), c='green', marker='o', label='Samples of class 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='o', label='Samples of class 3')
plt.title('DT (7 layer)'); plt.legend(); plt.show()

[Figure: decision boundaries of the depth-7 tree with the training samples]

We can observe that, as the number of layers increases, the tree starts to add thin regions that correspond to outliers in the training data. In other words, the model overfits the training data. The depth is probably the most important hyperparameter of a decision tree.
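One way to quantify this effect (a sketch, evaluated on the training data only) is to compare the training accuracy of both trees; the deeper tree fits the training samples, including the outliers, more closely:

# Compare training-set accuracy of the shallow and the deep tree
for name, tree in [('depth 3', DT), ('depth 7', DT7)]:
    predictions = np.array([tree.predict(v[:2]) for v in all_samples])
    accuracy = np.mean(predictions == all_samples[:, 2])
    print('DT ({}): training accuracy = {:.3f}'.format(name, accuracy))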

Random Forest

Since decision trees are prone to overfitting, as seen in the previous example, the idea of a random forest is to train multiple trees and use them jointly as an ensemble of classifiers, creating a more robust classification model. To train a random forest, each tree is usually trained on a different subset of the available training data, resulting in multiple different trees. A common way to classify a new data sample is to predict its class with each tree separately and then assign the class with the most votes.
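Before switching to scikit-learn, here is a minimal sketch of this idea built on our own DecisionTree class (the function names train_toy_forest and toy_forest_predict are ours, chosen for illustration):

def train_toy_forest(data, n_trees=5, depth=4):
    # Train each tree on a bootstrap sample (drawn with replacement)
    forest = []
    for _ in range(n_trees):
        idx = np.random.randint(0, len(data), len(data))
        tree = DecisionTree(depth=depth)
        tree.train(data[idx])
        forest.append(tree)
    return forest

def toy_forest_predict(forest, vector):
    # Majority vote over the individual tree predictions
    votes = [tree.predict(vector) for tree in forest]
    return np.argmax(np.bincount(votes))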

This time we will use the scikit-learn implementation.

nt = 250          # number of trees
cr = 'entropy'    # split criterion
depth = 7         # maximum tree depth

# Generate the RF classifier
rf = RandomForestClassifier(n_estimators=nt, criterion=cr, max_depth=depth, bootstrap=True, n_jobs=-1)

# Train the classifier
rf.fit(all_samples[:,:2], all_samples[:,2])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=7, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
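
The trained model can now classify new samples directly, e.g. a query point near the centre of the first cluster (an illustrative example):

# Predict a single query point near the red cluster (expected: class 1)
print(rf.predict([[25, 25]]))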

Evaluation

Again, we classify every point of the feature space to visualize the decision boundaries.

def predict_feature_space_rf(model):
    label_map = np.zeros((100, 100, 3), dtype=np.ubyte)
    # Collect all grid points and predict them in a single batch call
    feature_space = []
    for x in range(100):
        for y in range(100):
            feature_space.append((x,y))
    feature_space = np.array(feature_space)
    labels = model.predict(feature_space)
    for x in range(100):
        for y in range(100):
            label = labels[x*100+y]
            label_map[y, x] = colours[int(label)-1]
    return label_map
label_map_rf = predict_feature_space_rf(rf)

plt.imshow(label_map_rf)
plt.scatter(*zip(*c1_samples), c='red', marker='o', label='Samples of class 1')
plt.scatter(*zip(*c2_samples), c='green', marker='o', label='Samples of class 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='o', label='Samples of class 3')
plt.title('RF (250 trees)'); plt.legend(); plt.show()

[Figure: decision boundaries of the random forest with the training samples]

We see that, although the maximum depth equals that of the overfitted single tree, we do not observe strong overfitting.
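One quick way to see the effect of the ensemble size is to repeat the experiment with a single tree (a sketch; with n_estimators=1 the forest degenerates to one deep tree and the thin, outlier-driven regions reappear):

# A "forest" consisting of a single tree, for comparison
rf_single = RandomForestClassifier(n_estimators=1, criterion=cr, max_depth=depth, bootstrap=True)
rf_single.fit(all_samples[:, :2], all_samples[:, 2])

label_map_single = predict_feature_space_rf(rf_single)
plt.imshow(label_map_single)
plt.title('RF (1 tree)'); plt.show()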

A further benefit of random forests is that probabilities can be derived from the voting statistics. In the next example we visualize the predicted class probabilities instead of the final class label for the complete feature space.

def predict_feature_space_rf_prob(model):
    label_map = np.zeros((100, 100, 3), dtype=np.ubyte)
    # Collect all grid points and predict their class probabilities in one batch
    feature_space = []
    for x in range(100):
        for y in range(100):
            feature_space.append((x,y))
    feature_space = np.array(feature_space)
    probas = model.predict_proba(feature_space)
    for x in range(100):
        for y in range(100):
            proba = probas[x*100+y]
            # Blend the class colours, weighted by the predicted probabilities
            label_map[y, x] = (proba @ np.array(colours)).astype(np.ubyte)
    return label_map
label_map_rf = predict_feature_space_rf_prob(rf)

plt.imshow(label_map_rf)
plt.scatter(*zip(*c1_samples), c='red', marker='o', label='Samples of class 1')
plt.scatter(*zip(*c2_samples), c='green', marker='o', label='Samples of class 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='o', label='Samples of class 3')
plt.title('RF (250 trees)'); plt.legend(); plt.show()

[Figure: class-probability map of the random forest with the training samples]

Discussion

Decision trees and random forests are both frequently used models, since both training and evaluation are fast and tuning such a model is comparatively easy. We saw that training a random forest requires only a few lines of code when using the scikit-learn implementation.


Author: Dennis Wittich
Last modified: 09.05.2019