**The k-NN Classifier Algorithm**

Given a training set X_train with labels y_train, and given a new instance x_test to be classified:

The k-NN Classifier simply memorises the entire training set. And then to classify a new instance does 3 steps.

- First, it finds the k-Nearest most similar instances to the new instance in the training set.
- Then it gets the labels of those training instances.
- Predicts the label of the new instance as a function of the nearby training labels typically by a simple majority vote.

Here’s how a k-Nearest Neighbour Classifier using only one nearest neighbour, that is with k equal to 1, makes these predictions for the simple binary synthetic dataset. Here we’re applying the nearest neighbours classifier to our simple binary classification problem. Where the points in class zero are labelled with yellow dots and the points in class one are labelled with black dots.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 |
import matplotlib.patches as mpatches import matplotlib.pyplot as plt from sklearn import neighbors import numpy def plot_two_class_knn(X, y, n_neighbors, weights, X_test, y_test): X_mat = X y_mat = y # Create color maps cmap_light = ListedColormap(['#FFFFAA', '#AAFFAA', '#AAAAFF','#EFEFEF']) cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000']) clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights) clf.fit(X_mat, y_mat) # Plot the decision boundary by assigning a color in the color map # to each mesh point. mesh_step_size = .01 # step size in the mesh plot_symbol_size = 50 x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1 y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1 xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, mesh_step_size), numpy.arange(y_min, y_max, mesh_step_size)) Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()]) # Put the result into a color plot Z = Z.reshape(xx.shape) plt.figure() plt.pcolormesh(xx, yy, Z, cmap=cmap_light) # Plot training points plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor = 'black') plt.xlim(xx.min(), xx.max()) plt.ylim(yy.min(), yy.max()) title = "Neighbors = {}".format(n_neighbors) if (X_test is not None): train_score = clf.score(X_mat, y_mat) test_score = clf.score(X_test, y_test) title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score) patch0 = mpatches.Patch(color='#FFFF00', label='class 0') patch1 = mpatches.Patch(color='#000000', label='class 1') plt.legend(handles=[patch0, patch1]) plt.xlabel('Feature 0') plt.ylabel('Feature 1') plt.title(title) plt.show() from sklearn.datasets import make_classification, make_blobs from matplotlib.colors import ListedColormap cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000']) # synthetic dataset for classification (binary) plt.figure() plt.title('Sample binary classification problem with two informative features') X_C2, y_C2 = make_classification(n_samples = 100, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, flip_y = 0.2, class_sep = 0.5, random_state=0) plt.scatter(X_C2[:, 0], X_C2[:, 1], c=y_C2, marker= 'o', s=50, cmap=cmap_bold) plt.show() # more difficult synthetic dataset for classification (binary) # with classes that are not linearly separable X_D2, y_D2 = make_blobs(n_samples = 100, n_features = 2, centers = 8, cluster_std = 1.3, random_state = 4) y_D2 = y_D2 % 2 plt.figure() plt.title('Sample binary classification problem with non-linearly separable classes') plt.scatter(X_D2[:,0], X_D2[:,1], c=y_D2, marker= 'o', s=50, cmap=cmap_bold) plt.show() |

1 2 3 4 5 |
X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2, random_state=0) plot_two_class_knn(X_train, y_train, 1, 'uniform', X_test, y_test) plot_two_class_knn(X_train, y_train, 3, 'uniform', X_test, y_test) plot_two_class_knn(X_train, y_train, 11, 'uniform', X_test, y_test) |

If you run this code and compare the resulting training and test scores for k equals 1, 3, and 11, which are shown in the title of each plot, you can see the effect of model complexity on a models ability to generalize. Here we are also showing how the entire feature space is been broken up into different decision regions according to predictions that the k-NN classifier would make at each point in the decision space.

You can see that the one nearest neighbours classifier is over-fitting the training data in this case. It’s trying to get correct predictions for every single training point while ignoring the generalised trend between the two classes.

In the k = 1 case, the training score is a perfect 100%. But the test score is only 64%. As k increases to 3, the training score drops to 80% but the test score rises slightly to 72%, indicating the model is generalizing better to new data. When k = 11, the training score drops a bit further to 73%, but the test score even better at 80%, indicating that this simple model is much more effective at ignoring minor variations in training data.

The best k-NN model is the sweet spot where test set score is maximum.

The k-Nearest Neighbour Classifiers can be applied to any number of classes, not just 2. For further parameters of k-NN and their effects on decision models please visit sklearn k-NN documentation.

The nearest neighbours approach isn’t useful just for classification. You can use it for regression too. In our next blog, we will see the effects of k-NN regression modelling. Happy classification 🙂

## Recent Comments