Non Parametric Models

KNN (K-Nearest Neighbors)

The k-Nearest Neighbors (k-NN) algorithm is a type of supervised machine learning algorithm used for both classification and regression tasks. However, it’s more commonly used for classification purposes. The “k” in k-NN represents a number specified by the user, and it refers to the number of nearest neighbors in the data that the algorithm will consider to make a prediction for a new data point.

How It Works

  1. Choose the number of k and a distance metric: First, you decide on the number of neighbors, “k”, and the method for measuring distance between data points (common metrics include Euclidean, Manhattan, and Hamming distance).

  2. Find the k-nearest neighbors: For a given data point that you want to classify or predict its value, the algorithm identifies the k nearest data points in the training dataset based on the distance metric.

  3. Make predictions:

    • For classification, the algorithm assigns the class to the new data point based on the majority vote of its k nearest neighbors.

    • For regression, it predicts the value for the new point based on the average (or another aggregate measure) of the values of its k nearest neighbors.


Imagine you have a dataset of fruits, where each fruit is described by two features: weight and color (let’s simplify color to a numerical value for the purpose of this example, where 1 = green, 2 = yellow, 3 = red), and you’re trying to classify them as either “Apple” or “Banana”.

Now, you have a new fruit that you want to classify, and this fruit weighs 150 grams and has a color value of 1 (green).

If you choose k=3 (looking at the three nearest neighbors), and the three closest fruits in your dataset to this new fruit are:

  • Fruit 1: 145 grams, color 1, labeled “Apple”

  • Fruit 2: 160 grams, color 2, labeled “Apple”

  • Fruit 3: 155 grams, color 1, labeled “Banana”

Based on the majority vote among the nearest neighbors, the algorithm would classify the new fruit as an “Apple” because two out of three nearest neighbors are labeled as “Apple”.

This example simplifies the concept to make it easier to understand. In practice, the k-NN algorithm can handle datasets with many more features and more complex decision boundaries. The key takeaway is that the k-NN algorithm relies on the similarity of the nearest observations in the feature space to make predictions.