1. Introduction:

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks, although it is most often used for classification. It is most useful on small datasets, because on large datasets it requires a lot of time and memory.

2. Properties of KNN:

  • Lazy learning algorithm − KNN is a lazy learning algorithm because it does not explicitly learn a model: during the training phase it simply stores all of the training data, and at testing time it uses the whole training set to make a prediction. This makes the training process very fast, but the real concerns are the large memory consumption needed to store all the training data and the time complexity at the testing phase, since every prediction requires a pass through the entire training set.

3. Steps of KNN algorithm:

The KNN algorithm uses feature similarity to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set. Below are the steps of KNN for both classification and regression:

Step 1 − The first step is to load the training as well as test data

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer

Step 3 − For each point in the test data, do the following:

  • 3.1 − Calculate the distance between the test point and each row of the training data using the Euclidean, Manhattan, Minkowski, or Hamming distance. The most commonly used method to calculate distance is Euclidean

  • 3.2 − Sort the training rows in ascending order of distance

  • 3.3 − Take the top K rows from the sorted list

  • 3.4 − Assign the test point the majority class of these K rows (classification) or the average of their target values (regression)
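To make these steps concrete, here is a minimal from-scratch sketch of the classification case (a toy illustration, not the scikit-learn API; the function and variable names are our own):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k):
        # Step 3.1: Euclidean distance from the query point to every training row
        distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
        # Steps 3.2-3.3: sort by distance and keep the K nearest rows
        nearest = np.argsort(distances)[:k]
        # Step 3.4: majority vote among the K nearest labels
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]])
    y_train = np.array([0, 0, 1])
    print(knn_predict(X_train, y_train, np.array([1.5, 2.5]), k=1))  # prints 0

For regression, the last step would return the mean of the neighbors' target values instead of the majority vote.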

4. Distance metrics options:

The distance metrics used in KNN for regression and classification problems are listed below:

  • Euclidean distance

Euclidean distance is the square root of the sum of the squared differences between two points. It is also known as the L2 norm.

  • Manhattan distance

Manhattan distance is the sum of the absolute values of the differences between two points. It is also known as the L1 norm.

  • Minkowski Distance

Minkowski distance is a generalized metric for measuring the distance between two points. When p=1, it becomes Manhattan distance, and when p=2, it becomes Euclidean distance.

Minkowski distance between two points X = (x1, …, xn) and Y = (y1, …, yn): D(X, Y) = (Σ |xi − yi|^p)^(1/p)

  • Hamming Distance

Hamming distance is used for categorical variables. In simple terms, it counts the number of positions at which two categorical vectors differ; for a single pair of values, it simply tells us whether the two values are the same or not.

Hamming distance between categorical data: D(X, Y) = Σ [xi ≠ yi], i.e. an XOR per position, summed over all positions.
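For illustration, here is a minimal sketch of these four metrics in Python (NumPy-based; the helper names are our own, not a library API):

    import numpy as np

    def euclidean(x, y):
        # L2 norm: square root of the sum of squared differences
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan(x, y):
        # L1 norm: sum of absolute differences
        return np.sum(np.abs(x - y))

    def minkowski(x, y, p):
        # Generalized metric: p=1 gives Manhattan, p=2 gives Euclidean
        return np.sum(np.abs(x - y) ** p) ** (1 / p)

    def hamming(x, y):
        # Count of positions at which two categorical vectors differ
        return np.sum(x != y)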

5. KNN hyperparameter — K:

K is a hyperparameter in both KNN classification and KNN regression: it is the number of neighbors to consider when predicting the class/value of a query data point.

Depending on the value of K, the KNN classifier selects the K nearest data points and assigns the query data point the majority class among those K neighbors. In the example below, the query point is assigned class B when K=1 and class A when K=7.

[Figure: KNN on classification data]

In the case of regression data, the KNN regressor assigns the query data point the value equal to the average of its K nearest data points:

[Figure: KNN on regression data]

The value of K also determines the decision boundary between classes:

[Figure: decision boundaries of a KNN classifier for different values of K]

In the above image of the KNN classifier (the same holds for the KNN regressor):

K=1 → non-smooth decision curve that follows the train data exactly → extremely high accuracy on train data → low accuracy on test data → overfit → high variance

K = neither too large nor too small → neither overfit nor underfit → smooth decision surface → less prone to noise

K = n (the number of train data points) → no decision surface at all → every query point is classified as the majority class → underfit → high bias
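One way to see this in practice is to compare train and test accuracy as K grows. A short sketch (assuming scaled arrays X_train, X_test, y_train, y_test already exist, as in the code further below):

    from sklearn.neighbors import KNeighborsClassifier

    for k in (1, 7, len(X_train)):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        # K=1 tends to score near-perfectly on train data but worse on test
        # data (overfit); K=n collapses to predicting the majority class (underfit)
        print(k, model.score(X_train, y_train), model.score(X_test, y_test))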

6. How to select the value of K:

In KNN, finding the value of K is not easy. A small value of K means that noise will have a higher influence on the result, and a large value makes prediction computationally expensive. K is a crucial parameter in the KNN algorithm. Some suggestions for choosing the value of K are:

  1. It is recommended to choose an odd number for K if the number of classes is 2. Another simple approach is to select K = sqrt(n), where n is the number of data points in the training dataset (see the sketch below).
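A minimal sketch of the sqrt(n) heuristic, nudging K to an odd number to avoid ties in binary classification (X_train is assumed to be the training feature matrix):

    import math

    n = len(X_train)       # number of training data points (assumed to exist)
    k = int(math.sqrt(n))
    if k % 2 == 0:
        k += 1             # prefer an odd K for 2-class problems
    print(k)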

7. Data Preparation for KNN:

  1. Data Scaling: To locate a data point in the multidimensional feature space, it helps if all features are on the same scale. Hence, normalization or standardization of the data will help

8. KNN for Classification:

In the Python implementation of the KNN classifier, we generate data using sklearn.datasets.make_classification and split the data into train and test:

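The original code image is not reproduced here; below is a minimal reconstruction (parameter values such as n_samples and random_state are assumptions):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # 2-class dataset (parameter values assumed)
    X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                               random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)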

Now scale the input features:

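A sketch of the scaling step using StandardScaler (the scaler is fitted on the train split only, so no test statistics leak into training):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train)  # learn mean/std from train data only
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)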

Now, find the value of K, for which we get maximum Accuracy or lowest error:

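A sketch of the search loop (the range of K values is an assumption):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    for k in range(1, 21):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"K={k}: accuracy={acc:.2f}")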

The output of the above code is:

[Figure: accuracy values for different K]

We can see that at K=7, we get maximum accuracy of 0.96, and that's the best value of K.

9. KNN for Regression:

In the Python implementation of the KNN regressor, we generate data using sklearn.datasets.make_regression and split the data into train and test:

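Again, the original code image is not reproduced here; a minimal reconstruction (parameter values are assumptions):

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # Regression dataset (parameter values assumed)
    X, y = make_regression(n_samples=1000, n_features=8, noise=10.0,
                           random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)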

Distance-based algorithms like KNN, K-means, and SVM are heavily affected by the range of the features, because they use distances between data points to determine similarity. Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result:

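A sketch of the normalization step using MinMaxScaler (again fitted on the train split only):

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler().fit(X_train)  # learn min/max from train data only
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)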

Now, find the value of K for which we get the minimum Root Mean Square Error (RMSE):

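A sketch of the search loop (the range of K values is an assumption):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    for k in range(1, 21):
        model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        print(f"K={k}: RMSE={rmse:.2f}")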

Below is the output for different values of K; we can see that the RMSE is minimum at K=3:

[Figure: RMSE output for different values of K]

Below is the plot of RMSE for different values of K:

[Figure: RMSE vs. K]

We can also apply GridSearchCV to find the best value of parameter K:
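A sketch of the grid search (the parameter range and cv value are assumptions):

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    params = {"n_neighbors": list(range(1, 21))}  # search range is an assumption
    grid = GridSearchCV(KNeighborsRegressor(), params,
                        scoring="neg_root_mean_squared_error", cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)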

Below is the output of the above code:

[Figure: GridSearchCV output showing the best value of K]

10. Pros of using KNN:

  • This algorithm is very easy to understand and interpret

11. Cons of using KNN:

  • It is a computationally expensive algorithm, because it stores all of the training data and must scan it for every prediction

12. Applications of KNN:

The following are some areas in which KNN can be applied successfully:

  • Banking System: KNN can be used in banking systems to predict whether an individual is a good candidate for loan approval, i.e. whether that individual has characteristics similar to those of defaulters

Other areas in which the KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition, and Video Recognition.

