1. Introduction:

K-Nearest Neighbor, or KNN, is a supervised machine learning algorithm used for both Classification and Regression tasks, although it is most often used for classification problems. It is best suited to small datasets, because it requires a lot of time and memory as the dataset grows large.

2. Properties of KNN:

  • Lazy learning algorithm − KNN is a lazy learning algorithm because it does not explicitly learn a model: in the training phase it simply stores all the training data as-is, and in the testing phase it uses the whole training set for prediction. This makes training very fast, but the real concerns are the large memory consumption needed to save all the training data and the time complexity at the testing phase, because every prediction requires a pass through all the training data
  • Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it makes no assumptions about the underlying data

3. Steps of KNN algorithm:

The KNN algorithm uses feature similarity to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. Below are the steps of KNN for Classification and Regression, followed by a minimal from-scratch sketch:

Step 1 − The first step is to load the training as well as test data

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer

Step 3 − For each point in the test data, do the following:

  • 3.1 − Calculate the distance between the test point and each row of the training data using the Euclidean, Manhattan, Minkowski, or Hamming distance. The most commonly used metric is the Euclidean distance
  • 3.2 − Sort the training rows in ascending order of their distance to the test point
  • 3.3 − Select the top K rows from the sorted array
  • 3.4 − Based on the type of model (Regression or Classification), apply the corresponding step:
    • 3.4.1 − In the case of Classification, assign the test point the class that receives the majority vote, i.e. the most frequent class among the K selected rows. If K is even, the majority vote can result in ties, so it is advised to avoid even values of K
    • 3.4.2 − In the case of Regression, assign the test point the mean of the target values of the K selected rows.
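
Putting these steps together, a minimal from-scratch sketch in Python could look like the following (the helper function and the tiny toy dataset are illustrative, not part of the original article):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, classify=True):
    """Predict the label (classification) or value (regression) of one query point."""
    # Step 3.1: Euclidean distance from the query point to every training row
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Steps 3.2-3.3: indices of the K nearest training rows
    nearest = np.argsort(distances)[:k]
    if classify:
        # Step 3.4.1: majority vote among the K nearest labels
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Step 3.4.2: mean of the K nearest target values
    return y_train[nearest].mean()

# Illustrative toy data
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))  # -> 0
```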

4. Distance metrics options:

Distance metrics used in KNN for Regression and Classification problems are below:

  • Euclidean distance

Euclidean distance is the square root of the sum of the squared differences between the coordinates of two points. It is also known as the L2 norm.

  • Manhattan distance

Manhattan distance is the sum of the absolute values of the differences between the coordinates of two points. It is also known as the L1 norm.

  • Minkowski Distance

Minkowski distance is a generalized distance metric between two points, parameterized by an order p. When p=1, it becomes the Manhattan distance, and when p=2, it becomes the Euclidean distance.

Minkowski distance between 2 points (X1, Y1) and (X2, Y2): D = (|X1 − X2|^p + |Y1 − Y2|^p)^(1/p)

  • Hamming Distance

Hamming distance is used for categorical variables. In simple terms, it counts the number of positions at which two categorical vectors differ; for a single pair of values, it tells us whether the two categories are the same or not.

Hamming distance between categorical data: XOR
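
All four metrics can be computed with scipy or plain numpy; below is a small sketch on two illustrative vectors (the values are made up purely for demonstration):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))       # L2 norm: sqrt of the sum of squared differences
print(distance.cityblock(a, b))       # L1 norm: sum of absolute differences
print(distance.minkowski(a, b, p=1))  # p=1 reduces to the Manhattan distance
print(distance.minkowski(a, b, p=2))  # p=2 reduces to the Euclidean distance

# Hamming distance for categorical vectors: fraction of positions that differ
u = np.array(["red", "small", "round"])
v = np.array(["red", "large", "round"])
print(np.mean(u != v))                # 1/3 of the positions differ
```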

5. KNN hyperparameter — K:

K is a hyperparameter of KNN for both Classification and Regression: it is the number of neighbors considered when predicting the class/value of a query data point.

Depending on the value of K, the KNN classifier selects the nearest K data points and assigns the query data point the class which is equal to the majority vote of the K neighbors. In the below example, the query point is assigned class B when K=1 and class A when K=7.

KNN on classification data

In the case of Regression data, the KNN Regressor assigns the query data point the value equal to the average of its K nearest data points:

KNN on Regression data

K value determines the decision boundary between classes:

Decision boundary in case of Classification data when using KNN Classifier and different K

In the above image of the KNN classifier (the same behavior holds for the KNN Regressor):

K=1 → decision boundary is non-smooth and fits the train data exactly → extremely high accuracy on train data → low accuracy on test data → Overfit / High Variance

K = neither too large nor too small → neither overfit nor underfit → smooth decision surface → less prone to noise.

K = n (the number of training data points) → no decision surface is generated → all query points are classified as the majority class → Underfit / High Bias

6. How to select the value of K:

In KNN, finding the value of K is not easy. A small value of K means that noise will have a higher influence on the result, and a large value makes it computationally expensive. K is a crucial parameter in the KNN algorithm. Some suggestions for choosing K Value are:

  1. It is recommended to choose an odd number for K if the number of classes is 2. Another simple approach is to select K = sqrt(n), where n is the number of data points in the training dataset.
  2. Domain knowledge is very useful in deciding the K value.
  3. Using error curves: The figure below shows error curves for training and test data across different values of K. At low values of K the model overfits (high variance): the train error is low and the test error is high. At K=1 the training error is always zero, because the nearest neighbor of any training point is the point itself; so although the training error is low, the test error is high at lower K values. As we increase K, the test error is reduced, but beyond a certain K value bias/underfitting sets in and the test error rises again. In other words, the test error is initially high (due to variance), then drops and stabilizes, and with a further increase in K it increases again (due to bias). The K value at which the test error stabilizes at its lowest is taken as the optimal K; from the error curve shown, we can choose K=8 for our KNN implementation. A sketch of how such error curves can be generated follows this list.
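
A sketch of how such train/test error curves can be generated, assuming an illustrative dataset from make_classification rather than the article's own data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data; the article's dataset may differ
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

ks = range(1, 31)
train_err, test_err = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err.append(1 - knn.score(X_train, y_train))   # training error
    test_err.append(1 - knn.score(X_test, y_test))      # test error

plt.plot(ks, train_err, label="train error")
plt.plot(ks, test_err, label="test error")
plt.xlabel("K")
plt.ylabel("error rate")
plt.legend()
plt.show()
```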

7. Data Preparation for KNN:

  1. Data Scaling: To locate the data point in multidimensional feature space, it would be helpful if all features are on the same scale. Hence, normalization or standardization of data will help
  2. Dimensionality Reduction: KNN may not work well if there are too many features. Hence, dimensionality reduction techniques like feature selection or principal component analysis can be applied
  3. Missing value treatment: If, out of M features, the value of one feature is missing for a particular example in the training set, then we cannot locate that point in feature space or compute distances from it. Therefore, that row must be deleted or the missing value imputed (a small preprocessing sketch follows this list)
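
As a sketch, all three preparation steps can be chained in a single scikit-learn Pipeline; the strategies and parameter values below are illustrative defaults, not taken from the article:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Illustrative preprocessing chain: impute missing values, scale, reduce dimensions, then KNN
knn_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=5)),              # assumes at least 5 input features
    ("knn", KNeighborsClassifier(n_neighbors=7)),
])
# knn_pipeline.fit(X_train, y_train); knn_pipeline.predict(X_test)
```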

8. KNN for Classification:

In the Python implementation of the KNN Classifier, we generate data using sklearn's make_classification and split the data into train and test sets:

Generate Classification data and split into train and test
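
A minimal sketch of this step; the sample size, feature counts, and random seeds below are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative sizes and seeds
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)
```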

Now scale the input features:

Scale the input features X
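
A sketch of the scaling step, fitting the scaler on the training split only (the use of StandardScaler here is an assumption):

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training data only, then apply the same transform to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```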

Now, find the value of K for which we get the maximum accuracy (lowest error):

Find the best value of K
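
A sketch of the K search; the exact accuracy values depend on the generated data and seeds, so they may differ from the numbers quoted below:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Try a range of K values and report the test accuracy for each
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_scaled))
    print(f"K={k}: accuracy={acc:.2f}")
```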

The output of the above code is:

Accuracy value for different K

We can see that at K=7, we get maximum accuracy of 0.96, and that's the best value of K.

9. KNN for Regression:

In the Python implementation of the KNN Regressor, we generate data using sklearn's make_regression and split the data into train and test sets:

Generate Regression data and split into train and test
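
A minimal sketch of this step; the sample size, noise level, and random seed are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Illustrative sizes, noise, and seeds
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)
```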

Distance-based algorithms like KNN, K-means, and SVM are most affected by the range of features. This is because they are using distances between data points to determine their similarity and hence perform the task at hand. Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result:

Normalization of input features X
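
A sketch of the normalization step; the choice of MinMaxScaler here is an assumption:

```python
from sklearn.preprocessing import MinMaxScaler

# Normalize every feature to the [0, 1] range, fitting on the training data only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```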

Now, find the value of K for which we get the minimum Root Mean Square Error (RMSE):

Find K that gives min value of RMSE
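
A sketch of the K search and the RMSE plot; the exact RMSE values depend on the generated data and seeds:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

ks = range(1, 21)
rmse_values = []
for k in ks:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test_scaled)))
    rmse_values.append(rmse)
    print(f"K={k}: RMSE={rmse:.3f}")

# Plot RMSE against K to see where the error bottoms out
plt.plot(ks, rmse_values, marker="o")
plt.xlabel("K")
plt.ylabel("RMSE")
plt.show()
```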

Below is the output for different values of K; we can see that the RMSE value is minimum at K=3:

The output of RMSE for different values of K

Below is the plot for different values of K and RMSE:

We can also apply GridSearchCV to find the best value of parameter K:
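
A sketch of the grid search; the K range, number of cross-validation folds, and scoring choice are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Search over K with 5-fold cross-validation on the (scaled) training set
param_grid = {"n_neighbors": list(range(1, 21))}
grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)   # best K found by cross-validation
print(-grid.best_score_)   # corresponding cross-validated RMSE
```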

Below is the output of the above code:

10. Pros of using KNN:

  • This algorithm is very easy to understand and interpret
  • It is very useful for nonlinear data because there is no assumption about data in this algorithm
  • It is a versatile algorithm, as we can use it for classification as well as regression
  • Furthermore, it has relatively high accuracy, although there are much better supervised learning models than KNN

11. Cons of using KNN:

  • It is a computationally somewhat expensive algorithm, because it stores all the training data and compares every query against it
  • High memory storage is required as compared to other supervised learning algorithms.
  • Prediction is slow in the case of large data
  • It is very sensitive to the scale of data as well as irrelevant features and outliers

12. Applications of KNN:

The following are some areas in which KNN can be applied successfully:

  • Banking System: KNN can be used in banking systems to predict whether an individual is a good candidate for loan approval, i.e. whether that individual has characteristics similar to those of known defaulters
  • Calculating Credit Ratings: KNN algorithms can be used to find an individual’s credit rating by comparing it with people who have similar traits
  • Politics: With the help of KNN algorithms, we can classify a potential voter into various classes like “Will Vote”, “Will Not Vote”, “Will Vote for Party A”, “Will Vote for Party B”

Other areas in which the KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition, and Video Recognition.

