K-Nearest Neighbors Algorithm in Python

The k-nearest neighbors (KNN) algorithm is a machine learning algorithm used for both classification and regression, and it can be applied to the same kinds of labeled-data problems as other supervised learning algorithms. The typical problem setting is as follows.

Suppose you have pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn), in which the x's are attributes and the y's are labels. Now suppose another instance arrives without a label, i.e. (xn+1, Unknown). The machine learning classification task is to predict the unknown label of the instance (xn+1, Unknown).

Data Set Description-

I have taken a portion of the Iris dataset, which has four features: the sepal length, sepal width, petal length and petal width of a flower. Based on those features, a flower is labeled as Iris-setosa, Iris-versicolor or Iris-virginica. There are 10 instances in the table: 9 have labels, but the last one has no label, so its label is Unknown. The task is to predict the unknown label based on the 9 labeled instances.

Working of K-Nearest Neighbors Algorithm-

It works on a voting system in which the majority wins. Suppose there are 9 instances (records) with labels, but instance no. 10 has no label.

Suppose you decide that 5 neighbors will vote. Here, voting means calculating the distance from the unlabeled instance to every labeled instance and taking the 5 instances with the smallest distance. You then check which class holds the majority among those 5 instances, and the unlabeled instance is given that class.

To calculate the distance, I have used the Euclidean distance formula in four dimensions.
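The distance step can be sketched in plain Python. This is a minimal illustration, not the code the article's program runs: the feature values and the unlabeled instance are copied from the Python program in this article, while the names instances, query, euclidean and ranked are made up for the example.

```python
import math

# Training instances copied from the program in this article:
# [sepal length, sepal width, petal length, petal width]
instances = [
    [4.6, 3.4, 1.4, 0.3],  # 1
    [5, 3.4, 1.5, 0.2],    # 2
    [4.4, 2.9, 1.4, 0.2],  # 3
    [7, 3.2, 4.7, 1.4],    # 4
    [6.4, 3.2, 4.5, 1.5],  # 5
    [6.9, 3.1, 4.9, 1.5],  # 6
    [6.3, 3.3, 6, 2.5],    # 7
    [5.8, 2.7, 5.1, 1.9],  # 8
    [7.1, 3, 5.9, 2.1],    # 9
]
query = [5, 3.3, 1.4, 0.2]  # the unlabeled instance


def euclidean(a, b):
    """Square root of the summed squared differences, one term per feature."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (instance number, distance to the query) pairs, closest first
ranked = sorted(
    ((number, euclidean(row, query)) for number, row in enumerate(instances, start=1)),
    key=lambda pair: pair[1],
)

for number, distance in ranked:
    print(f"instance {number}: {distance:.3f}")
```

With these values, instances 2, 1 and 3 come out closest, followed by instances 5 and 4.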

From the data set you can see that instances 1, 2 and 3 belong to Iris-setosa, instances 4, 5 and 6 to Iris-versicolor, and instances 7, 8 and 9 to Iris-virginica.

If you take the 3 nearest neighbors, you will see that instances 1, 2 and 3 have the smallest distance, and all three are Iris-setosa.

So, you can say that the Unknown label is Iris-setosa.

If you take the 5 nearest neighbors, you will see that instances 1, 2, 3, 4 and 5 have the smallest distance, of which 1, 2 and 3 are Iris-setosa and 4 and 5 are Iris-versicolor.

So, you can say that the Unknown label is Iris-setosa.

With the 7 nearest neighbors you cannot decide by simple majority, because 3 are Iris-setosa, 3 are Iris-versicolor and one is Iris-virginica, so the vote is tied.

So, you should know that many factors, such as the size of the data set and the number of neighbors, play an important role in deciding a label.
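The three cases above (k = 3, 5 and 7) can be checked with a short majority-vote sketch. It is only an illustration under stated assumptions: the data is copied from the Python program in this article, the helper names training, vote and decide are hypothetical, and reporting a tie as None is a choice made for this example rather than what any particular library does. math.dist is in the standard library from Python 3.8.

```python
import math
from collections import Counter

# (features, label) pairs copied from the program in this article
training = [
    ([4.6, 3.4, 1.4, 0.3], "Iris-setosa"),
    ([5, 3.4, 1.5, 0.2], "Iris-setosa"),
    ([4.4, 2.9, 1.4, 0.2], "Iris-setosa"),
    ([7, 3.2, 4.7, 1.4], "Iris-versicolor"),
    ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
    ([6.9, 3.1, 4.9, 1.5], "Iris-versicolor"),
    ([6.3, 3.3, 6, 2.5], "Iris-virginica"),
    ([5.8, 2.7, 5.1, 1.9], "Iris-virginica"),
    ([7.1, 3, 5.9, 2.1], "Iris-virginica"),
]
query = [5, 3.3, 1.4, 0.2]  # the unlabeled instance


def vote(k):
    """Tally the labels of the k training instances nearest to the query."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest)


def decide(k):
    """Return the winning label, or None when the top count is shared."""
    counts = vote(k).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: simple majority voting cannot decide
    return counts[0][0]


for k in (3, 5, 7):
    print(k, dict(vote(k)), "->", decide(k) or "tie")
```

An odd k avoids ties only when there are two classes; with three classes, as here, a tie is still possible, so a real implementation needs some tie-breaking rule.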

Python Program of K-Nearest Neighbors-

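from sklearn.neighbors import KNeighborsClassifier

# Each instance: [sepal length, sepal width, petal length, petal width]
i1 = [4.6, 3.4, 1.4, 0.3]
i2 = [5, 3.4, 1.5, 0.2]
i3 = [4.4, 2.9, 1.4, 0.2]
i4 = [7, 3.2, 4.7, 1.4]
i5 = [6.4, 3.2, 4.5, 1.5]
i6 = [6.9, 3.1, 4.9, 1.5]
i7 = [6.3, 3.3, 6, 2.5]
i8 = [5.8, 2.7, 5.1, 1.9]
i9 = [7.1, 3, 5.9, 2.1]
X_train = [i1, i2, i3, i4, i5, i6, i7, i8, i9]

# Labels, in the same order as the rows of X_train
irst, irver, irvirg = "Iris-setosa", "Iris-versicolor", "Iris-virginica"
y_train = [irst, irst, irst, irver, irver, irver, irvirg, irvirg, irvirg]

# n_neighbors is k; p=2 selects the Euclidean distance
model = KNeighborsClassifier(n_neighbors=3, p=2)
model.fit(X_train, y_train)

# The unlabeled instance whose label is to be predicted
X_pred = [[5, 3.3, 1.4, 0.2]]
Unknown = model.predict(X_pred)
print("Unknown=", Unknown)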

Program Description-

In the program, the instances are stored in the variables i1, i2, i3, i4, i5, i6, i7, i8 and i9, and the labels in the variables irst, irver and irvirg. The two-dimensional list X_train behaves like a table, and the labels are provided to the program through y_train.

In the KNeighborsClassifier() constructor, n_neighbors is the number of neighbors and p=2 selects the Euclidean distance. The fit() method trains the model and the predict() method predicts the label.
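To see which training rows actually take part in the vote, scikit-learn's kneighbors() method can be called on the fitted classifier. The sketch below is a variation on the program above, not part of the original: it assumes scikit-learn is installed, rebuilds the same data in a more compact way, and the predictions dictionary is a name introduced only for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

# Same training data as the program in this article
X_train = [
    [4.6, 3.4, 1.4, 0.3], [5, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
    [7, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
    [6.3, 3.3, 6, 2.5], [5.8, 2.7, 5.1, 1.9], [7.1, 3, 5.9, 2.1],
]
y_train = ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3
X_pred = [[5, 3.3, 1.4, 0.2]]

predictions = {}  # hypothetical helper: maps k to the predicted label
for k in (3, 5):
    model = KNeighborsClassifier(n_neighbors=k, p=2)
    model.fit(X_train, y_train)
    # kneighbors() returns the distances to, and the 0-based row indices of,
    # the k training rows nearest to each query row
    distances, indices = model.kneighbors(X_pred)
    predictions[k] = model.predict(X_pred)[0]
    print(f"k={k}: rows {indices[0].tolist()} -> {predictions[k]}")
```

Note that the row indices are 0-based, so row 0 is instance 1 in the article's numbering.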

Output of the program when k=3

Label predicted as Iris-setosa

Output of the program when k=5