Before the lab

Read

The k-Nearest Neighbour (KNN) algorithm is a classic machine learning
(pattern classification) algorithm. You are given a training set of
numerical measurement vectors (e.g. lab attendance and height), each
labelled with a class (e.g. final grade). Given a new, unlabelled
measurement vector, the algorithm predicts its class by finding the k
closest training vectors and taking a vote among their labels.
In this lab we will work with the famous iris flower data set, where
the "class" is the species of flower.
You need to know the following definition from linear algebra:

distance(X, Y) = norm(X - Y)
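As a quick sanity check, you can evaluate this definition directly in
Octave (the vectors here are made up purely for illustration):

X = [0, 1, 1];
Y = [1, 1, 1];
norm(X - Y)            % Euclidean distance; here the answer is 1
sqrt(sum((X - Y).^2))  % the same thing, written out element-wise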
Complete the following function, which returns the row indices of the
k closest rows of data to v. In our data sets, we will assume the
first column is always the class label, and it should be ignored in
distance calculations.
function out = nearest(v, k, data)
  for i = 1:rows(data)
    dist(i) =
  endfor
  [sorted, indices] = sort(dist);
  out = sort(indices(1:k));
endfunction
%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(nearest(v,1,data) == 4)
%! assert(nearest(v,3,data) == [2,3,4])
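If you get stuck, one possible completion is sketched below. It is
not the only correct answer; it follows the convention (also used by
nearest2 later in this lab) that v has the same layout as a data row,
with the class label in column 1.

function out = nearest(v, k, data)
  for i = 1:rows(data)
    % Distance from v to row i, skipping the class label column.
    dist(i) = norm(data(i, 2:end) - v(2:end));
  endfor
  [sorted, indices] = sort(dist);
  out = sort(indices(1:k));
endfunction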
You can time your nearest neighbour query by creating
~/cs2613/labs/L20/bench.m with the following content:

load training
load testing
timeit(1000, @nearest, testing(1,:), 3, training);
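Here timeit is assumed to be the benchmarking helper from an earlier
lab, called as timeit(n, fun, args...). If you don't have it handy, a
minimal sketch with that calling convention might look like:

% Hypothetical stand-in for the course's timeit helper: call
% fun(varargin{:}) n times and report the elapsed wall-clock time.
function timeit(n, fun, varargin)
  tic;
  for i = 1:n
    fun(varargin{:});
  endfor
  printf("%d calls took %f seconds\n", n, toc);
endfunction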
In practice the function nearest would probably be combined with
something like the following; they are split out here to help with
debugging and testing. Use the parameter fun here rather than calling
nearest directly, so that we can swap in a different query function
later.
function out = findclass(v, k, data, fun=@nearest)
  selection =
  out = mode(data(selection, 1));
endfunction
%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(findclass(v,1,data) == 1)
%! assert(findclass(v,3,data) == 0)
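Again, if you get stuck, one possible body for findclass is sketched
below: query for the k nearest rows, then take the most common class
label among them.

function out = findclass(v, k, data, fun=@nearest)
  % Row indices of the k nearest neighbours, found by whichever
  % query function was passed in (nearest by default).
  selection = fun(v, k, data);
  % The prediction is the most common label among those rows.
  out = mode(data(selection, 1));
endfunction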
Here is a for-loop based version of a KNN classifier. The return
value is the success rate of the classifier (i.e. the fraction of the
time it gets the right answer on the test set).
function ratio = knn(k, training, testing, output=0)
  correct = 0;
  for i = 1:rows(testing)
    row = testing(i,:);
    if findclass(row, k, training) == row(1)
      correct += 1;
    endif
  endfor
  ratio = correct / rows(testing);
endfunction
You can try out the classifier with:

load testing
load training
knn(3, training, testing)
Vectorizing nearest
We need a couple of ideas from basic linear algebra to make this
vectorization work. We can make many copies of the input vector v,
without a loop, by multiplying it on the left by a column vector of
ones. Try this in Octave:

ones(10,1)*[1,2,3,4]

The result is a 10x4 matrix in which every row is a copy of
[1,2,3,4].
We can also avoid computing square roots by sorting on the squared
norm: sorting by norm(X)^2 gives the same order as sorting by
norm(X), and the squared norm is simply the dot product of a vector
with itself.
function out = nearest2(v, k, data)
  diff = data( ) - ones( ,1)*v(2:end);
  normsq = dot( );
  [sorted, indices] = sort(normsq);
  out = sort(transpose(indices(1:k)));
endfunction
%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(nearest2(v,1,data) == 4)
%! assert(nearest2(v,3,data) == [2,3,4])
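One possible way to fill in the blanks is sketched below; it relies
on dot accepting a third argument giving the dimension along which to
take dot products (2 means row-wise).

function out = nearest2(v, k, data)
  % Subtract v (minus its label) from every measurement row at once;
  % ones(rows(data),1)*v(2:end) stacks rows(data) copies of v(2:end).
  diff = data(:, 2:end) - ones(rows(data), 1)*v(2:end);
  % Row-wise squared distances, one per training row.
  normsq = dot(diff, diff, 2);
  [sorted, indices] = sort(normsq);
  out = sort(transpose(indices(1:k)));
endfunction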
Finally, check the speedup with this second knn function:
function ratio = knn2(k, training, testing, output=0)
  correct = 0;
  for i = 1:rows(testing)
    row = testing(i,:);
    if findclass(row, k, training, @nearest2) == row(1)
      correct += 1;
    endif
  endfor
  ratio = correct / rows(testing);
endfunction
You can use the following lines in bench.m to compare the two
versions:

timeit(10, @knn, 3, training, testing);
timeit(10, @knn2, 3, training, testing);