K-means Clustering

Keywords: K-means

In its original form, the K-means algorithm might be one of the most accessible algorithms in machine learning, and many books and courses start with it. However, if we convert the task that K-means deals with into a more mathematical form, more interesting aspects emerge.

Clustering Problem[1]

The first thing we should do before introducing the algorithm is to make the task clear, and a precise mathematical form is usually the best way to do that.

Clustering is a kind of unsupervised learning task, so there is no correct or incorrect solution. Clustering is similar to classification at prediction time, since the output of both is discrete. However, when training a classifier we always have a target corresponding to every input; clustering, in contrast, has no targets at all, and what we have is only

$$\{x_1,\cdots, x_N\}\tag{1}$$

where $x_i\in\mathbb{R}^D$ for $i=1,2,\cdots,N$. Our mission is to separate the dataset into $K$ groups (where $K$ is given before the task starts).

An intuitive clustering strategy is based on two considerations:

  1. the distance between data points in the same group should be as small as possible.
  2. the distance between data points in different groups should be as large as possible.

Based on these two points, some concepts are formed. The first is how to represent a group. We take

$$\mu_i:i\in\{1,2,\cdots, K\}\tag{2}$$

as the prototype associated with the $i$th group. A group always contains several points, and a natural idea is to use the center of all the points belonging to the group as its prototype. To represent which group $\boldsymbol{x}_i$ in equation (1) belongs to, an indicator is necessary, and the 1-of-$K$ coding scheme is used, where the indicator:

$$r_{nk}\in\{0,1\}\tag{3}$$

for $k=1,2,\cdots,K$ indexing the group and $n = 1,2,\cdots,N$ indexing the sample point; when $r_{nk}=1$ then $r_{nj}=0$ for all $j\neq k$.
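For concreteness, the 1-of-$K$ indicators can be stored as an $N\times K$ binary matrix. A minimal NumPy sketch (the array names here are illustrative, not from the text):

```python
import numpy as np

N, K = 6, 3                                  # 6 sample points, 3 groups
assignments = np.array([0, 2, 1, 0, 2, 2])   # group index of each point

# 1-of-K indicator matrix r, where r[n, k] = r_nk
r = np.zeros((N, K), dtype=int)
r[np.arange(N), assignments] = 1

print(r.sum(axis=1))  # every row sums to 1: each point belongs to exactly one group
```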

Objective Function

A loss function is a good way to measure the quality of a model during both training and testing. In a clustering task, a loss function in the supervised sense cannot be used, because there are no targets to compare against. However, we can build another function that plays the same role as a loss function and also encodes what we want.

According to the two considerations above, we build our objective function:

$$J=\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}||\boldsymbol{x}_n-\mu_k||^2\tag{4}$$

In this objective function, the distance is the Euclidean distance (however, other measures of similarity could also be used). The mission is then to minimize $J$ by finding appropriate $\{r_{nk}\}$ and $\{\mu_k\}$.
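Given prototypes and indicators, equation (4) can be evaluated directly. A minimal sketch (the function name `objective` is my own):

```python
import numpy as np

def objective(x, r, mu):
    """Equation (4): J = sum_n sum_k r_nk * ||x_n - mu_k||^2."""
    # squared Euclidean distances from every point to every prototype, shape (N, K)
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return (r * sq_dist).sum()
```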

K-Means Algorithm

Now, let's present the famous K-means algorithm. The method alternates between two steps:

  1. Minimising $J$ with respect to $r_{nk}$ while keeping $\mu_k$ fixed
  2. Minimising $J$ with respect to $\mu_k$ while keeping $r_{nk}$ fixed

In the first step, according to equation (4), the objective function is linear in $r_{nk}$, so there is a closed-form solution: we assign each point to its nearest prototype and set:

$$r_{nk}=\begin{cases} 1&\text{ if } k=\mathop{argmin}_{j}||x_n-\mu_j||^2\\ 0&\text{otherwise} \end{cases}\tag{5}$$
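Equation (5) is just a nearest-prototype assignment. A vectorized sketch of this step (the helper name `e_step` is mine, chosen to match the E/M naming used in the code further below):

```python
import numpy as np

def e_step(x, mu):
    """Equation (5): assign each point to its nearest prototype."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    nearest = np.argmin(sq_dist, axis=1)                            # (N,)
    r = np.zeros((x.shape[0], mu.shape[0]), dtype=int)
    r[np.arange(x.shape[0]), nearest] = 1
    return r
```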

In the second step, $r_{nk}$ is fixed and we minimise the objective function $J$ with respect to $\mu_k$. Since $J$ is quadratic in $\mu_k$, the minimum is at the stationary point where:

$$\frac{\partial J}{\partial \mu_k}=-2\sum_{n=1}^{N}r_{nk}(x_n-\mu_k)=0\tag{6}$$

and we get:

$$\mu_k = \frac{\sum_{n=1}^{N}r_{nk}x_n}{\sum_{n=1}^{N}r_{nk}}\tag{7}$$

Here $\sum_{n=1}^{N}r_{nk}$ is the total number of points from the sample $\{x_1,\cdots, x_N\}$ that belong to prototype $\mu_k$, i.e. to group $k$, at the current step. So $\mu_k$ is just the average of all the points in group $k$.
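In code, equation (7) is a per-group average. A sketch matching the E-step above; keeping the old prototype for an empty group is one common convention, not something the derivation prescribes:

```python
import numpy as np

def m_step(x, r, mu_old):
    """Equation (7): each prototype becomes the mean of its assigned points."""
    counts = r.sum(axis=0)      # number of points in each group, sum_n r_nk
    sums = r.T @ x              # (K, D) sum of the points in each group
    mu = mu_old.copy()
    nonempty = counts > 0
    mu[nonempty] = sums[nonempty] / counts[nonempty, None]
    return mu
```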

These two steps, given by equations (5) and (7), are repeated until $r_{nk}$ and $\mu_k$ no longer change.

The K-means algorithm is guaranteed to converge because neither step can increase the objective function $J$, and $J$ is bounded below with only finitely many possible assignments. Note, however, that it may converge to a local minimum; only when there is a single minimum, the global one, is the result guaranteed to be globally optimal.
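This monotone decrease can be checked empirically by recording $J$ after every pair of steps; a sketch using the illustrative helpers `e_step`, `m_step`, and `objective` defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
mu = x[rng.choice(len(x), size=3, replace=False)]  # random initial prototypes

history = []
for _ in range(20):
    r = e_step(x, mu)       # equation (5)
    mu = m_step(x, r, mu)   # equation (7)
    history.append(objective(x, r, mu))

# J never increases from one iteration to the next
assert all(b <= a + 1e-12 for a, b in zip(history, history[1:]))
```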

Input Data Preprocessing before K-means

Most algorithms need their input data to obey some rules. For the K-means algorithm, we rescale the input data to mean 0 and variance 1. This is usually done by

$$x_n^{(i)} = \frac{x_n^{(i)}- \bar{x}^{(i)}}{\delta^{(i)}}$$

where $x_n^{(i)}$ is the $i$th component of the $n$th data point $x_n$ from equation (1), and $\bar{x}^{(i)}$ and $\delta^{(i)}$ are the mean and standard deviation of the $i$th component.
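In NumPy this standardization is one line per statistic; a minimal sketch (the guard against a zero standard deviation is my addition):

```python
import numpy as np

def standardize(x):
    """Rescale each component of the data to mean 0 and variance 1."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant components unchanged (added guard)
    return (x - mean) / std
```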

Python code of K-means


```python
import numpy as np
import matplotlib.pyplot as plt


class K_Means():
    """
    K-means clustering; input data should be normalized: mean 0, variance 1
    """
    def clusturing(self, x, K):
        """
        :param x: inputs, an (N, D) array
        :param K: how many groups
        :return: prototypes (center of each group) and r_nk (which group the n-th point belongs to)
        """
        data_point_dimension = x.shape[1]
        data_point_size = x.shape[0]
        # initialize each prototype with a randomly chosen data point
        center_matrix = np.zeros((K, data_point_dimension))
        for i in range(len(center_matrix)):
            center_matrix[i] = x[np.random.randint(0, len(x))]

        center_matrix_last_time = np.zeros((K, data_point_dimension))
        cluster_for_each_point = np.zeros(data_point_size, dtype=np.int32)
        # -----------------------------------visualization-----------------------------------
        # this part can be deleted
        center_color = np.random.randint(0, 1000, (K, 3)) / 1000.
        plt.scatter(x[:, 0], x[:, 1], color='green', s=30, marker='o', alpha=0.3)
        for i in range(len(center_matrix)):
            plt.scatter(center_matrix[i][0], center_matrix[i][1], marker='x', s=65, color=center_color[i])
        plt.show()
        # -----------------------------------------------------------------------------------
        # repeat the two steps until the prototypes stop changing
        while not np.array_equal(center_matrix_last_time, center_matrix):
            # E step: assign each point to its nearest prototype, equation (5)
            for i in range(len(x)):
                distance_to_center = np.zeros(K)
                for k in range(K):
                    distance_to_center[k] = (center_matrix[k] - x[i]).dot(center_matrix[k] - x[i])
                cluster_for_each_point[i] = int(np.argmin(distance_to_center))
            # M step: recompute each prototype as the mean of its points, equation (7)
            number_of_point_in_k = np.zeros(K)
            center_matrix_last_time = center_matrix
            center_matrix = np.zeros((K, data_point_dimension))
            for i in range(len(x)):
                center_matrix[cluster_for_each_point[i]] += x[i]
                number_of_point_in_k[cluster_for_each_point[i]] += 1

            for i in range(len(center_matrix)):
                if number_of_point_in_k[i] != 0:
                    center_matrix[i] /= number_of_point_in_k[i]
                else:
                    # an empty group keeps its previous prototype
                    center_matrix[i] = center_matrix_last_time[i]
            # -----------------------------------visualization-----------------------------------
            # this part can be deleted
            print(center_matrix)
            plt.cla()
            for i in range(len(center_matrix)):
                plt.scatter(center_matrix[i][0], center_matrix[i][1], marker='x', s=65, color=center_color[i])
            for i in range(len(x)):
                plt.scatter(x[i][0], x[i][1], marker='o', s=30, color=center_color[cluster_for_each_point[i]], alpha=0.7)
            plt.show()
            # -----------------------------------------------------------------------------------
        return center_matrix, cluster_for_each_point
```
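A minimal usage sketch with synthetic data (the two blobs are made up for illustration; note that the class pops up matplotlib windows while running):

```python
import numpy as np

rng = np.random.default_rng(42)
# two synthetic blobs, then standardized as the docstring requires
x = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
x = (x - x.mean(axis=0)) / x.std(axis=0)

centers, labels = K_Means().clusturing(x, K=2)
print(centers)
```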

The entire project can be found at https://github.com/Tony-Tan/ML, and please give it a star.

Results during K-means

We use the tool https://github.com/Tony-Tan/2DRandomSampleGenerater to generate the input data:

There are two classes, one from the brown circle and one from the green circle. The K-means algorithm then initialises two prototypes, the centers of the groups, randomly:

The two crosses represent the centers. Then we iterate the two steps:

Iteration 1

Iteration 2

Iteration 3

Iteration 4

Iterations 3 and 4 are unchanged, in both the objective function value $J$ and the parameters, so the algorithm stops. Different initial centers may lead to different convergence speeds.

References


  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.