K-means Clustering
Keywords: K-means
In its original form, the K-means algorithm may be one of the most accessible algorithms in machine learning, and many books and courses start with it. However, if we cast the task that K-means solves in a more mathematical form, some more interesting aspects emerge.
Clustering Problem[1]
The first thing we should do before introducing the algorithm is to make the task clear, and a precise mathematical form is usually the best way to do so.
Clustering is an unsupervised learning task, so there is no correct or incorrect solution. Clustering resembles classification at prediction time, since both produce discrete outputs. However, when training a classifier we always have a target corresponding to every input; clustering, on the contrary, has no targets at all. What we have is only
$\{x_1,\cdots, x_N\}\tag{1}$
where $x_i\in\mathbb{R}^D$ for $i=1,2,\cdots,N$. Our mission is to separate the dataset into $K$ groups, where $K$ is given before the task starts.
An intuitive clustering strategy is based on two considerations:
- the distance between data points in the same group should be as small as possible;
- the distance between data points in different groups should be as large as possible.
Based on these two points, some concepts can be formed. The first is how to represent a group. We take
$\boldsymbol{\mu}_k,\quad k\in\{1,2,\cdots, K\}\tag{2}$
as the prototype associated with the $k$th group. A group always contains several points, and a natural idea is to use the center of all the points belonging to a group as its prototype. To represent which group $\boldsymbol{x}_n$ in equation (1) belongs to, an indicator is necessary, and the 1-of-K coding scheme is used, where the indicator
$r_{nk}\in\{0,1\}\tag{3}$
for $k=1,2,\cdots,K$ indexing the group and $n = 1,2,\cdots,N$ indexing the sample point, and where $r_{nk}=1$ implies $r_{nj}=0$ for all $j\neq k$.
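For example, with $K=3$ groups, assigning the $n$th point to group 2 gives the following indicator vector (a small NumPy illustration, not code from the source):

```python
import numpy as np

K = 3   # number of groups
k = 1   # zero-based index of the assigned group (group 2)

# 1-of-K indicator vector r_n: exactly one entry is 1, the rest are 0
r_n = np.zeros(K)
r_n[k] = 1
print(r_n)  # [0. 1. 0.]
```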
Objective Function
A loss function is a good way to measure the quality of a model during both the training and testing stages. In a clustering task, however, a loss function cannot be used, because we have no targets to compute a loss against. Nevertheless, we can build another function that plays the same role as a loss function and also expresses exactly what we want.
According to the two base points above, we build our objective function:
$J=\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\lVert\boldsymbol{x}_n-\boldsymbol{\mu}_k\rVert^2\tag{4}$
In this objective function, distance is defined as the squared Euclidean distance (though other measures of similarity could also be used). The mission is then to minimize $J$ by finding suitable $\{r_{nk}\}$ and $\{\boldsymbol{\mu}_k\}$.
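The objective (4) is straightforward to evaluate; here is a minimal NumPy sketch (the function name `objective` and the array layout are my own choices, not from the source):

```python
import numpy as np

def objective(X, R, mu):
    """Compute J = sum_n sum_k r_nk * ||x_n - mu_k||^2.

    X  : (N, D) data points
    R  : (N, K) 1-of-K indicator matrix
    mu : (K, D) prototypes
    """
    # squared Euclidean distance from every point to every prototype
    dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    # r_nk selects, for each point, only the distance to its own prototype
    return (R * dist2).sum()
```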
K-means Algorithm
Now, let's present the famous K-means algorithm. The method iterates two steps:
- minimizing $J$ with respect to $r_{nk}$, keeping $\mu_k$ fixed;
- minimizing $J$ with respect to $\mu_k$, keeping $r_{nk}$ fixed.
In the first step, according to equation (4), the objective function is linear in $r_{nk}$, so there is a closed-form solution. We set:
$r_{nk}=\begin{cases} 1&\text{ if } k=\mathop{argmin}_{j}\lVert\boldsymbol{x}_n-\boldsymbol{\mu}_j\rVert^2\\ 0&\text{otherwise} \end{cases}\tag{5}$
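Equation (5) can be sketched as follows (the function name `assign_step` is my own, not from the source):

```python
import numpy as np

def assign_step(X, mu):
    """Step 1: set r_nk = 1 for the nearest prototype (equation (5))."""
    # squared distance from every point to every prototype, shape (N, K)
    dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)          # index of the closest prototype
    R = np.zeros_like(dist2)
    R[np.arange(X.shape[0]), nearest] = 1   # 1-of-K assignment
    return R
```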
In the second step, $r_{nk}$ is fixed and we minimize the objective function $J$ with respect to $\mu_k$. Since $J$ is quadratic in $\mu_k$, the minimum is at the stationary point where:
$\frac{\partial J}{\partial \boldsymbol{\mu}_k}=2\sum_{n=1}^{N}r_{nk}(\boldsymbol{x}_n-\boldsymbol{\mu}_k)=0\tag{6}$
and we get:
$\mu_k = \frac{\sum_{n=1}^{N}r_{nk}x_n}{\sum_{n=1}^{N}r_{nk}}\tag{7}$
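The update (7) in code, again as a sketch with assumed names (`update_step` is my own; the handling of empty groups is one possible choice, not the source's):

```python
import numpy as np

def update_step(X, R):
    """Step 2: move each prototype to the mean of its points (equation (7))."""
    counts = R.sum(axis=0)            # sum_n r_nk: number of points per group
    # guard against empty groups; an empty group's prototype ends up at 0 here
    safe = np.maximum(counts, 1)
    # R.T @ X computes sum_n r_nk * x_n for every group k at once
    return (R.T @ X) / safe[:, None]
```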
The denominator $\sum_{n=1}^{N}r_{nk}$ is the total number of points from the sample $\{x_1,\cdots, x_N\}$ that belong to prototype $\mu_k$, i.e. to group $k$, at the current step. So $\mu_k$ is just the average of all the points in group $k$.
These two steps, given by equations (5) and (7), are repeated until $r_{nk}$ and $\mu_k$ no longer change.
The K-means algorithm is guaranteed to converge, because every step reduces (or leaves unchanged) the objective function $J$. However, it may converge to a local minimum of $J$; only when there is a single minimum, the global minimum, is the algorithm guaranteed to find it.
Input Data Preprocessing before K-means
Most algorithms require their input data to obey some rules. For the K-means algorithm, we rescale the input data to mean 0 and variance 1. This is usually done by
$x_n^{(i)} = \frac{x_n^{(i)} - \bar{x}^{(i)}}{\sigma^{(i)}}$
where $x_n^{(i)}$ is the $i$th component of the $n$th data point $x_n$ from equation (1), and $\bar{x}^{(i)}$ and $\sigma^{(i)}$ are the mean and standard deviation of the $i$th component.
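This standardization can be written in a few lines of NumPy (the function name `standardize` and the zero-variance guard are my own choices):

```python
import numpy as np

def standardize(X):
    """Rescale every component to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # constant components are only centered, not scaled
    return (X - mean) / std
```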
Python Code of K-means
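The original listing did not survive conversion, so here is a minimal sketch of the full two-step loop described above (the function name `kmeans`, the random initialization, and the empty-group guard are my own choices, not necessarily the repository's code):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means: alternate equations (5) and (7) until the
    assignments r_nk stop changing."""
    rng = np.random.default_rng(seed)
    # initialize the prototypes with K distinct data points
    mu = X[rng.choice(X.shape[0], size=K, replace=False)]
    R = np.zeros((X.shape[0], K))
    for _ in range(max_iter):
        # step 1: assign each point to its nearest prototype (equation (5))
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        R_new = np.zeros_like(dist2)
        R_new[np.arange(X.shape[0]), dist2.argmin(axis=1)] = 1
        if np.array_equal(R_new, R):   # assignments unchanged: converged
            break
        R = R_new
        # step 2: move each prototype to the mean of its points (equation (7))
        counts = np.maximum(R.sum(axis=0), 1)   # guard against empty groups
        mu = (R.T @ X) / counts[:, None]
    return mu, R
```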
The entire project can be found at https://github.com/TonyTan/ML; please star it (^_^).
Results of K-means
We use the tool https://github.com/TonyTan/2DRandomSampleGenerater to generate the input data:
There are two classes, shown as brown circles and green circles. The K-means algorithm then initializes two prototypes, the centers of the groups, randomly:
The two crosses represent the centers.
Then we iterate the two steps:
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iterations 3 and 4 are unchanged, in both the objective function value $J$ and the parameters, so the algorithm stopped. Different initial centers may lead to different convergence speeds.
References
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. ↩︎