Sit back
Let's learn
About

K-Means Demonstration using Excel

Published on:
April 10, 2023
Published by:
Professor Ishwar Sethi
This post was originally published by one of our partners on:
https://iksinc.tech/k-means-demonstration-using-excel/

Microsoft Excel is omni present. It is also an excellent vehicle for implementing many algorithms in their basic form to gain a better understanding of the underlying concepts. In my previous post, I had provided an Excel implementation of a classification problem to help understand naive Bayes classifier. In this post I want to demonstrate K-means clustering via an Excel implementation.

Clustering implies partitioning data into meaningful groups such that data items within each group are similar but are dissimilar across different groups. One of the most widely used clustering procedure is the K-means clustering algorithm, which seeks to group data into K clusters. The K-means algorithm is meant for situations where data items are represented as vectors of continuous attributes, i.e. numeric attributes, and each cluster is represented by the mean/centroid of respective cluster members. The similarity between different data items in such a case is measured by the Euclidean distance. The steps of the K-means algorithm are:

Step 1: Randomly assign every data item to one of the K clusters. (K is a user specified parameter)

Step2: Calculate center for each cluster by taking mean of its corresponding data item vectors.

Step 3: Using the cluster centers of Step 2, Calculate new membership for each data item by locating its closest cluster center. Go back to Step 2.

Termination: The algorithm terminates when the cluster centers do not change from one iteration to the next.

The K-means clustering process can be viewed as an optimization process where the objective is to minimize the sum of the squared error (SSE), defined as follows:

S S E = i = 1 K x C i | | m i x | | 2

where, the center of the i-th cluster with ni data items, is defined by

m i = 1 n i x C i x

The SSE criteria means that grouping of data items should be as compact as possible. In other words, data items in a cluster should reside close to the cluster center.

The initial clustering assignment has an important effect on the final resulting clusters. This is generally true when the clusters are not well separated. This means that some initializations of cluster centers can result in algorithm terminating in local minima. Thus, it is a good idea to run K-means many times with different initial assignments. It should be obvious that as K goes up, the SSE goes down. The SSE goes to zero when the number of clusters equals the number of data items, i.e. each data item is its own cluster. Since the number of clusters is generally not known, it is common to run K-means for a range of K values and plot SSE against different K values. A knee point in such a plot generally indicates the value of K that should be selected.

Excel Illustration of K-means

I will illustrate K-means using a small set, 10 to be exact, of datapoints in a two-dimensional space for easy visualization. To run through successive iterations, Excel will be used in the manual calculation mode. Lets set the maximum number of iterations to 5 by entering this value in a cell, naming it as max_count. Next, we set up some indicators to start the clustering process and to step through it. For this, lets name a cell iteration_count and enter in it the formula =IF(indicator_2,iteration_count+1,0), where indicator_2 is a cell that takes on logical True/False values. The formula in cell iteration_count behaves as a counter which gets initialized by the status of cell indicator_2. We use another cell named indicator_1 to limit the iterations to the number specified in cell max_count, and initialize the clustering process. We do this by entering in cell indicator_1 the formula =IF(iteration_count<max_count, TRUE,FALSE) and linking cell indicator_2 to indicator_1 by using the formula =indicator_1 in cell indicator_2. At this stage, our worksheet in formula view looks as shown below.

Screen Shot 2017-10-03 at 11.03.40 AM

The next step is to set up a formula to assign cluster membership to each data point. We set it up in Column C next to data values. The formula is =IF(indicator_1,label,RANDBETWEEN(1,3)). It assigns randomly a membership number between 1 to 3 assuming we are going to create three clusters. The random assignment is done only at the beginning of the clustering process as determined by indicator_1; otherwise the cluster label is picked up by the nearest center, calculated separately and shown at cell location defined as label. A plot of initial assignments is shown below. I have also applied conditional formatting to cluster membership column to show memberships in different colors.

Screen Shot 2017-10-03 at 12.52.57 PM

Next, we need to set up calculations for computing cluster centers based upon assigned data points. We do this at locations, say A19:B21, using the following formulas for cluster number 1 and similar formulas for other two clusters. In these formulas, cmembership refers to cell range C2:C11 where we are showing cluster membership.

=SUMIF(cmembership,1,feat1)/COUNTIF(cmembership,1), and =SUMIF(cmembership,1,feat2)/COUNTIF(cmembership,1)

To find the closest cluster centers for data points, we set up distance calculations and finding the cluster label of the closest center. A formula view of the appropriate part of the worksheet is shown below.

Screen Shot 2017-10-03 at 5.14.43 PM

Once we have cluster centers, we can plot them along with data points and their respective cluster labels. At this stage, our worksheet is ready to perform clustering as we manually click through successive calculations. A series of plots shown below illustrate the K-means clustering process showing how cluster centers move and assignments of data points change through the process. In each plot, the numeric label next to a point indicates the current cluster membership of that point. These memberships determine the cluster centers shown in the plots. The three panels of plots correspond to three different runs of the K-means algorithm with different initializations, and are arranged from left to right as we step through the algorithm. Since the three groups off data points are well separated, the clustering process yields identical results through three different initializations; however, the number of iterations required to reach stable groupings are different.

combinedrun

Although I demonstrated the algorithm for three clusters with a small set of data points, it is pretty straightforward to extend this demo to more points and a different number of clusters. You can download the KmeansExcel demo here if you like.

Check Out These Brilliant Topics
Understanding Tensors and Tensor Decompositions: Part 3
Published on:
April 6, 2023
Published by:
Professor Ishwar Sethi

This is my third post on tensors and tensor decompositions. The first post on this topic primarily introduced tensors and some related terminology. The second post was meant to illustrate a particularly simple tensor decomposition method, called the CP decomposition. In this post, I will describe another tensor decomposition method, known as the Tucker decomposition. While the CP decomposition’s chief merit is its simplicity, it is limited in its approximation capability and it requires the same number of components in each mode. The Tucker decomposition, on the other hand, is extremely efficient in terms of approximation and allows different number of components in different modes. Before going any further, lets look at factor matrices and n-mode product of a tensor and a matrix. Factor Matrices Recall the CP decomposition of an order three tensor expressed as X≈∑r=1Rar∘br∘cr, where (∘ ) represents the outer product. We can also represent this decomposition in terms of organizing the vectors, ar,br,cr,r=1,⋯R , into three matrices, A, B, and C, as A=[a1a2⋯aR], B=[b1b2⋯bR],and C=[c1c2⋯cR] The CP decomposition is then expressed as X≈[Λ;A,B,C], where Λ is a super-diagonal tensor with all zero entries except the diagonal elements. The matrices A, B, and C are called the factor matrices. Next, lets try to understand the n-mode product. Multiplying a Tensor and a Matrix How do you multiply a tensor and a matrix? The answer is via n-mode product. The n-mode product of a tensor X∈RI1×I2×⋯IN with a matrix U∈RJ1×In is a tensor of size I1×I2×⋯In−1×J×In+1×⋯×IN, and is denoted by X×nU . The product is calculated by multiplying each mode-n fibre by the U matrix. Lets look at an example to better understand the n-mode product. Lets consider a 2x2x3 tensor whose frontal slices are:

Want Us to Send You New Posts?

We add Value. Not spam.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Kevadiya INC. © 2023 All Rights Reserved.