Microsoft Excel is omni present. It is also an excellent vehicle for implementing many algorithms in their basic form to gain a better understanding of the underlying concepts. In my previous post, I had provided an Excel implementation of a classification problem to help understand naive Bayes classifier. In this post I want to demonstrate K-means clustering via an Excel implementation.Clustering implies partitioning data into meaningful groups such that data items within each group are similar but are dissimilar across different groups. One of the most widely used clustering procedure is the K-means clustering algorithm, which seeks to group data into K clusters. The K-means algorithm is meant for situations where data items are represented as vectors of continuous attributes, i.e. numeric attributes, and each cluster is represented by the mean/centroid of respective cluster members. The similarity between different data items in such a case is measured by the Euclidean distance. The steps of the K-means algorithm are:Step 1: Randomly assign every data item to one of the K clusters. (K is a user specified parameter)
When building a classifier, we assume a large enough training data set with labels is available. This situation is what we call as supervised learning. In a real world setting, such training examples with labels need to be acquired. In any application domain where labeling requires domain expertise such as in medicine, gathering a large training set with labels is an expensive and time consuming task. In such cases, it is not uncommon to use a small set of correctly labeled examples to label the rest of training examples. This type of learning is referred as semi-supervised learning and it falls somewhere between supervised and unsupervised learning. Often the term semi-supervised classification is used to describe the process of labeling training examples using a small set of labeled examples to differentiate from semi-supervised clustering. In semi-supervised clustering, the goal is to group a given set of examples into different clusters with the condition that certain examples must be clustered together and certain examples must be put in different clusters. In other words, some kind of constraints are imposed on resulting clusters in terms of cluster memberships of certain specified examples. In this blog post, I am going to illustrate semi-supervised classification leaving semi-supervised clustering for another post. When we have a small set of labeled examples and we want to rely on them to label a much larger set of unlabeled training examples, we need to make some assumptions. For example, we might make an assumption that training examples close to each other are likely to have similar class labels, an assumption made when applying k-nearest neighbor classification. Instead one might assume classes to have Gaussian distribution and we may try to iteratively find the distribution parameters. We must remember that our result will be as good as our assumptions are. Label Propagation Algorithm (LPA) One of the semi-supervised classification method is label propagation that I will explain here. This method is based on the assumption that examples near each other are likely to have similar class labels. The basic idea of this method is to consider all examples, labeled and unlabeled, as interconnected nodes in a network. Each node in the network tries to propagate its label to other nodes. How much of a node’s label influences other nodes is determined by their respective closeness or proximity. We will work through a series of steps to illustrate the working of the label propagation algorithm. Let us consider the following nine training examples, each with two-features:
Clustering is the process of grouping a collection of examples, i.e. data vectors, without class labels into distinct groups such that examples grouped together are similar to each other while those in different groups are different from each other. This process of finding patterns in the data via grouping is also known as unsupervised learning. Since the need for clustering arises in many different disciplines, a large number of clustering algorithms are available to us for use. K-means clustering algorithm is the most widely used clustering procedure. Basically, the algorithm consists of initialing K cluster centers at random. All the example vectors are then assigned to their respective closest centers. Next, the centers are updated and the examples are reassigned to closest centers. This process of updating cluster centers and reassigning examples continues till the cluster centers stop changing. You can see a simple K-means demo via Excel in one of my earlier blog posts.