Clustering techniques fall into a group of undirected data mining tools. The goal of undirected data mining is to discover structure in the data as a whole. There is no target variable to be predicted, thus no distinction is being made between independent and dependent variables.
Clustering techniques are used for combining observed examples into clusters (groups) which satisfy two main criteria:
- each group or cluster is homogeneous; examples that belong to the same group are similar to each other.
- each group or cluster should be different from other clusters, that is, examples that belong to one cluster should be different from the examples of other clusters.
Depending on the clustering technique, clusters can be expressed in different ways:
- identified clusters may be exclusive, so that any example belongs to only one cluster.
- they may be overlapping; an example may belong to several clusters.
- they may be probabilistic, whereby an example belongs to each cluster with a certain probability.
- clusters might have hierarchical structure, having crude division of examples at highest level of hierarchy, which is then refined to sub-clusters at lower levels.
We will explain here the basics of the simplest of clustering methods: k-means algorithm. There are many other methods, like self-organizing maps (Kohonen networks), or probabilistic clustering methods (AutoClass algorithm), which are more sophisticated and proficient, but k-means algorithm seemed a best choice for the illustration of the main principles.
This algorithm has as an input a predefined number of clusters, that is the k from its name. Means stands for an average, an average location of all the members of a particular cluster. When dealing with clustering techniques, one has to adopt a notion of a high dimensional space, or space in which ortogonal dimensions are all attributes from the table of data we are analyzing. The value of each attribute of an example represents a distance of the example from the origin along the attribute axes. Of course, in order to use this geometry efficiently, the values in the data set must all be numeric (categorical data must be transformed into numeric ones!) and should be normalized in order to allow fair computation of the overall distances in a multi-attribute space.
K-means algorithm is a simple, iterative procedure, in which a crucial concept is the one of centroid. Centroid is an artificial point in the space of records which represents an average location of the particular cluster. The coordinates of this point are averages of attribute values of all examples that belong to the cluster. The steps of the K-means algorithm are given in Figure 1.
- Select randomly k points (it can be also examples) to be the
seeds for the centroids of k clusters.
- Assign each example to the centroid closest to the example,
forming in this way k exclusive clusters of examples.
- Calculate new centroids of the clusters. For that purpose average
all attribute values of the examples belonging to the same cluster (centroid).
- Check if the cluster centroids have changed their "coordinates".
If yes, start again form the step 2). If not, cluster detection is
finished and all examples have their cluster memberships defined.
Figure 1. K-means algorithm
Usually this iterative procedure of redefining centroids and reassigning the examples to clusters needs only a few iterations to converge.
Important issues in automatic cluster detection
Most of the issues related to automatic cluster detection are connected to the kinds of questions we want to be answered in the data mining project, or data preparation for their successful application.
Most clustering techniques use for the distance measure the Euclidean distance formula (square root of the sum of the squares of distances along each attribute axes).
Non-numeric variables must be transformed and scaled before the clustering can take place. Depending on this transformations, the categorical variables may dominate clustering results or they may be even completely ignored.
Choice of the right number of clusters
If the number of clusters k in the K-means method is not chosen so to match the natural structure of the data, the results will not be good. The proper way t alleviate this is to experiment with different values for k. In principle, the best k value will exhibit the smallest intra-cluster distances and largest inter-cluster distances. More sophisticated techniques measure this qualities automatically, and optimize the number of clusters in a separate loop (AutoClass).
Once the clusters are discovered they have to be interpreted in order to have some value for the data mining project. There are different ways to utilize clustering results:
- Cluster membership can be used as a label for the separate classification problem. Some descriptive data mining technique (like decision trees) can be used to find descriptions of clusters.
- Clusters can be visualized using 2D and 3D scatter graphs or some other visualization technique.
- Differences in attribute values among different clusters can be examined, one attribute at a time.
Clustering techniques are used when we expect natural groupings in examples of the data. Clusters should then represent groups of items (products, events, customers) that have a lot in common. Creating clusters prior to application of some other data mining technique (decision trees, neural networks) might reduce the complexity of the problem by dividing the space of examples. This space partitions can be mined separately and such two step procedure might exhibit improved results (descriptive or predictive) as compared to data mining without using clustering.
Links to online tutorials on clustering
Data Clustering and Its Applications
by Raza Ali, Usman Ghani , Aasim Saeed
© 2001 LIS - Rudjer Boskovic Institute
Last modified: July 31 2014 13:05:21.