K-Means Clustering Simplified

K-means is a very common technique in machine learning where you take a bunch of data and find interesting clusters of things, based purely on the attributes of the data itself.

It sounds fancy, but it’s actually pretty simple. All we do in K-means clustering is try to split our data into K groups; that’s where the K comes from: it’s how many different groups you’re trying to split your data into.

Definition

An unsupervised learning technique where you have a collection of stuff that you want to group together into various clusters, based purely on the attributes of the data itself. It might be movie genres or demographics of people, for example.

Explanation

The K-means algorithm creates clusters by finding K centroids. What group a given data point belongs to is defined by which of these centroid points it’s closest to in your scatter plot.

Imagine an example of K-means clustering with K of three (3): the squares represent data points on a scatter plot, and the circles represent the centroids that the K-means clustering algorithm came up with. Each point is assigned to a cluster based on which centroid it’s closest to.
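
To make that “closest centroid” step concrete, here is a minimal NumPy sketch of the assignment. The points and centroids arrays are made-up example values, not anything from a real dataset:

import numpy as np

# Hypothetical example: six 2-D data points and two centroids
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                   [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
centroids = np.array([[1.0, 1.0], [7.0, 9.0]])

# Distance from every point to every centroid, shape (6, 2)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point belongs to the centroid it is closest to
labels = distances.argmin(axis=1)
print(labels)  # [0 0 1 1 0 1]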

K-Means Algorithm Steps

  1. Choose the number of clusters k.
  2. Select k random points from the data as the initial centroids.
  3. Assign every point to the closest cluster centroid.
  4. Recompute the centroids of the newly formed clusters.
  5. Repeat steps 3 and 4 until the centroid locations stop changing.
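
If you’d like to see those five steps as code, here is a minimal from-scratch sketch (this is Lloyd’s algorithm, the classic form of K-means). It is for illustration only; in practice you would use scikit-learn’s KMeans as shown later in this post:

import numpy as np

def kmeans(X, k, max_iters=100, seed=10):
    rng = np.random.default_rng(seed)

    # Steps 1-2: given k, pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: recompute each centroid as the mean of its assigned points
        # (a robust version would also handle clusters that end up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

        # Step 5: stop once the centroid locations no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids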

When to Use K-Means Clustering

  • When the data sets are distinct or well separated from each other in a linear fashion.
  • When the number of cluster centers is known up front, because the data has a well-defined list of types.

K-Means Pros:

  1. Simple: It is easy to implement K-means and identify unknown groups of data from complex data sets.
  2. Flexible: If something looks wrong, it is easy to adjust the number of clusters or the cluster segments and rerun the algorithm.
  3. Accuracy: When the data fits its assumptions, K-means produces tight, accurate clusters for the problem domain at hand.
  4. Suitable for large datasets: K-means handles large numbers of data points well and computes results quickly.
  5. Time complexity: K-means is linear in the number of data objects, so execution time grows only linearly with the size of the dataset.
  6. Low computation cost: Compared to other clustering methods, K-means is fast and efficient.

K-Means Cons:

  1. User-defined clusters: K-means does not discover the optimal set of clusters on its own; for effective results, you have to decide on the number of clusters beforehand.
  2. Order of values affects results: The initial placement of the centroids, and hence the order in which the data is processed, can change the final clusters.
  3. Heavy computation on large datasets: Visualization techniques such as building a dendrogram (as hierarchical clustering does) can overwhelm a computer with computational load on a large dataset.
  4. Limited to numerical data: The K-means algorithm can only be applied to numerical data.
  5. Assumes spherical clusters: K-means assumes the clusters are spherical and contain roughly equal numbers of observations.
  6. Prediction problems: It is difficult to predict k, the number of clusters, and it is also difficult to compare the quality of the produced clusters; the elbow-method sketch after this list is one common heuristic for choosing k.
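
On that last point, a common heuristic for choosing k is the elbow method: run K-means for a range of k values, plot the total within-cluster sum of squares (exposed by scikit-learn as inertia_), and look for the bend where adding more clusters stops helping much. A minimal sketch, assuming data is the scaled dataset from the Python example later in this post:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
ks = range(1, 10)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10).fit(data)
    inertias.append(model.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()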

Example of K-Means in Python

Let’s create some random data to try to cluster. We’ll type this little createClusteredData function in Python; it starts off with a consistent random seed so you’ll get the same result every time.

from numpy import random, array

# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)  # consistent seed so results are reproducible
    pointsPerCluster = float(N) / k
    X = []
    for i in range(k):
        # Pick a random center for this cluster's income and age
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        # Scatter points normally around that center
        for j in range(int(pointsPerCluster)):
            X.append([random.normal(incomeCentroid, 10000.0),
                      random.normal(ageCentroid, 2.0)])
    X = array(X)
    return X

The function takes N and k: we want to create N people split across k clusters. So it first figures out how many points per cluster that works out to, and then builds up the list X, which starts off empty.
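
As a quick sanity check (just an illustrative call, using the function above), you can confirm the shape of what comes back:

points = createClusteredData(100, 5)
print(points.shape)  # (100, 2): one [income, age] row per person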

We’ll use k-means to rediscover these clusters in unsupervised learning:

%matplotlib inline

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

data = createClusteredData(100, 5)
model = KMeans(n_clusters=5)

# Note I'm scaling the data to normalize it! Important for good results.

model = model.fit(scale(data))

# We can look at the clusters each data point was assigned to

print(model.labels_)

# And we'll visualize it:

plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))
plt.show()

Output:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

That’s all there is to K-means clustering; it’s just that simple. You can use scikit-learn’s KMeans from sklearn.cluster.

Before that, make sure you scale and normalize the data. You want to make sure the features you’re running K-means on are comparable to each other, and the scale function will do that for you. Those are the main things for K-means clustering.
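
To see what scale actually does, you can check that each feature column comes out with roughly zero mean and unit variance. A quick sketch, reusing the data array from the example above:

from sklearn.preprocessing import scale

scaled = scale(data)
print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]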

So if you have data that is unclassified and you don’t really have the right answers ahead of time, K-means is a good way to naturally find interesting groupings in your data, and maybe that can give you some insight into what the data is.

Thank you for reading this post. I hope you enjoyed it and learned something new today. Feel free to contact me through my blog if you have questions; I will be more than happy to help.

Stay safe and Happy learning!