Data Science Crash Course 7/10: Clustering & Unsupervised Learning
This is the 7th instalment of the Data Science Crash Course. So far we have covered supervised learning, i.e. what to do when your dataset comes with labels. In this part we'll look at datasets with no labels provided and talk about unsupervised learning.
What is Unsupervised Learning
Imagine we have raw data, such as social statistics collected for marketing. For example, you're trying to understand who has bought a MacBook from your e-commerce store and you'd like to find similar customers. Or you're selling tickets through an online platform and want to group your clients into categories so that you can send a coherent message to each group.
To cluster your data, i.e. group it into categories that are not given a priori, you need a clustering algorithm. Again, sklearn comes in handy. Let's review two basic methods with code examples from sklearn.
Clustering methods in Data Science
k-means is the most basic clustering technique. The "k" stands for the number of clusters you want. You choose this parameter yourself, but there are heuristics (see the elbow method, for example) that help you infer a good number of clusters automatically. You can use sklearn to group the famous Iris dataset into, say, 3 clusters.
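A minimal sketch of what this looks like in sklearn (3 clusters is the usual choice for Iris, which contains three species; note that the labels are ignored, we only use the numeric features):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data  # 150 flowers, 4 numeric features each; species labels ignored

# Fit k-means with k=3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster index (0, 1 or 2) for the first 10 flowers

# Elbow method sketch: inertia (within-cluster sum of squared distances)
# drops as k grows; you look for the "elbow" where the drop levels off.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
print(inertias)
```

Here `kmeans.labels_` assigns every row to one of the 3 clusters, and the inertia list is what you would plot to eyeball the elbow.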
If you want a simpler k-means use case, you can cluster points on a plane into 2 groups.
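A minimal version of that use case, using synthetic points (here `make_blobs` is just a convenient stand-in for your own (x, y) data):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Two blobs of points on a plane; in practice these would be your own (x, y) pairs
points, _ = make_blobs(n_samples=100, centers=2, random_state=42)

# fit_predict fits the model and returns the cluster index for each point
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(points)
print(labels)  # 0 or 1 for each point
```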
The key to k-means and other clustering techniques is having a metric: a well-defined distance that measures how similar two objects in our dataset are. For most practical data coming from spreadsheets the metric is easy to define: since the data is purely numeric, we just take the usual Euclidean distance between vectors. Of course you can make it more involved, especially if your data is noisy or you are trying to extract something very implicit, but at least you don't have to think about, for example, how to embed a word into a vector space.
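Concretely, for numeric spreadsheet-like rows the metric is just the Euclidean distance between feature vectors. A tiny sketch, with made-up customer features:

```python
import numpy as np

# Two hypothetical customers described by numeric features
# (age, yearly spend in kUSD, number of visits)
a = np.array([35.0, 2.4, 12.0])
b = np.array([38.0, 2.4, 16.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.linalg.norm(a - b)
print(distance)  # 5.0, since sqrt(3**2 + 0**2 + 4**2) = 5
```

This is exactly the distance k-means minimises within each cluster; in practice you would also scale the features first so no single column dominates.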
Another data clustering algorithm is Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short. Following sklearn: "The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples)." This is also a standard technique; it might sound complicated at first, but the principle is easy: clusters are dense regions of points separated by sparse regions.
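To see the "any shape" claim in action, here is a sketch on sklearn's two-moons toy data, where the clusters are non-convex and k-means would cut through them (the `eps` and `min_samples` values below are just reasonable choices for this dataset):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-circles: non-convex clusters k-means cannot separate
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps: neighbourhood radius; min_samples: neighbours needed for a "core sample"
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 2
```

Note that you never tell DBSCAN how many clusters to find; the number falls out of the density parameters.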
If you want to see DBSCAN implemented on a real example, have a look here for the Credit Card dataset and a short tutorial.
Data Science is practical
The best way to learn is practice, so I highly recommend opening your Jupyter Notebook now, going on Kaggle and searching for a dataset you can use in experiments. The examples above feature great datasets:
- Credit Card data
- Iris dataset
- Points on a plane
and you can find plenty of other datasets online, ready to use — both cleaned and raw.
Clustering often appears in the real world, and you'll certainly use it if you work for a larger organization like a bank or an insurance company. It's natural to try to group unlabeled data into categories that can be explained. And once you have built categories by clustering, you are back to a classification problem: as new data arrives, you try to assign new objects to the already existing groups that came from clustering.
In the end, Data Science is largely about classifying, clustering and extracting information from data. It just takes time to master.
If you prefer a video version of this lecture, you can watch it here: