Data Science Crash Course 9/10: Dimensionality Reduction
This is the 9th instalment of the Data Science Crash Course, and we’re going to talk about how to reduce the number of dimensions in our data so that it can be visualised and better understood.
Principal Component Analysis
Imagine you want to plot your data, but it has too many dimensions to do that right away. This is where dimensionality reduction comes in: these methods let you look only at the dimensions that carry the most relevant information.
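The standard technique is PCA, Principal Component Analysis, which finds the principal directions of the data (the eigenvectors of its covariance matrix, or equivalently the right singular vectors of the centred data matrix) and projects onto those that capture the most variance. So you can transform, for example, 4D data into 2D by projecting onto the 2D space spanned by the two directions with the largest singular values.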
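To make the link between singular values and the projection concrete, here is a minimal sketch of PCA done by hand with NumPy. The array `X` and its random values are made up purely for illustration; only the steps (centre, decompose, project) matter.

```python
import numpy as np

# Hypothetical toy data: 6 samples with 4 features each (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# PCA works on centred data, so subtract the mean of every column.
X_centred = X - X.mean(axis=0)

# Thin SVD: the rows of Vt are the principal directions (right singular
# vectors) and S holds the singular values in descending order.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Keep the two directions with the largest singular values and project onto them.
X_2d = X_centred @ Vt[:2].T

# The eigenvalues of the covariance matrix are S**2 / (n_samples - 1).
explained_variance = S**2 / (X.shape[0] - 1)

print(X_2d.shape)
print(explained_variance)
```

The last line shows why people talk about eigenvalues and singular values almost interchangeably here: one is just a rescaled square of the other.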
Let’s look at some sklearn code for a simple use of PCA with a NumPy array:
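A minimal sketch of such a use could look like the following. The array `X` is a small made-up example, not real data, and the attribute names are those of scikit-learn's `PCA` class.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2D toy data: 6 samples, 2 features (values are made up).
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Ask for 2 principal components and fit the model to the data.
pca = PCA(n_components=2)
pca.fit(X)

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)

# Singular values corresponding to each selected component.
print(pca.singular_values_)
```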
Here we just use PCA with 2 components and we get singular values for our space. To really understand what’s going on, you should read a couple of sources:
- Have a look here for a thorough discussion of singular values and the theory behind PCA.
- Another great resource is this text on how to apply PCA to the Iris dataset.
- You can use PCA to reduce your dimensions by projecting onto the directions with the largest singular values. Have a look here for how to do it with wine data.
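As a rough idea of what such a reduction looks like in practice, here is a sketch that takes the 4D Iris measurements down to 2D. It uses the copy of the dataset bundled with scikit-learn; standardising the features first is a common choice assumed here, not something the resources above are known to require.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris: 150 flowers, 4 measurements each.
iris = load_iris()
X = iris.data

# Put all features on the same scale so none dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4D data onto its 2 leading principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X.shape, "->", X_2d.shape)
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can go straight into a scatter plot, coloured by `iris.target`, to see how well the species separate in 2D.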
Dimensionality Reduction in Data Science
Dimensionality reduction is important because you want to get rid of unnecessary data, such as irrelevant columns in spreadsheets, in order to apply data science algorithms in the most efficient way. Fewer parameters are easier to control and manipulate.
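The simplest case of an irrelevant column is one that never changes. As a sketch, assuming a small made-up array, scikit-learn's `VarianceThreshold` drops such columns; it is just one option, picked here for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: columns 0 and 2 are constant, so they carry no information.
X = np.array([
    [0.0, 2.1, 1.0, 5.0],
    [0.0, 1.9, 1.0, 3.0],
    [0.0, 2.5, 1.0, 4.0],
    [0.0, 2.2, 1.0, 6.0],
])

# Keep only the columns whose variance is greater than the threshold.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Indices of the columns that survived.
kept = selector.get_support(indices=True)

print(kept)
print(X_reduced.shape)
```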
There are many other methods for dimensionality reduction. More advanced ones include, for example, manifold learning, an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high. t-SNE is one of the best-known manifold learning techniques, and you can read more about it in the sklearn documentation.
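Here is a minimal t-SNE sketch with sklearn. The choice of the bundled digits dataset, the 300-sample subset and the parameter values are assumptions made to keep the example small and quick, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Handwritten digits: each 8x8 image is flattened into 64 features.
digits = load_digits()
X = digits.data[:300]  # a small subset keeps the run short

# Embed the 64D points in 2D; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_embedded = tsne.fit_transform(X)

print(X.shape, "->", X_embedded.shape)
```

Unlike `PCA`, sklearn's `TSNE` only offers `fit_transform`: it has no separate `transform` for new points, so it is mainly a tool for visualising the data you already have.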
All in all, how you reduce dimensions depends entirely on the data you’re working with. Sometimes it’ll be obvious from the start that you should ignore a couple of columns or rows to get better results. Sometimes it’ll be hard to see what is really important, and then you’ll have to try a bunch of dimensionality reduction algorithms to untangle the data and the dependencies between objects.
But that is why Data Science is practical: there’s no understanding without playing around with data. Good luck!