Data Science Crash Course 9/10: Dimensionality Reduction

Let’s learn Data Science in 2020

This is the 9th instalment of the Data Science Crash Course, and we’re going to talk about how to reduce the number of dimensions in our data so that it can be visualised and better understood.

Dimensionality Reduction in Python. Data Science Crash Course.

Principal Component Analysis

Imagine you want to plot your data, but it has too many dimensions to do that right away. This is where dimensionality reduction comes in. These methods map your data to a small number of dimensions that keep the most relevant information.

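The standard technique is PCA, Principal Component Analysis. It finds the directions along which the data varies the most (the principal components, which are eigenvectors of the data’s covariance matrix) and projects the data onto the few directions with the largest eigenvalues. Equivalently, these directions are the right singular vectors of the centred data matrix that belong to the largest singular values. So you can, for example, transform 4D data into 2D by projecting onto the 2D subspace spanned by the two leading principal components.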
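To make the projection idea concrete, here is a minimal NumPy sketch of PCA done by hand through the SVD. The random 6×4 array and all variable names are made up for illustration; it is a sketch of the idea, not a replacement for a library implementation.

```python
import numpy as np

# Hypothetical toy data: 6 samples with 4 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# PCA works on mean-centred data, so subtract the mean of every column.
X_centred = X - X.mean(axis=0)

# SVD of the centred data. The rows of Vt are the principal directions,
# ordered by decreasing singular value in S.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Keep the two leading directions and project the 4D data onto them.
components = Vt[:2]
X_2d = X_centred @ components.T

print(X_2d.shape)  # (6, 2)
```

Each row of `X_2d` is one of the original 4D points expressed in the coordinates of the two leading principal components, which is exactly what you would pass to a scatter plot.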

Let’s look at some sklearn code for a simple use of PCA with a NumPy array:

PCA with sklearn
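The original code was embedded as an image, so what follows is only a sketch of what such an example might look like. The array values are an assumption made up for illustration; `PCA`, `n_components`, `fit_transform` and `singular_values_` are the standard sklearn names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 4D data set: 6 samples, 4 features (values are made up).
X = np.array([
    [-1.0, -1.0, 0.5, 2.0],
    [-2.0, -1.0, 0.0, 1.5],
    [-3.0, -2.0, 1.0, 2.5],
    [1.0, 1.0, -0.5, -2.0],
    [2.0, 1.0, 0.0, -1.5],
    [3.0, 2.0, -1.0, -2.5],
])

# Ask for the two leading principal components.
pca = PCA(n_components=2)

# Fit the model and project the data from 4D down to 2D in one step.
X_2d = pca.fit_transform(X)

# Singular values that belong to the two selected components.
print(pca.singular_values_)

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute is a quick way to check how much information the 2D projection keeps.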

Here we just use PCA with 2 components and read off the singular values that belong to them. To really understand what’s going on, you should read more about the theory behind PCA.

Dimensionality Reduction in Data Science

Dimensionality reduction is important because you want to get rid of unnecessary data, such as irrelevant columns in a spreadsheet, so that you can apply data science algorithms as efficiently as possible. Fewer parameters are easier to control and manipulate.

There are many other methods for dimensionality reduction. More advanced ones include, for example, manifold learning, which is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high. t-SNE is one of the best-known manifold learning techniques, and you can read more about it in the sklearn documentation.
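As a rough illustration of t-SNE in sklearn, the sketch below embeds two made-up clusters of 10D points into 2D. The data, the `perplexity` value and the variable names are assumptions chosen for this toy example, not recommended settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: two clusters of 20 points each in 10 dimensions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(20, 10))
cluster_b = rng.normal(loc=5.0, scale=1.0, size=(20, 10))
X = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples (40 here).
tsne = TSNE(n_components=2, perplexity=10, random_state=0)

# t-SNE has no separate transform step, so fit and embed in one call.
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (40, 2)
```

Unlike PCA, t-SNE does not give you a reusable projection for new points, and the result depends on `perplexity` and the random seed, so it is worth trying a few settings before reading too much into a plot.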

All in all, how you reduce dimensions depends entirely on the data you’re working with. Sometimes it’ll be obvious from the start that you should ignore a couple of columns or rows to get better results. Sometimes it’ll be hard to see what is really important, and then you’ll have to try a bunch of dimensionality reduction algorithms to untangle the data and the dependencies between objects.

But that is why Data Science is practical — there’s no understanding without playing around with data. Good luck!


Written by

CEO Contentyze, the text editor 2.0, PhD in maths, Forbes 30 under 30 — → Sign up for free at https://app.contentyze.com
