Data Science Crash Course 5/10: Getting Data

Learn Data Science in 2020

This is the fifth instalment of the Data Science Crash Course. In this lecture I will talk about getting data from different sources. Data is crucial for doing Data Science, and often we have to clean it before we can start using it. Let’s discuss how to do it.


Scraping the web for data science

You can download useful data from the web. We’re living in a digital world, where virtually anything is available online. Much of that data can be collected from websites by scraping. Getting data this way is very satisfying but takes time, because you have to clean up the HTML along the way.

Financial data is especially worth getting. You can analyse it with tools like Pandas DataFrames and NumPy arrays, which we discussed in the last post.
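As a small illustration of that kind of analysis, here is a sketch that computes daily returns from closing prices. The prices and column names are made up for the example; they are not real market data.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, invented for this example (not real market data).
prices = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
        ),
        "close": [100.0, 102.0, 101.0, 104.03],
    }
).set_index("date")

# Daily return: percentage change from the previous close.
prices["return"] = prices["close"].pct_change()

# The first row has no previous day, so drop it before summarising with NumPy.
returns = prices["return"].dropna().to_numpy()

print(prices)
print("mean daily return:", np.mean(returns))
print("standard deviation:", np.std(returns))
```

Once scraped numbers are in a DataFrame like this, the rest of the analysis works the same way as with any other dataset.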

Let’s say you want to get information from Nasdaq. With a GET request you can fetch the page’s HTML right away:

[Image: Requests Python for Data Science]
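In code, a GET request with the Requests library could look roughly like the sketch below. The helper name fetch_html, the timeout value and the optional get parameter (there only so the function can be tried without a network connection) are my own choices for this example.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page with a GET request and return its HTML as text.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised offline with a stand-in.
    """
    response = get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

You would call it as fetch_html("https://www.nasdaq.com"), assuming that is the page you are after. Some sites refuse or stall requests that do not look like they come from a browser, so you may need to experiment with request headers.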

However, as you can see, this is far from readable. The next step is to use BeautifulSoup to make it pretty:

[Image: BeautifulSoup for Data Science]
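Here is a minimal sketch of that step. To keep it self-contained, it parses a tiny hand-written HTML string (hypothetical content) instead of a real downloaded page, and it uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (hypothetical content).
raw_html = (
    "<html><head><title>Quotes</title></head>"
    "<body><h1>Market</h1><p>Hello</p></body></html>"
)

# Parse the HTML with the parser that ships with Python.
soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns the same document with one tag per line, indented.
print(soup.prettify())

# Once parsed, individual elements are easy to reach.
print(soup.title.string)
```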

This is more readable, but you still need to put in the work to extract information from it. It’s a great way to start, though.
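To give an idea of what that work looks like, the sketch below pulls the rows of an HTML table into a list of dictionaries. The table, its id and the symbols in it are invented for the example; a real page will have its own structure that you need to inspect first.

```python
from bs4 import BeautifulSoup

# Hypothetical table, standing in for a fragment of a scraped page.
html = """
<table id="quotes">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.50</td></tr>
  <tr><td>BBB</td><td>20.25</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="quotes")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) != 2:
        continue  # the header row has <th> cells, so it is skipped here
    rows.append(
        {
            "symbol": cells[0].get_text(strip=True),
            "price": float(cells[1].get_text(strip=True)),
        }
    )

print(rows)
```

A list of dictionaries like this can be passed straight to pandas.DataFrame for further analysis.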

Have a look at the BeautifulSoup documentation for more details.

Scraping is great because you can learn Data Science by playing around with data about your hobby. Say you like video games and want to analyse game statistics: scraping is often the best way to get them. You’ll also learn more about the particular websites that interest you.

Datasets available on the Internet

Another way to get data is to use already prepared datasets. There are countless examples, and I’ll just discuss some major ones here.

Project Gutenberg is a great source for downloading books. You can access older books whose copyright has expired. Think Shakespeare.
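Plain-text books from Project Gutenberg usually come wrapped in a licence header and footer, so a bit of cleaning is needed before counting words. The sketch below assumes the text marks the body with lines starting with "*** START OF" and "*** END OF"; the exact wording varies between files, and the sample text here is made up.

```python
import re
from collections import Counter

# Assumed marker prefixes; check the file you actually downloaded.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"


def strip_gutenberg_boilerplate(text):
    """Return only the lines between the start and end markers."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(START_MARKER):
            start = i + 1
        elif line.startswith(END_MARKER):
            end = i
            break
    return "\n".join(lines[start:end]).strip()


def top_words(text, n=3):
    """Count lowercase words and return the n most common ones."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Made-up sample in the shape of a downloaded book.
sample = """Some licence header text
*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
To be, or not to be, that is the question.
*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
Some licence footer text"""

body = strip_gutenberg_boilerplate(sample)
print(body)
print(top_words(body, n=2))
```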

The Twitter API is another fantastic source. Register a Developer account and then you can start playing around with automation. For example, you can pull data about a single hashtag and see how people are responding to recent events. This is especially great for research in social sciences or marketing.
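As a rough sketch of what a hashtag query involves, the code below only builds the URL and headers for a search request; it does not send anything. It assumes the v2 recent-search endpoint with its query and max_results parameters and a bearer token from your Developer account. The helper name and the placeholder token are mine, and the endpoint and access rules may have changed, so check the current developer documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumed endpoint (Twitter API v2 recent search); verify against current docs.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_hashtag_search(hashtag, bearer_token, max_results=10):
    """Build the URL and headers for a hashtag search, without sending it."""
    params = {"query": f"#{hashtag}", "max_results": max_results}
    url = f"{SEARCH_URL}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {bearer_token}"}
    return url, headers


# "YOUR_BEARER_TOKEN" is a placeholder, not a working credential.
url, headers = build_hashtag_search("datascience", "YOUR_BEARER_TOKEN")
print(url)
```

The returned url and headers can then be passed to a GET request, for example with the Requests library.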

Kaggle is a fantastic resource for Data Science competitions, but also for ready-to-use datasets which you can download directly, for free, from their website. Major companies often announce their competitions there, so you’ll also see what’s of interest to the wider Data Science community.
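Many downloadable datasets come as CSV files, which Pandas can read in one line. The sketch below uses a small in-memory CSV with invented contents as a stand-in for a downloaded file; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV file (hypothetical contents).
csv_text = """name,genre,score
Game A,RPG,8.5
Game B,Puzzle,7.0
Game C,RPG,9.0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# First look at the data: a few rows and the inferred column types.
print(df.head())
print(df.dtypes)

# A simple summary: average score per genre.
mean_by_genre = df.groupby("genre")["score"].mean()
print(mean_by_genre)
```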

Established datasets for Data Science

On top of that, there are a number of datasets which are used for benchmarking machine learning models. I won’t go into details here, but I want to point you to a few which are simply worth knowing:

  • MNIST: a dataset of handwritten digits
  • CIFAR-10: 60,000 colour images in 10 classes
  • ImageNet: an image database organized according to the WordNet hierarchy
  • IMDB Reviews: a dataset of movie reviews from IMDB
  • Wikipedia: a great source of text
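To get a feel for this kind of data without downloading anything, you can start with the small digits dataset that ships with scikit-learn. Note that this is not MNIST itself but a smaller set of 8x8 handwritten digit images, which I am using here as a stand-in.

```python
from sklearn.datasets import load_digits

# Bundled with scikit-learn: 8x8 handwritten digits (a stand-in, not MNIST).
digits = load_digits()

# images holds one 8x8 array per sample; data holds the same pixels flattened.
print(digits.images.shape)
print(digits.data.shape)

# The labels are the digits 0 to 9.
labels = sorted(set(digits.target.tolist()))
print(labels)
```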

That’s all for this lecture. Now it’s time to open your Jupyter Notebooks and play with your favourite dataset.

Practice is always the best way to learn Data Science.

If you prefer to watch this lecture on YouTube, have a look here:

Data Science Crash Course

Written by

CEO Contentyze, the text editor 2.0, PhD in maths, Forbes 30 under 30 — → Sign up for free at https://app.contentyze.com
