Data Science Crash Course 5/10: Getting Data
This is the fifth instalment of Data Science Crash Course. In this lecture I will talk about getting data from different sources. Data is crucial for doing Data Science, and often we have to clean it before we can start using it. Let’s discuss how to get it.
Scraping the web for data science
You can download useful data from the web. We’re living in a digital world, where virtually anything is available online. Much of this data can be pulled from websites by scraping. This way of getting data is very satisfying but takes time, because you have to clean up the HTML code along the way.
Financial data is especially great to get. You can analyse it with tools like Pandas DataFrames and NumPy arrays, which we discussed in the last post.
Let’s say you want to get information from Nasdaq. With a GET request you can get the HTML code right away:
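A minimal sketch of such a request, using only Python's standard library. The URL constant and the `fetch_html` helper are assumptions made for illustration; the exact page and library used in the lecture may differ.

```python
from urllib.request import Request, urlopen

# Assumed URL for illustration; the text does not name a specific page.
NASDAQ_URL = "https://www.nasdaq.com/"


def fetch_html(url, timeout=10.0):
    """Send a GET request and return the response body as text."""
    # Some sites reject requests that lack a browser-like User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (needs network access):
# html = fetch_html(NASDAQ_URL)
# print(html[:500])  # the first 500 characters of raw HTML
```

Printing only a slice of the response keeps the notebook output manageable, since a full page can run to thousands of lines.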
However, as you can see, this is far from readable. The next step would be to use BeautifulSoup to make it pretty:
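A small sketch of that step, assuming the `beautifulsoup4` package is installed. The `raw_html` string is a made-up stand-in for a downloaded page, and the helper names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML returned by a GET request.
raw_html = (
    "<html><head><title>Quotes</title></head>"
    "<body><p class='price'>101.5</p></body></html>"
)


def prettify_html(html):
    """Parse the HTML and return it re-indented, one tag per line."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.prettify()


def page_title(html):
    """Return the text of the <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


print(prettify_html(raw_html))
print(page_title(raw_html))
```

The built-in `html.parser` is used here so that no extra parser has to be installed.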
This is more readable, but you still need to put in the work to extract information from it. That’s a great way to start though.
Have a look at the BeautifulSoup documentation for more details.
Scraping is great, because you can learn Data Science by playing around with information about your hobby. Say you like video games and want to analyse different statistics: scraping is then a very good option. You’ll also learn more about the particular websites that interest you.
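To give a feel for that extraction work, here is a hedged sketch that pulls rows out of a table. The markup, the symbols and the `extract_quotes` helper are all hypothetical; a real page will have a different structure that you need to inspect first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page; actual sites will differ.
SAMPLE_HTML = """
<table id="quotes">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>AAA</td><td>101.50</td></tr>
  <tr><td>BBB</td><td>27.10</td></tr>
</table>
"""


def extract_quotes(html):
    """Return a list of {'symbol', 'price'} dicts from the sample table."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        # The header row uses <th>, so it yields no <td> cells and is skipped.
        if len(cells) == 2:
            quotes.append({"symbol": cells[0], "price": float(cells[1])})
    return quotes


print(extract_quotes(SAMPLE_HTML))
```

A list of dictionaries like this can be passed straight to a Pandas DataFrame for further analysis.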
Datasets available on the Internet
Another way to get data is to use datasets that have already been prepared. There are countless examples, and I’ll just discuss some major ones here.
Project Gutenberg is a great source for downloading books. You can access older books whose copyright has expired. Think Shakespeare.
The Twitter API is another fantastic source. Register a Developer account and you can start playing around with automation. For example, you can pull data about a single hashtag and see how people are responding to recent events. This is especially great for research in social sciences or marketing.
Kaggle is a fantastic resource for Data Science competitions, but also for ready-to-use datasets which you can download directly, for free, from their website. Major companies often announce their competitions there, so you’ll also see what is of interest to the wider Data Science community.
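Downloaded Project Gutenberg text files usually wrap the book in licence text, so a little cleaning is needed before analysis. The sketch below assumes the marker lines begin with `*** START OF` and `*** END OF`; the exact wording varies between files, and the helper name and sample text are made up for illustration.

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the START and END marker lines.

    Returns the text unchanged if no START marker is found.
    """
    body = []
    inside = False
    for line in text.splitlines():
        if line.startswith("*** START OF"):
            inside = True
            continue
        if line.startswith("*** END OF"):
            break
        if inside:
            body.append(line)
    return "\n".join(body).strip() if inside else text


# Made-up miniature file, only to show the shape of the input.
sample = (
    "Licence header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "To be, or not to be\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "Licence footer"
)

print(strip_gutenberg_boilerplate(sample))
```

Checking a few downloaded files by hand before relying on the markers is a good idea, since older files may format them differently.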
Established datasets for Data Science
On top of that, there are a number of datasets which are used for benchmarking machine learning models. I won’t go into details here, but I want to point you to a few that are simply worth knowing:
- MNIST — a dataset of handwritten digits
- CIFAR-10 — 60,000 images with 10 classes
- ImageNet — an image database organized according to the WordNet hierarchy
- IMDB Reviews — a dataset of movie reviews from IMDB
- Wikipedia — a great source of text
That’s all for this lecture. Now it’s time to open your Jupyter Notebook and play with your favourite dataset.
Practice is always the best way to learn Data Science.
If you prefer to view this lecture on YouTube have a look here: