Data Science Crash Course 4/10: Processing Data

Let’s learn Data Science in 2020

Welcome to the fourth instalment of the Data Science Crash Course. This time we'll finally do something with Python and data: I'll review basic techniques for processing data and for storing information. We've already learned that we want to represent our data as vectors and matrices.


Importing Data

We can start by importing files you might already have on your computer:

  • text files (.txt)
  • Excel spreadsheets (.xlsx)
  • JSON files (.json)
  • XML files (.xml)
  • CSV files (.csv)

You can put them all to work by importing them into a Jupyter Notebook.

Most of this is done with ‘open(…)’, for example like this:

Open CSV in Python
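
Here's a minimal sketch of the idea behind that snippet: reading a CSV file with the built-in open() plus the csv module. The filename data.csv is just a placeholder for whatever file you have on disk.

    import csv

    # Open the file and read it row by row with the standard-library csv module
    with open('data.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)  # each row comes back as a list of strings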

Have a look at Python Programming or Real Python to read more about it.

The same goes for other file formats.

We already know that we want to represent our data in the form of arrays (matrices, vectors), so let's see how we can do it in Python.

Storing Data

Now the question is how to store them. There are a couple of standard ways to do it.

NumPy arrays are an easy way to represent arrays, and NumPy is one of the best libraries for doing Data Science. Have a look here for the official documentation. And here's a snippet from a Jupyter Notebook if you want to define the vector [1, 2, 3]:

NumPy Arrays are a great Data Science Tool
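
If the screenshot isn't handy, a snippet along those lines looks like this:

    import numpy as np

    # Define the vector [1, 2, 3] as a NumPy array
    v = np.array([1, 2, 3])
    print(v)        # [1 2 3]
    print(v.shape)  # (3,)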

Quick and simple, right?

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.

The DataFrame, a spreadsheet-like tabular format, is part of Pandas and also lets you play with data. This example shows how to use it to create a simple spreadsheet:

Pandas DataFrame for Data Science
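
A sketch of such a snippet; the column names and values here are made up purely for illustration:

    import pandas as pd

    # Build a small spreadsheet-like table from a dictionary of columns
    df = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Carol'],
        'age': [25, 32, 29],
    })
    print(df)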

It's really efficient and great to use. If you have used Excel extensively before, this will all feel very familiar.

Apart from those two very efficient packages, Python itself comes with a plethora of data structures. Have a look here to see how many there are. To name a few that come in handy when manipulating data (a quick sketch follows the list):

  • lists
  • dictionaries
  • tuples
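
Here they are in action; the values are made up for illustration and need no imports at all:

    # list: ordered and mutable
    measurements = [1.2, 3.4, 5.6]
    measurements.append(7.8)

    # dictionary: key-value pairs
    ages = {'Alice': 25, 'Bob': 32}
    ages['Carol'] = 29

    # tuple: ordered and immutable, unpacks nicely
    point = (2.0, 3.0)
    x, y = point

    print(measurements, ages, x, y)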

The best way to learn is to play around with it, so open your Jupyter Notebook now!

Getting Data from the Web

If you don't have any interesting data on your computer, then the best way is to just scrape information from the web. It's pretty easy in Python with packages like ‘requests’ (to fetch pages) and ‘BeautifulSoup’ (to parse and clean the HTML).

Most websites are easily scraped using requests, and then it's all just a matter of cleaning up the result.
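
For a taste, here's a minimal sketch with requests and BeautifulSoup; the URL is just a stand-in, and which tags you pull out depends on the page you're scraping:

    import requests
    from bs4 import BeautifulSoup

    # Fetch a page and parse the HTML
    response = requests.get('https://example.com')
    soup = BeautifulSoup(response.text, 'html.parser')

    print(soup.title.string)          # the page title
    for link in soup.find_all('a'):   # every anchor tag on the page
        print(link.get('href'))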

I'll talk more about getting data in the next lecture, where I'll also share example code for scraping data from the web.

If you want to see the video version of this text, have a look here:

Written by

CEO of Contentyze, the text editor 2.0, PhD in maths, Forbes 30 under 30. Sign up for free at https://app.contentyze.com
