Basic data wrangling with Python

Python is one of the most popular programming languages. In this post, I will demonstrate basic data wrangling with Python. This is the first in what I hope will become a series of posts on Python. My plan is to cover these topics:

  1. Basic data wrangling with Python
  2. Basic plotting with matplotlib and seaborn
  3. Comparison of ggplot in R versus in Python

Once I finish writing each topic, I will link it in the list above.

So, let’s start.

Loading necessary packages

Before loading the packages, you need to install them. There are two common ways to install Python packages: the pip command or the conda command. I will skip this part, but you can refer to this link to install the packages using the pip command or this link to install them using the conda command. For those who have both R and Python on their machine, I suggest using the conda command.

Let’s load the required packages.

import numpy as np 
import pandas as pd
from seaborn import load_dataset

All the functions from each package can be accessed through the alias (the abbreviated name) above. For example, functions in the pandas package are accessed through pd, or to be specific, with the prefix pd. followed by the function name. You will see this many times throughout this blog post, so do not worry much about it. I am sure you will get the gist of it once you see it later on. In practice, you don’t have to use pd for pandas and np for numpy, but this is a convention widely adopted in the Python community.
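To make the alias idea concrete, here is a minimal sketch. The three numbers are made-up toy values (not loaded from any dataset), and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration only.
values = np.array([5.1, 4.9, 4.7])

# pd.Series() is a pandas function, reached through the pd alias.
lengths = pd.Series(values, name="sepal_length")

# np.mean() is a numpy function, reached through the np alias.
average = np.mean(values)

print(lengths)
print(average)
```

Writing `import pandas` without an alias would work too, but every call would then need the longer `pandas.` prefix.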

Load the data

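We are going to use the iris dataset. This dataset is readily available through the seaborn package (load_dataset() downloads it from seaborn’s online data repository, so an internet connection is needed the first time).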

iris = load_dataset('iris')

Once we load the data, we need to check the variable types.

iris.dtypes
## sepal_length    float64
## sepal_width     float64
## petal_length    float64
## petal_width     float64
## species          object
## dtype: object

The species variable should, by right, be a categorical variable. So, we can use Categorical() from pandas to change it from an object type to a category. The pd. prefix here means we access the function from the pandas package, as explained previously.

iris['species'] = pd.Categorical(iris['species'])

If we check the variable types again, we can see that the species variable is now a category.

iris.dtypes
## sepal_length     float64
## sepal_width      float64
## petal_length     float64
## petal_width      float64
## species         category
## dtype: object
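As a side note, the same conversion can be sketched with astype('category'), and the .cat accessor then shows what pandas stores behind a categorical variable. The small data frame below is a made-up toy example, not the iris data.

```python
import pandas as pd

# Toy data frame, made up for illustration only.
toy = pd.DataFrame({"species": ["setosa", "virginica", "setosa", "versicolor"]})

# astype("category") is an alternative to pd.Categorical() for this conversion.
toy["species"] = toy["species"].astype("category")

# The categories are the distinct labels, sorted alphabetically by default.
categories = list(toy["species"].cat.categories)

# Each row stores an integer code that points into the categories.
codes = list(toy["species"].cat.codes)

print(toy.dtypes)
print(categories)
print(codes)
```

Which of the two forms to use is mostly a matter of taste; pd.Categorical() is handy when you also want to set the categories or their order explicitly.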

Next, we can also see the data. Let’s see the first 10 rows.

iris.head(10)
##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          3.9           1.7          0.4  setosa
## 6           4.6          3.4           1.4          0.3  setosa
## 7           5.0          3.4           1.5          0.2  setosa
## 8           4.4          2.9           1.4          0.2  setosa
## 9           4.9          3.1           1.5          0.1  setosa

Slicing and indexing

To see a specific column, we can index as below. Notice that the row number starts at 0, as opposed to R (if you have used R previously), in which the row number starts at 1.

iris['sepal_length'][0:10]
## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64

Similarly, we can leave out the start of the slice and still get the first 10 rows of the sepal_length variable.

iris['sepal_length'][:10]
## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64

Next, to access the first 5 rows, we can do as below.

iris[0:5]
##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa

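We can also use the iloc and loc indexers. Both take square brackets rather than parentheses. The main difference between the two is that iloc selects by integer position, while loc selects by label (here, the row index values and the column names). Note also that a slice in iloc excludes the end position, whereas a slice in loc includes the end label, which is why 0:2 returns two rows with iloc and three rows with loc below.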

iris.iloc[0:2, 0:3] #rows, then columns
##    sepal_length  sepal_width  petal_length
## 0           5.1          3.5           1.4
## 1           4.9          3.0           1.4
iris.loc[0:2, ['sepal_length', 'species']]
##    sepal_length species
## 0           5.1  setosa
## 1           4.9  setosa
## 2           4.7  setosa
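The difference between position and label is easier to see when the row labels are not numbers. Below is a minimal sketch on a made-up toy data frame whose index labels (a to d) I chose only for illustration.

```python
import pandas as pd

# Toy data frame with string row labels, made up for illustration only.
toy = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 4.6], "species": ["setosa"] * 4},
    index=["a", "b", "c", "d"],
)

# iloc works on positions: rows at positions 0 and 1 (the stop is excluded).
by_position = toy.iloc[0:2]

# loc works on labels: rows "a" through "c" (the stop is included).
by_label = toy.loc["a":"c"]

print(by_position)
print(by_label)
```

With the default index, the labels happen to be the integers 0, 1, 2, and so on, which is why loc and iloc can look interchangeable on the iris data.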

Subsequently, we can also slice according to a logical condition. Below, we select the values of petal_length that are greater than 6.

ind = iris['petal_length'] > 6
iris['petal_length'][ind]
## 105    6.6
## 107    6.3
## 109    6.1
## 117    6.7
## 118    6.9
## 122    6.7
## 130    6.1
## 131    6.4
## 135    6.1
## Name: petal_length, dtype: float64
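Several conditions can be combined as well. The sketch below uses a made-up toy data frame; the point is that pandas expects & (and) or | (or) between conditions, with each condition wrapped in parentheses.

```python
import pandas as pd

# Toy data frame, made up for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 6.6, 5.1, 6.1],
    "petal_width": [0.2, 2.1, 1.8, 1.9],
})

# Each condition gives a True/False series; & keeps rows where both are True.
mask = (toy["petal_length"] > 6) & (toy["petal_width"] >= 2)

selected = toy.loc[mask]
print(selected)
```

Python's plain `and` and `or` do not work here, because they cannot compare two series element by element.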

Let’s say we want our data to include only setosa species.

ind = iris['species'] == 'setosa'
iris.loc[ind, :].head()
##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
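If we want to keep more than one species, isin() saves us from writing several == comparisons. This is a minimal sketch on a made-up toy data frame, not the iris data.

```python
import pandas as pd

# Toy data frame, made up for illustration only.
toy = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
})

# isin() is True for rows whose species appears in the given list.
keep = toy["species"].isin(["setosa", "virginica"])

# reset_index(drop=True) renumbers the remaining rows from 0.
subset = toy.loc[keep].reset_index(drop=True)
print(subset)
```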

Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:

  • rows 1, 2, 3, and 4 (index 0 to 3) of sepal_length to NA values
  • row 6 (index 5) of species and sepal_width to NA values

iris.loc[0:3, 'sepal_length'] = np.nan
iris.iloc[5, [1, 4]] = np.nan

Let’s see the result.

iris.head(6)
##    sepal_length  sepal_width  petal_length  petal_width species
## 0           NaN          3.5           1.4          0.2  setosa
## 1           NaN          3.0           1.4          0.2  setosa
## 2           NaN          3.2           1.3          0.2  setosa
## 3           NaN          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          NaN           1.7          0.4     NaN
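The same idea works with a logical condition in place of row numbers. In the sketch below, I assume a made-up toy column in which -1.0 was used as a placeholder for "not measured"; that placeholder is my own assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data frame, made up for illustration; -1.0 stands in for "not measured".
toy = pd.DataFrame({"sepal_width": [3.5, -1.0, 3.2, -1.0]})

# Select the rows that meet the condition and the column to change, then assign.
toy.loc[toy["sepal_width"] < 0, "sepal_width"] = np.nan

print(toy)
```

Doing the selection and the assignment in a single loc call changes the data frame itself rather than a temporary copy.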

Missing values

If we want to check whether we have any missing values in our data, we can use the isnull() function.

iris.isnull().any().any() #For overall
## True
iris.isnull().any() #Check for each column
## sepal_length     True
## sepal_width      True
## petal_length    False
## petal_width     False
## species          True
## dtype: bool

We can further count how many missing values we have in each column.

iris.isnull().sum()
## sepal_length    4
## sepal_width     1
## petal_length    0
## petal_width     0
## species         1
## dtype: int64
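Two small variations on the same idea are sketched below on a made-up toy data frame: the share of missing values per column, and the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data frame, made up for illustration only.
toy = pd.DataFrame({
    "sepal_length": [np.nan, np.nan, 5.0, 5.4],
    "species": ["setosa", "setosa", None, "setosa"],
})

# The mean of True/False values is the proportion of True, so this is
# the share of missing values in each column.
missing_share = toy.isnull().mean()

# any(axis=1) checks across the columns of each row.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_share)
print(rows_with_missing)
```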

Descriptive statistics

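To get basic descriptive statistics, we can use the describe() function. Below, we additionally use round(), which with no argument rounds the results to the nearest whole number; pass a number of decimal places, for example round(1), to keep one decimal place.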

iris.describe().round()
##        sepal_length  sepal_width  petal_length  petal_width
## count         146.0        149.0         150.0        150.0
## mean            6.0          3.0           4.0          1.0
## std             1.0          0.0           2.0          1.0
## min             4.0          2.0           1.0          0.0
## 25%             5.0          3.0           2.0          0.0
## 50%             6.0          3.0           4.0          1.0
## 75%             6.0          3.0           5.0          2.0
## max             8.0          4.0           7.0          2.0
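To keep some precision in the summary, round() accepts the number of decimal places. Here is a minimal sketch on four made-up toy values.

```python
import pandas as pd

# Toy data frame, made up for illustration only.
toy = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7, 4.6]})

# round(1) keeps one decimal place; round() alone would give whole numbers.
summary = toy.describe().round(1)

print(summary)
```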

Notice that the results above only include numerical variables. So, to get the results for categorical variables as well, we need to add include = 'all' as below.

iris.describe(include = 'all').round()
##         sepal_length  sepal_width  petal_length  petal_width     species
## count          146.0        149.0         150.0        150.0         149
## unique           NaN          NaN           NaN          NaN           3
## top              NaN          NaN           NaN          NaN  versicolor
## freq             NaN          NaN           NaN          NaN          50
## mean             6.0          3.0           4.0          1.0         NaN
## std              1.0          0.0           2.0          1.0         NaN
## min              4.0          2.0           1.0          0.0         NaN
## 25%              5.0          3.0           2.0          0.0         NaN
## 50%              6.0          3.0           4.0          1.0         NaN
## 75%              6.0          3.0           5.0          2.0         NaN
## max              8.0          4.0           7.0          2.0         NaN

Alternatively, we can count the number of observations in each category of the categorical variable. By default, value_counts() counts only the non-missing values.

iris['species'].value_counts()
## species
## versicolor    50
## virginica     50
## setosa        49
## Name: count, dtype: int64
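If we want the missing values to show up in the count as well, value_counts() has a dropna argument. The sketch below uses a made-up toy series with one missing entry.

```python
import pandas as pd

# Toy series with one missing value, made up for illustration only.
species = pd.Series(["setosa", "versicolor", None, "versicolor"])

# Default behaviour: missing values are left out of the count.
counts = species.value_counts()

# dropna=False adds a row for the missing values.
counts_with_missing = species.value_counts(dropna=False)

print(counts)
print(counts_with_missing)
```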

Similarly, for a numerical variable, we can calculate each statistic manually. For example, to calculate the mean, we can use mean().

iris['sepal_width'].mean().round()
## 3.0
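Two things are worth knowing here: mean() skips missing values by default, and agg() can compute several statistics in one go. This is a minimal sketch on a made-up toy series.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value, made up for illustration only.
widths = pd.Series([3.5, 3.0, np.nan, 3.1], name="sepal_width")

# By default, the missing value is ignored.
mean_skipping_na = widths.mean()

# With skipna=False, a single missing value makes the result missing too.
mean_with_na = widths.mean(skipna=False)

# agg() takes a list of statistic names and returns one value per name.
stats = widths.agg(["mean", "median", "std"])

print(mean_skipping_na)
print(mean_with_na)
print(stats)
```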

That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.

Tengku Muhammad Hanis
Lead academic trainer

My research interests include medical statistics and machine learning application.
