Basic data wrangling with Python
Python is one of the most popular programming languages. In this post, I will demonstrate how to do basic data wrangling with Python. This is going to be the first in a series of posts related to Python (hopefully). My plan is to cover these topics:
- Basic data wrangling with Python
- Basic plotting with matplotlib and seaborn
- Comparison of ggplot in R versus in Python
As I finish writing each topic, I will link it in the list above.
So, let’s start.
Loading necessary packages
Before loading the packages, you need to install them. Basically, there are two ways to install Python packages: with the pip command or with the conda command. I will skip this part, but you can refer to this link to install packages using pip, or this link to install packages using conda. For those who have both R and Python on their machine, I suggest using conda.
Let’s load the required packages.
import numpy as np
import pandas as pd
from seaborn import load_dataset
All the functions from each package can be accessed through the alias, the abbreviated name given after as in the import statements above. For example, functions in the pandas package can be accessed through pd or, to be specific, with the pd. prefix. You will see this many times throughout this blog post, so do not worry much about it. I am sure you will get the gist once you see it in use later on. In practice, you don’t have to use pd for pandas and np for numpy, but this is a convention widely adopted in the Python community.
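To make the alias idea concrete, here is a minimal sketch. It assumes only that pandas is installed; the Series values are made up for illustration.

```python
import pandas          # full package name
import pandas as pd    # the same package under the conventional alias

# Both names point to the same module object
same_module = pd is pandas
print(same_module)  # True

# Anything in pandas is reached with the pd. prefix
numbers = pd.Series([1, 2, 3])  # made-up values
total = numbers.sum()
print(total)  # 6
```

Any alias would work here, but sticking to pd and np makes your code easier for others to read.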
Load the data
We are going to use the iris dataset. This dataset is readily available in the seaborn package.
iris = load_dataset('iris')
Once the data is loaded, we check the type of each variable.
iris.dtypes
## sepal_length float64
## sepal_width float64
## petal_length float64
## petal_width float64
## species object
## dtype: object
The species variable should really be a categorical variable. So, we can use Categorical() from pandas to change it from the object type to the category type. The pd. prefix here means we access the function from the pandas package, as explained previously.
iris['species'] = pd.Categorical(iris['species'])
If we check the variable type again, we can see the species variable is a category.
iris.dtypes
## sepal_length float64
## sepal_width float64
## petal_length float64
## petal_width float64
## species category
## dtype: object
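As a small aside, the same conversion can also be written with astype('category'). The sketch below uses a tiny hand-made data frame (hypothetical values, not the real iris data) so that it runs without downloading anything, and it also peeks at the categories and integer codes that pandas stores behind the scenes.

```python
import pandas as pd

# Hand-made stand-in for the species column (hypothetical values)
df = pd.DataFrame({"species": ["setosa", "virginica", "setosa", "versicolor"]})

# astype('category') is an alternative to pd.Categorical() for one column
df["species"] = df["species"].astype("category")

print(df["species"].dtype)                 # category
print(list(df["species"].cat.categories))  # ['setosa', 'versicolor', 'virginica']
print(list(df["species"].cat.codes))       # [0, 2, 0, 1]
```

The categories are sorted alphabetically by default, and each row stores only the integer code of its category.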
Next, we can also see the data. Let’s see the first 10 rows.
iris.head(10)
## sepal_length sepal_width petal_length petal_width species
## 0 5.1 3.5 1.4 0.2 setosa
## 1 4.9 3.0 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
## 5 5.4 3.9 1.7 0.4 setosa
## 6 4.6 3.4 1.4 0.3 setosa
## 7 5.0 3.4 1.5 0.2 setosa
## 8 4.4 2.9 1.4 0.2 setosa
## 9 4.9 3.1 1.5 0.1 setosa
Slicing and indexing
To see a specific column, we can index as below. Notice that the row number starts at 0, as opposed to R (if you have used R previously), in which the row number starts at 1. Also, the end of the slice is excluded, so 0:10 returns rows 0 to 9.
iris['sepal_length'][0:10]
## 0 5.1
## 1 4.9
## 2 4.7
## 3 4.6
## 4 5.0
## 5 5.4
## 6 4.6
## 7 5.0
## 8 4.4
## 9 4.9
## Name: sepal_length, dtype: float64
Similarly, we can leave out the starting 0 and index as below to get the same first 10 rows of the sepal_length variable.
iris['sepal_length'][:10]
## 0 5.1
## 1 4.9
## 2 4.7
## 3 4.6
## 4 5.0
## 5 5.4
## 6 4.6
## 7 5.0
## 8 4.4
## 9 4.9
## Name: sepal_length, dtype: float64
Next, to access the first 5 rows of the whole data frame, we can do as below.
iris[0:5]
## sepal_length sepal_width petal_length petal_width species
## 0 5.1 3.5 1.4 0.2 setosa
## 1 4.9 3.0 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
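We can also use the iloc and loc indexers. Both take square brackets rather than parentheses. The main difference between the two is that iloc selects by integer position and excludes the end of a slice, while loc selects by label (such as a column name) and includes the end of a slice. That is why 0:2 returns two rows with iloc but three rows with loc below.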
iris.iloc[0:2, 0:3] #rows, then columns
## sepal_length sepal_width petal_length
## 0 5.1 3.5 1.4
## 1 4.9 3.0 1.4
iris.loc[0:2, ['sepal_length', 'species']]
## sepal_length species
## 0 5.1 setosa
## 1 4.9 setosa
## 2 4.7 setosa
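The difference between position and label is easier to see when the row labels are not numbers. Here is a minimal sketch with a hypothetical toy data frame whose rows are labelled a to d (these labels are my own choice for illustration).

```python
import pandas as pd

# Hypothetical toy frame with string row labels, so positions and labels differ
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 4.6], "species": ["setosa"] * 4},
    index=["a", "b", "c", "d"],
)

# iloc: integer positions, end of the slice excluded (positions 0 and 1)
by_position = df.iloc[0:2, 0:1]

# loc: labels, end of the slice included (labels a, b and c)
by_label = df.loc["a":"c", ["sepal_length"]]

print(by_position.shape)  # (2, 1)
print(by_label.shape)     # (3, 1)
```

In the iris data the row labels happen to be the integers 0, 1, 2, and so on, which is why loc accepts 0:2 there.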
We can also slice according to a logical condition. Below, we keep only the petal_length values that are above 6.
ind = iris['petal_length'] > 6
iris['petal_length'][ind]
## 105 6.6
## 107 6.3
## 109 6.1
## 117 6.7
## 118 6.9
## 122 6.7
## 130 6.1
## 131 6.4
## 135 6.1
## Name: petal_length, dtype: float64
Let’s say we want our data to include only setosa species.
ind = iris['species'] == 'setosa'
iris.loc[ind, :].head()
## sepal_length sepal_width petal_length petal_width species
## 0 5.1 3.5 1.4 0.2 setosa
## 1 4.9 3.0 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
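Logical conditions can also be combined. The sketch below uses a small hand-made data frame (hypothetical values) and the operators & (and) and ~ (not). Each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({
    "petal_length": [1.4, 6.6, 5.1, 6.1, 1.5],
    "species": ["setosa", "virginica", "versicolor", "virginica", "setosa"],
})

# Both conditions must hold: virginica AND petal_length above 6
long_virginica = df[(df["species"] == "virginica") & (df["petal_length"] > 6)]

# ~ negates a condition: every row that is NOT setosa
not_setosa = df[~(df["species"] == "setosa")]

print(long_virginica.index.tolist())  # [1, 3]
print(not_setosa.index.tolist())      # [1, 2, 3]
```

Use | in the same way when either condition is enough. The Python keywords and, or, and not do not work element by element on pandas columns.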
Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:
- rows 1, 2, 3, and 4 (index 0 to 3) of sepal_length to missing values (NaN)
- row 6 (index 5) of sepal_width and species to missing values (NaN)
iris.loc[0:3, 'sepal_length'] = np.nan
iris.iloc[5, [1, 4]] = np.nan
Let’s see the result.
iris.head(6)
## sepal_length sepal_width petal_length petal_width species
## 0 NaN 3.5 1.4 0.2 setosa
## 1 NaN 3.0 1.4 0.2 setosa
## 2 NaN 3.2 1.3 0.2 setosa
## 3 NaN 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
## 5 5.4 NaN 1.7 0.4 NaN
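The same idea works with a logical condition instead of row numbers. Below is a minimal sketch on a hand-made data frame (hypothetical values), where sepal_width is set to missing wherever sepal_length is above 7.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.8],
})

# Row condition and column name go together in a single loc call
df.loc[df["sepal_length"] > 7, "sepal_width"] = np.nan

print(df["sepal_width"].isnull().tolist())  # [False, False, False, True]
```

Doing the selection and the assignment in one loc call is the safer habit. Chaining two sets of brackets, such as df['sepal_width'][condition] = value, may not change the original data frame.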
Missing values
If we want to see whether we have any missing values in our data, we can use the isnull() function.
iris.isnull().any().any() #For overall
## True
iris.isnull().any() #Check for each column
## sepal_length True
## sepal_width True
## petal_length False
## petal_width False
## species True
## dtype: bool
We can further calculate how many missing values we have in each column.
iris.isnull().sum()
## sepal_length 4
## sepal_width 1
## petal_length 0
## petal_width 0
## species 1
## dtype: int64
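Building on the same isnull() function, here is a short sketch of a few other summaries: the total number of missing values, the share of missing values per column, and the rows that contain at least one missing value. The data frame is hand-made (hypothetical values).

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({
    "sepal_length": [np.nan, np.nan, 4.7, 4.6],
    "sepal_width": [3.5, np.nan, 3.2, 3.1],
})

# Total number of missing cells in the whole data frame
total_missing = int(df.isnull().sum().sum())
print(total_missing)  # 3

# Share of missing values per column (the mean of True/False values)
share_missing = df.isnull().mean()
print(share_missing["sepal_length"])  # 0.5

# Rows with at least one missing value (axis=1 checks across columns)
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())  # [0, 1]
```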
Descriptive statistics
To get basic descriptive statistics, we can use the describe() function. Below, we additionally use round(), which with no argument rounds the results to whole numbers. Use round(1) instead to keep one decimal place.
iris.describe().round()
## sepal_length sepal_width petal_length petal_width
## count 146.0 149.0 150.0 150.0
## mean 6.0 3.0 4.0 1.0
## std 1.0 0.0 2.0 1.0
## min 4.0 2.0 1.0 0.0
## 25% 5.0 3.0 2.0 0.0
## 50% 6.0 3.0 4.0 1.0
## 75% 6.0 3.0 5.0 2.0
## max 8.0 4.0 7.0 2.0
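To see what the argument of round() does, here is a minimal sketch on a hand-made column (hypothetical values) that compares round() with round(1).

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({"sepal_width": [3.5, 3.0, 3.2, 3.1]})
summary = df.describe()

# No argument: round to whole numbers
mean_whole = summary.round().loc["mean", "sepal_width"]
print(mean_whole)  # 3.0

# Argument 1: keep one decimal place
mean_one_decimal = summary.round(1).loc["mean", "sepal_width"]
print(mean_one_decimal)  # 3.2
```

Rounding only affects how the summary is displayed here. The data frame itself is not changed.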
Notice that the results above only include numerical variables. So, to get the results for categorical variables as well, we need to add include = 'all' as below.
iris.describe(include = 'all').round()
## sepal_length sepal_width petal_length petal_width species
## count 146.0 149.0 150.0 150.0 149
## unique NaN NaN NaN NaN 3
## top NaN NaN NaN NaN versicolor
## freq NaN NaN NaN NaN 50
## mean 6.0 3.0 4.0 1.0 NaN
## std 1.0 0.0 2.0 1.0 NaN
## min 4.0 2.0 1.0 0.0 NaN
## 25% 5.0 3.0 2.0 0.0 NaN
## 50% 6.0 3.0 4.0 1.0 NaN
## 75% 6.0 3.0 5.0 2.0 NaN
## max 8.0 4.0 7.0 2.0 NaN
Alternatively, we can count how many times each category appears with value_counts(). By default, value_counts() counts only the non-missing values, which is why setosa shows 49 instead of 50 below.
iris['species'].value_counts()
## species
## versicolor 50
## virginica 50
## setosa 49
## Name: count, dtype: int64
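If we want the missing values to be counted too, value_counts() has a dropna argument. The sketch below uses a short hand-made column (hypothetical values) and also shows normalize=True, which returns proportions instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration, with one missing entry
species = pd.Series(["setosa", "virginica", np.nan, "virginica"])

# Default: missing values are left out
counts = species.value_counts()
print(counts.sum())  # 3

# dropna=False adds a separate row for the missing values
counts_with_na = species.value_counts(dropna=False)
print(counts_with_na.sum())  # 4

# normalize=True gives proportions of the non-missing values
proportions = species.value_counts(normalize=True)
print(round(proportions["virginica"], 2))  # 0.67
```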
Similarly, for a numerical variable we can calculate each statistic individually. For example, to calculate the mean, we can use mean().
iris['sepal_width'].mean().round()
## 3.0
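One detail worth knowing is that mean() skips missing values by default. Here is a minimal sketch on a hand-made column (hypothetical values) that shows this, along with agg() for calculating several statistics in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration, with one missing entry
sepal_width = pd.Series([3.5, np.nan, 3.0, 2.5])

# Missing values are skipped by default
mean_skipping_na = sepal_width.mean()
print(mean_skipping_na)  # 3.0

# With skipna=False, a single missing value makes the result missing
mean_keeping_na = sepal_width.mean(skipna=False)
print(mean_keeping_na)  # nan

# Several statistics at once
stats = sepal_width.agg(["mean", "median", "min", "max"])
print(stats["median"])  # 3.0
```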
That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.