Python and Tidyverse

Introduction

One of the great things about the R world has been a collection of R packages called tidyverse that are easy for beginners to learn and provide a consistent data manipulation and visualisation space. The value of these tools has been so great that many of them have been ported to Python. That’s why we thought we should provide an introduction to tidyverse for Python blog post.

What is tidyverse?

Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. The core R tidyverse packages are: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats.

Python implementation of dplyr

The tidyverse package dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. Here are some of the functions dplyr provides that are commonly used:

mutate() - adds new variables that are functions of existing variables
select() - picks variables based on their names.
filter() - picks cases based on their values.
summarise() - reduces multiple values down to a single summary.
arrange() - changes the ordering of the rows.

Dplython is a Python implementation of dplyr which can be installed using pip and the following command:

pip install dplython

Instructions on how to use pip to install python packages can be found here.

The Dplython README provides some clear examples of how the package can be used. Below is an summary of the common functions:

select() - used to get specific columns of the data-frame.
sift() - used to filter out rows based on the value of a variable in that row.
sample_n() and sample_frac() - used to provide a random sample of rows from the data-frame.
arrange() - used to sort results.
mutate() - used to create new columns based on existing columns.

For more functions and example code visit the Dplython README page.

At the bottom of the README a comparison is provided to pandas-ply which is another python implementation of dplyr.

Dplython comes with a sample data-set called ‘diamonds’. Here are some basic examples of how to use Dplython.

Import Python packages and the ‘diamonds’ data-frame:

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

Create a new data-frame by selecting columns of the ‘diamonds’ data-frame:

diamondsSmall = diamonds >> select(X.carat, X.cut, X.price, X.color, X.clarity  , X.depth  , X.table)

Display the top 4 rows of the ‘diamondsSmall’ data-frame:

print(diamondsSmall >> head(4)) 

##    carat      cut  price color clarity  depth  table
## 0   0.23    Ideal    326     E     SI2   61.5   55.0
## 1   0.21  Premium    326     E     SI1   59.8   61.0
## 2   0.23     Good    327     E     VS1   56.9   65.0
## 3   0.29  Premium    334     I     VS2   62.4   58.0

Filter the data-frame for rows where the price is higher than 18,000 and the carat less than 1.2 and sort them by depth:

print((diamondsSmall >> sift(X.price > 18000, X.carat < 1.2) >> arrange(X.depth)))

##        carat        cut  price color clarity  depth  table
## 27455   1.14  Very Good  18112     D      IF   59.1   58.0
## 27457   1.07  Very Good  18114     D      IF   60.9   58.0
## 27530   1.07    Premium  18279     D      IF   60.9   58.0
## 27635   1.04  Very Good  18542     D      IF   61.3   56.0
## 27507   1.09  Very Good  18231     D      IF   61.7   58.0

Provide a random sample of 5 rows from the data-frame

print(diamondsSmall >> sample_n(5))

##        carat        cut  price color clarity  depth  table
## 320     0.71       Good   2801     F     VS2   57.8   60.0
## 9813    0.91    Premium   4670     H     VS1   61.8   54.0
## 11795   1.18  Very Good   5088     E     SI2   62.5   60.0
## 11845   0.95  Very Good   5101     D     SI1   63.7   55.0
## 11552   1.17      Ideal   5032     F     SI1   63.0   54.0

Add a column to the data-frame containing the rounded value of ‘carat’

print((diamondsSmall >> mutate(carat_bin=X.carat.round()) >>  sample_n(5)))

##        carat        cut  price color clarity  depth  table  carat_bin
## 11883   0.99  Very Good   5112     F     SI1   62.5   58.0        1.0
## 45123   0.77       Fair   1651     D     SI2   65.1   63.0        1.0
## 51630   0.31    Premium    544     E     SI1   59.2   60.0        0.0
## 49382   0.51  Very Good   2102     G      IF   62.6   56.0        1.0
## 18296   1.54  Very Good   7437     I     SI2   63.3   60.0        2.0

Python implementation of ggplot2

The tidyverse package ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

A Python port of ggplot2 has long been requested and there are now a few Python implementations of it; Plotnine is the one we will explore here. Plotting with a grammar is powerful, it makes custom (and otherwise complex) plots easy to think about and create, while the plots remain simple.

Plotnine can be installed using pip:

pip install plotnine

Plotnine splits plotting into three distinct parts which are data, aesthetics and layers. The data step adds the data to the graph, the aesthetics (aes) step adds visual attributes and the layers step creates the objects on a plot. Multiple aesthetics and layers functions can be added to a Plotnine graph.

If you are a python user used to Matplotlib it can take some getting used to a Grammar of Graphics plotting tool which is partly due to the difference in philosophy. Plotnine provides some tutorials to help with getting to grips with the package and there is also the Plotnine README. However if you are new to Grammar of Graphics plotting then this highly recommended kaggle notebook for Plotnine is probably the best place to start.

Here are some examples of how to use plotnine to visualize data from the ‘diamonds’ data-frame that comes with Dplython.

Import Python packages, the ‘diamonds’ data-frame and create a sample data-frame:

import warnings; warnings.filterwarnings("ignore") # hide Python warnings 
import pandas
import dplython as dplython
from plotnine import *
diamondsSample = dplython.diamonds >> dplython.sample_n(5000)

Create a scatter plot of ‘carat’ vs ‘price’:

print(ggplot(diamondsSample) # diamondsSample is the data  
 + aes('carat', 'price') # plot 'carat' vs 'price'
 + geom_point() # display the results as a scatter plot
 )

## <ggplot: (41012744)>

Add additional layers e.g. a line of best fit:

print(ggplot(diamondsSample)  
 + aes('carat', 'price') 
 + stat_smooth() # add a line of best fit
 + geom_point()) 

## <ggplot: (-9223372036813567705)>

Add another aesthetic, here the data is coloured by the ‘cut’ variable:

print(ggplot(diamondsSample)
 + aes('carat', 'price')
 + aes(color='cut') # colour the data by the variable cut and create a ledgend 
 + geom_point())

## <ggplot: (-9223372036816020904)>

Add a layer which separates the data into graphs based on ‘colour’

print(ggplot(diamondsSample)
 + aes('carat', 'price')
 + aes(color='cut')
 + facet_wrap('color') # seperate the data by 'colour' and graph seperately  
 + geom_point())

## <ggplot: (64014519)>

This article compares a variety of alternative plotting packages for Python.

Next steps

Read the documents that are linked in this blog post.
Learn the basics of Pandas.
Use Dplython and Plotnine to practice data manipulation & visualization. For example complete some of the exercises at kaggle.

Do you know of other good Python implementations of tidyverse? If so let us know about them!