Python and Tidyverse
Introduction
One of the great things about the R world is the tidyverse, a collection of R packages that are easy for beginners to learn and provide a consistent environment for data manipulation and visualisation. These tools have proved so valuable that many of them have been ported to Python, so in this blog post we introduce tidyverse-style tools for Python.
What is tidyverse?
Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. The core R tidyverse packages are: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats.
Python implementation of dplyr
The tidyverse package dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. Here are some of the functions dplyr provides that are commonly used:
- mutate() - adds new variables that are functions of existing variables
- select() - picks variables based on their names.
- filter() - picks cases based on their values.
- summarise() - reduces multiple values down to a single summary.
- arrange() - changes the ordering of the rows.
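For readers who already know some pandas, the sketch below shows one rough plain-pandas counterpart for each of these verbs. It is an illustration rather than an official mapping: the small data-frame is built from the first four 'diamonds' rows printed later in this post, and the column name price_per_carat is made up for the example.

```python
import pandas as pd

# A few 'diamonds' rows (subset of columns) copied from the output shown in this post
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29],
    "cut": ["Ideal", "Premium", "Good", "Premium"],
    "price": [326, 326, 327, 334],
})

# mutate(): add a column computed from existing columns
# (price_per_carat is a hypothetical name chosen for this example)
mutated = df.assign(price_per_carat=df["price"] / df["carat"])

# select(): keep columns by name
selected = df[["carat", "price"]]

# filter(): keep rows based on their values
filtered = df[df["cut"] == "Premium"]

# summarise(): reduce many values to a single summary
mean_price = df["price"].mean()

# arrange(): change the ordering of the rows
arranged = df.sort_values("carat")

print(mean_price)  # 328.25
```

Each pandas call returns a new object and leaves df unchanged, which is the same style of working that dplyr encourages.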
Dplython is a Python implementation of dplyr which can be installed using pip and the following command:
pip install dplython
Instructions on how to use pip to install python packages can be found here.
The Dplython README provides some clear examples of how the package can be used. Below is a summary of the common functions:
- select() - used to get specific columns of the data-frame.
- sift() - used to keep only the rows that satisfy conditions on the values in that row.
- sample_n() and sample_frac() - used to provide a random sample of rows from the data-frame.
- arrange() - used to sort results.
- mutate() - used to create new columns based on existing columns.
For more functions and example code visit the Dplython README page.
At the bottom of the README there is a comparison with pandas-ply, which is another Python implementation of dplyr.
Dplython comes with a sample data-set called ‘diamonds’. Here are some basic examples of how to use Dplython.
Import Python packages and the ‘diamonds’ data-frame:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)
Create a new data-frame by selecting columns of the ‘diamonds’ data-frame:
diamondsSmall = diamonds >> select(X.carat, X.cut, X.price, X.color, X.clarity , X.depth , X.table)
Display the top 4 rows of the ‘diamondsSmall’ data-frame:
print(diamondsSmall >> head(4))
## carat cut price color clarity depth table
## 0 0.23 Ideal 326 E SI2 61.5 55.0
## 1 0.21 Premium 326 E SI1 59.8 61.0
## 2 0.23 Good 327 E VS1 56.9 65.0
## 3 0.29 Premium 334 I VS2 62.4 58.0
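The >> operator in these examples pipes a data-frame from one verb to the next. To give a feel for how this kind of piping can be built in Python, here is a minimal toy sketch using the reflected right-shift method __rrshift__ on plain lists of dictionaries. It is not Dplython's actual implementation, and the names Verb, select_cols, keep_rows and first_n are hypothetical, chosen so they do not clash with the Dplython imports.

```python
class Verb:
    """Wraps a function so that `data >> Verb(func)` returns func(data)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python falls back to this method for `data >> verb`
        # when `data` itself does not know how to handle `>>`.
        return self.func(data)


def select_cols(*names):
    # Keep only the named keys in every row
    return Verb(lambda rows: [{n: row[n] for n in names} for row in rows])


def keep_rows(predicate):
    # Keep only the rows for which predicate(row) is true
    return Verb(lambda rows: [row for row in rows if predicate(row)])


def first_n(n):
    # Keep the first n rows
    return Verb(lambda rows: rows[:n])


# A few 'diamonds' rows copied from the output shown in this post
rows = [
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.23, "cut": "Good", "price": 327},
    {"carat": 0.29, "cut": "Premium", "price": 334},
]

result = (rows
          >> keep_rows(lambda row: row["cut"] == "Premium")
          >> select_cols("carat", "price")
          >> first_n(1))
print(result)  # [{'carat': 0.21, 'price': 326}]
```

Because >> is evaluated left to right, each verb receives the output of the previous one, which is what makes the chain read like a sentence.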
Filter the data-frame for rows where the price is higher than 18,000 and the carat is less than 1.2, then sort them by depth:
print((diamondsSmall >> sift(X.price > 18000, X.carat < 1.2) >> arrange(X.depth)))
## carat cut price color clarity depth table
## 27455 1.14 Very Good 18112 D IF 59.1 58.0
## 27457 1.07 Very Good 18114 D IF 60.9 58.0
## 27530 1.07 Premium 18279 D IF 60.9 58.0
## 27635 1.04 Very Good 18542 D IF 61.3 56.0
## 27507 1.09 Very Good 18231 D IF 61.7 58.0
Take a random sample of 5 rows from the data-frame:
print(diamondsSmall >> sample_n(5))
## carat cut price color clarity depth table
## 320 0.71 Good 2801 F VS2 57.8 60.0
## 9813 0.91 Premium 4670 H VS1 61.8 54.0
## 11795 1.18 Very Good 5088 E SI2 62.5 60.0
## 11845 0.95 Very Good 5101 D SI1 63.7 55.0
## 11552 1.17 Ideal 5032 F SI1 63.0 54.0
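Because the sample is random, the rows printed will differ on every run. If a repeatable sample is needed, one option is to use pandas' own DataFrame.sample with a fixed random_state. This is a sketch that assumes you are happy to drop down to plain pandas for this step; the small data-frame is a stand-in built from the five rows printed above, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Stand-in data: the five sampled rows printed above (subset of columns)
df = pd.DataFrame({
    "carat": [0.71, 0.91, 1.18, 0.95, 1.17],
    "cut": ["Good", "Premium", "Very Good", "Very Good", "Ideal"],
    "price": [2801, 4670, 5088, 5101, 5032],
})

# Fixing random_state makes the sample repeatable between runs
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)
print(first.equals(second))  # True

# The same idea for a fraction of the rows (0.4 of 5 rows gives 2 rows)
fraction = df.sample(frac=0.4, random_state=42)
print(len(fraction))  # 2
```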
Add a column to the data-frame containing the rounded value of ‘carat’:
print((diamondsSmall >> mutate(carat_bin=X.carat.round()) >> sample_n(5)))
## carat cut price color clarity depth table carat_bin
## 11883 0.99 Very Good 5112 F SI1 62.5 58.0 1.0
## 45123 0.77 Fair 1651 D SI2 65.1 63.0 1.0
## 51630 0.31 Premium 544 E SI1 59.2 60.0 0.0
## 49382 0.51 Very Good 2102 G IF 62.6 56.0 1.0
## 18296 1.54 Very Good 7437 I SI2 63.3 60.0 2.0
Python implementation of ggplot2
The tidyverse package ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
A Python port of ggplot2 has long been requested and there are now a few Python implementations of it; Plotnine is the one we will explore here. Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and create, while simple plots remain simple.
Plotnine can be installed using pip:
pip install plotnine
Plotnine splits plotting into three distinct parts: data, aesthetics and layers. The data step adds the data to the graph, the aesthetics (aes) step maps variables to visual attributes, and the layers step creates the objects drawn on the plot. Multiple aesthetic and layer functions can be added to a Plotnine graph.
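To make the idea of adding aesthetics and layers to a plot more concrete, here is a toy model of a plot specification that is built up with the + operator. It only records what has been added and draws nothing. It is a conceptual sketch, not how Plotnine is implemented, and the class names Plot, Aes and Layer are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Aes:
    """Toy aesthetic mapping, e.g. {'x': 'carat', 'y': 'price'}."""
    mapping: dict


@dataclass(frozen=True)
class Layer:
    """Toy layer, identified only by a name such as 'point'."""
    name: str


@dataclass(frozen=True)
class Plot:
    """Toy plot specification: data plus accumulated aesthetics and layers."""
    data: list
    mapping: dict = field(default_factory=dict)
    layers: tuple = ()

    def __add__(self, other):
        if isinstance(other, Aes):
            # A new aesthetic is merged into the ones already present
            return Plot(self.data, {**self.mapping, **other.mapping}, self.layers)
        if isinstance(other, Layer):
            # Layers accumulate in the order they are added
            return Plot(self.data, self.mapping, self.layers + (other,))
        return NotImplemented


spec = (Plot(data=[])
        + Aes({"x": "carat", "y": "price"})
        + Aes({"color": "cut"})
        + Layer("smooth")
        + Layer("point"))

print(spec.mapping)                           # {'x': 'carat', 'y': 'price', 'color': 'cut'}
print([layer.name for layer in spec.layers])  # ['smooth', 'point']
```

Each + returns a new specification rather than modifying the old one, so a partly built plot can be reused as the starting point for several variations.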
If you are a Python user used to Matplotlib, a Grammar of Graphics plotting tool can take some getting used to, partly because of the difference in philosophy. Plotnine provides some tutorials to help you get to grips with the package, and there is also the Plotnine README. However, if you are new to Grammar of Graphics plotting, this highly recommended Kaggle notebook for Plotnine is probably the best place to start.
Here are some examples of how to use plotnine to visualize data from the ‘diamonds’ data-frame that comes with Dplython.
Import Python packages, the ‘diamonds’ data-frame and create a sample data-frame:
import warnings; warnings.filterwarnings("ignore") # hide Python warnings
import pandas
import dplython
from plotnine import *
diamondsSample = dplython.diamonds >> dplython.sample_n(5000)
Create a scatter plot of ‘carat’ vs ‘price’:
print(ggplot(diamondsSample) # diamondsSample is the data
+ aes('carat', 'price') # plot 'carat' vs 'price'
+ geom_point() # display the results as a scatter plot
)
## <ggplot: (41012744)>
Add additional layers e.g. a line of best fit:
print(ggplot(diamondsSample)
+ aes('carat', 'price')
+ stat_smooth() # add a line of best fit
+ geom_point())
## <ggplot: (-9223372036813567705)>
Add another aesthetic, here the data is coloured by the ‘cut’ variable:
print(ggplot(diamondsSample)
+ aes('carat', 'price')
+ aes(color='cut') # colour the data by the variable 'cut' and create a legend
+ geom_point())
## <ggplot: (-9223372036816020904)>
Add a layer which separates the data into graphs based on ‘color’:
print(ggplot(diamondsSample)
+ aes('carat', 'price')
+ aes(color='cut')
+ facet_wrap('color') # separate the data by 'color' and graph each group separately
+ geom_point())
## <ggplot: (64014519)>
This article compares a variety of alternative plotting packages for Python.
Next steps
- Read the documents that are linked in this blog post.
- Learn the basics of Pandas.
- Use Dplython and Plotnine to practice data manipulation & visualization. For example, complete some of the exercises on Kaggle.
Do you know of other good Python implementations of tidyverse? If so let us know about them!