Cut the R Learning Curve

Steph Locke (@theStephLocke)

2018-06-29

Agenda

  • R
  • R + Microsoft
  • RefresheR
  • Super R
  • R for Reporting
  • Develop R
  • Administer R
  • Wrap up

R

R is an integrated suite of software facilities for data manipulation, calculation and graphical display

  • Open source
  • In-memory & single-core (by default)
  • Multi-platform
  • Extensible environment
  • Delivered by the R Foundation, supported by the R Consortium, grown by R developers
  • r-project.org

Setup on Windows

  • Get R
  • Get Rtools
  • Get RStudio, or Visual Studio + R Tools for Visual Studio

There’s a package for that

How it hangs together

Next steps

  1. Check out the R website
  2. Install R & RStudio (or R Tools for Visual Studio)

R + Microsoft

Microsoft R Server

  • Formerly Revolution R Enterprise
  • Available on Windows and Linux
  • Specialised connectors available for SQL Server, Hadoop, Oracle, & Teradata

AzureML

  • Uses R throughout

Power BI

  • R as a data source
  • Use R for data visualisations
  • Any R variant

SQL Server

  • Call R from inside SQL

RefresheR

Syntax

Action                                     | Operator | Example
Create / update a variable                 | <-       | a <- 10
Comment                                    | #        | # This is my comment
Help                                       | ?        | ?mean
Identifier                                 | `        | `1` <- 2
Get a component, e.g. a data.frame column  | $        | iris$Sepal.Length
Reference positions within an object       | [ ]      | iris[1, 1]
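
The operators in the table can be tried in a fresh R session. What follows is a minimal sketch using the built-in iris dataset; the variable names (a, b, sepal_lengths, first_cell) are illustrative only.

```r
# Create / update a variable
a <- 10
a <- a + 1                  # update: a now holds 11

# Backticks allow names that are not normally valid identifiers
`1` <- 2
b <- `1` + a                # uses the variable called 1, not the number

# $ gets a component, here a data.frame column
sepal_lengths <- iris$Sepal.Length

# [ ] references positions within an object: row 1, column 1
first_cell <- iris[1, 1]

# ?mean would open the help page for mean() in an interactive session
```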

Objects

  • Vector
  • Matrix
  • Array
  • List
  • data.frame
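
As a rough illustration, each object type listed above can be created in base R as follows. The names and values are made up for the example.

```r
# Vector: an ordered set of values of a single type
v <- c(1, 2, 3)

# Matrix: a two-dimensional vector (2 rows x 3 columns here)
m <- matrix(1:6, nrow = 2)

# Array: the same idea with any number of dimensions
arr <- array(1:24, dim = c(2, 3, 4))

# List: elements can be of different types and lengths
l <- list(name = "iris", n = 150L, flags = c(TRUE, FALSE))

# data.frame: a list of equal-length columns, shown as a table
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
```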

Next steps

Super R

data.table

  1. Super-fast slicing & dicing
  2. Low memory footprint vs data.frames
  3. Fast joins
  4. Auto-indexing
  5. Many saved characters!
  6. Active dev

data.table

Task               | How
Read CSV           | irisDT <- fread("iris.csv")
Return everything  | irisDT or irisDT[ ]
Select columns     | irisDT[ , .(Sepal.Length, Sepal.Width) ]
Restrict rows      | irisDT[ Sepal.Length >= 5 , ]
Aggregate          | irisDT[ , mean(Sepal.Length) ]
Aggregate by group | irisDT[ , mean(Sepal.Length) , Species ]
Count              | irisDT[ , .N ]
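
The tasks in the table can be run end to end without a CSV file. This sketch assumes the data.table package is installed and converts the built-in iris dataset with as.data.table() as a stand-in for fread("iris.csv"); the result names are illustrative.

```r
library(data.table)

# Stand-in for fread("iris.csv"): convert the built-in data.frame
irisDT <- as.data.table(iris)

# Select columns
cols <- irisDT[, .(Sepal.Length, Sepal.Width)]

# Restrict rows
longer <- irisDT[Sepal.Length >= 5, ]

# Aggregate over the whole table
overall_mean <- irisDT[, mean(Sepal.Length)]

# Aggregate by group, naming the result column
by_species <- irisDT[, .(mean_length = mean(Sepal.Length)), by = Species]

# Count rows
n_rows <- irisDT[, .N]
```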

The “Hadleyverse”

Hadley Wickham, an insanely prolific developer of R packages, has produced a great ecosystem:

  • httr
  • ggplot2
  • purrr
  • readxl
  • haven
  • dplyr
  • rvest

dplyr

  • filter() (and slice())
  • arrange()
  • select() (and rename())
  • distinct()
  • mutate() (and transmute())
  • summarise()
  • sample_n() and sample_frac()
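
To give a feel for how the verbs chain together, here is a small sketch on the built-in iris dataset. It assumes dplyr is installed and also uses group_by(), which is not in the list above but is the usual companion to summarise().

```r
library(dplyr)

result <- iris %>%
  filter(Sepal.Length >= 5) %>%                      # keep some rows
  select(Species, Sepal.Length, Sepal.Width) %>%     # keep some columns
  mutate(ratio = Sepal.Length / Sepal.Width) %>%     # add a column
  group_by(Species) %>%                              # one group per species
  summarise(mean_ratio = mean(ratio), n = n()) %>%   # one row per group
  arrange(desc(mean_ratio))                          # sort the summary
```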

dplyr

  1. Relatively clear verbs
  2. Quite easy to get started with
  3. Verbose

ggplot2

  1. Clean conceptual implementation
  2. Highly customisable
  3. Simple to start with
  4. Big ecosystem

ggplot2

Term                       | Explanation                                           | Example(s)
plot                       | A plot using the grammar of graphics                  | ggplot()
aesthetics                 | attributes of the chart                               | colour, x, y
mapping                    | relating a column in your data to an aesthetic        |
statistical transformation | a translation of the raw data into a refined summary  | stat_density()
geometry                   | the display of aesthetics                             | geom_line(), geom_bar()
scale                      | the range of values                                   | axes, legends
coordinate system          | how geometries get laid out                           | coord_flip()
facet                      | a means of subsetting the chart                       | facet_grid()
theme                      | display properties                                    | theme_minimal()
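
The terms above map onto code fairly directly. As a sketch, assuming ggplot2 is installed and using the built-in iris dataset (the axis label is made up for the example):

```r
library(ggplot2)

p <- ggplot(iris,                                   # plot
            aes(x = Sepal.Length,                   # mapping of columns
                y = Sepal.Width,                    # to aesthetics
                colour = Species)) +
  geom_point() +                                    # geometry
  scale_x_continuous(name = "Sepal length") +       # scale
  facet_grid(. ~ Species) +                         # facet
  theme_minimal()                                   # theme

# print(p) draws the chart on the active graphics device
```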

caret

A single point of contact for myriad statistical & machine learning algorithms
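
As an illustration of that single interface, the sketch below fits a linear model through caret's train() on the built-in mtcars dataset. It assumes caret is installed; the choice of model, formula and five-fold cross-validation is arbitrary. Switching algorithm is mostly a matter of changing the method argument, provided the package behind that method is installed.

```r
library(caret)

set.seed(42)  # make the cross-validation folds reproducible

# Resampling scheme shared by whichever model is trained
ctrl <- trainControl(method = "cv", number = 5)

# Same call shape regardless of the underlying algorithm
fit_lm <- train(mpg ~ wt + hp, data = mtcars,
                method = "lm", trControl = ctrl)

# Predictions come back through the same generic interface
preds <- predict(fit_lm, newdata = mtcars)
```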

AzureML

Interact with AzureML in R

  1. No GUI
  2. All the development environments of R
  3. Consume AzureML webservice

rmarkdown

Write markdown, interweave code

  1. Data provenance
  2. Ease of documenting & developing solutions
  3. Extendible & customisable
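
A minimal sketch of the idea: the script below writes a tiny R Markdown document to a temporary file and renders it. It assumes the rmarkdown and knitr packages are installed; the file name and document contents are invented for the example, and rendering is skipped when pandoc is not available.

```r
library(rmarkdown)

# Three backticks, built programmatically so the chunk markers are easy to read
fence <- strrep("`", 3)

# Markdown text with one interwoven R chunk
rmd <- c(
  "---",
  "title: \"Iris summary\"",
  "output: html_document",
  "---",
  "",
  "Average sepal length by species:",
  "",
  paste0(fence, "{r}"),
  "aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)",
  fence
)

rmd_path <- file.path(tempdir(), "iris-summary.Rmd")
writeLines(rmd, rmd_path)

# Rendering runs the chunk and embeds its output in the HTML report
if (pandoc_available()) {
  out_path <- render(rmd_path, quiet = TRUE)
}
```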

miniCRAN

Make your own CRAN

  1. R interface
  2. Control over packages
  3. Internal package deployment

Next steps

  1. Play with the popular packages
  2. Look at CRAN task views
  3. Ask for recommendations!

R for Reporting

Types of reports

  • Interactive
  • Static

Interactive reports

  • shiny
  • flexdashboard
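
For a sense of scale, a complete interactive report in shiny can be very short. This sketch assumes shiny is installed; the input and output ids ("bins", "hist") are made up for the example.

```r
library(shiny)

# User interface: one slider and one plot
ui <- fluidPage(
  sliderInput("bins", "Number of bins", min = 5, max = 30, value = 10),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(iris$Sepal.Length, breaks = input$bins,
         main = "Sepal length", xlab = "cm")
  })
}

app <- shinyApp(ui, server)

# runApp(app) would launch it in a browser
```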

Static reports

  • rmarkdown
  • knitr

I’m going to show off

Develop R

Utility packages

  • testthat
  • devtools (useful for other things)
  • microbenchmark
  • packrat

Package development

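library(devtools)
pkg <- "newPackage"
create(pkg) # Create the package skeleton

library(devtools) # Open the project!
use_testthat() # Add unit test framework (formerly add_test_infrastructure())
use_travis() # Add CI framework (formerly add_travis()); newer usethis releases drop it in favour of use_github_action()
use_vignette("newPackage") # Add folder for macro-level help files; a vignette name is required
use_package_doc() # Add file for providing info about your package
use_readme_rmd() # Make a README!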

What to unit test

  1. A single set of values that represents “normal” input, expecting the result to match a known correct answer
  2. A dataset that represents “normal” input, expecting the results to match known correct answers
  3. Various permutations of bad input values, expecting errors (ideally specific error messages)
  4. Edge cases that cover extreme values or boundaries for any inequalities or conditions
  5. Any bugs or compatibility issues

How to test

myfunc <- function(a = 1, b = 2, c = "blah") {
  stopifnot(is.numeric(a), is.numeric(b), is.character(c))
  d <- ifelse(a < 0, a * b, a / b)
  e <- paste0(c, d)
  return(e)
}
library(testthat)
# Add a high-level name for group of tests, typically the function name
context("myfunc")

# Simplest test
test_that("Defaults return expected result",{
  result<-myfunc()
  check<-"blah0.5"
  expect_equal(result,check)
})
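
Beyond the simplest test, the bad-input and edge-case categories might be covered along these lines. This is a sketch, not an exhaustive suite: it repeats the myfunc definition so it runs on its own, and the error messages matched are the ones stopifnot() generates for these conditions.

```r
library(testthat)

# Same function as above, repeated so this block is self-contained
myfunc <- function(a = 1, b = 2, c = "blah") {
  stopifnot(is.numeric(a), is.numeric(b), is.character(c))
  d <- ifelse(a < 0, a * b, a / b)
  paste0(c, d)
}

# Bad input values should fail with a specific message
test_that("Bad input types raise errors", {
  expect_error(myfunc(a = "1"), "is.numeric(a) is not TRUE", fixed = TRUE)
  expect_error(myfunc(c = 1), "is.character(c) is not TRUE", fixed = TRUE)
})

# The a < 0 condition has a boundary at zero
test_that("Boundary at a = 0 uses the division branch", {
  expect_equal(myfunc(a = 0), "blah0")
  expect_equal(myfunc(a = -1), "blah-2")
})
```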

Next steps

  1. Read Hadley’s Package Development guide
  2. Practice!

Administer R

Infrastructure

  • Windows or Linux?
  • Linux!
  • Because …
  • RStudio Server
  • System Dependencies

Infrastructure

  • High RAM
  • CPU should be decent
  • Physical storage can be low
  • Virus scanning not a problem
  • AD auth (fairly simple)
  • File share availability
  • Backups or scripted builds?

R

  • Microsoft R or base R?
  • It depends!
  • Need cutting edge R & packages?
  • Need extra computational speed or connectors?

R & Packages

  • Regular updates required
  • Unit tests of key behaviours should be done
  • Dev environments always a plus!
  • miniCRAN

Deployment

  • Ad-hoc facilities
  • Via Continuous Deployment

Next steps

Wrap up

Covered

  • R overview
  • R + Microsoft
  • RefresheR
  • Super R
  • R for Reporting
  • Develop R
  • Administer R

Next steps

  • Get the deck: itsalocke.com/talks
  • Follow up the Next Steps sections
  • Let me know how you get on: @theStephLocke

Q&A