Data Science Fundamentals

Steph Locke

2018-06-28

Agenda

  • Business challenge
  • Process
  • Data & EDA
  • Sampling
  • Modelling
  • Evaluation
  • Operationalising
  • Monitoring

Steph Locke & Locke Data

Business Challenge

Business Challenge/goal

  • Increase customer profitability
  • Increase quantity of customers
  • Reduce overheads

Data science challenge

Find the lever you can push on to change behaviours in a way that helps with the business goal.

Getting started

Tips

  • Pick something only somewhat important and valuable to begin
  • Find many levers

Process

CRISP-DM

Data science routes

Getting started

Tips

  • Iterate
  • Prototype

Next steps

Data & EDA

Data

  • Do you have enough data?
  • What biases are in the data that you might end up reinforcing?
  • Have there been changes over time that alter what the information means?
  • Does it actually measure what you think it’s measuring?

Extra data

  • Where can you get extra information from?
  • Do the join criteria work?
  • Will you be able to get it for production purposes?
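
One way to check whether the join criteria work is to look for rows that fail to match and for duplicate keys before joining. This is a minimal sketch on made-up data; `flights_toy` and `carriers_toy` are hypothetical stand-ins, not the real tables.

```r
library(dplyr)

# Toy stand-ins for a fact table and a lookup table (made-up rows)
flights_toy  <- data.frame(flight = c(1, 2, 3),
                           carrier = c("AA", "UA", "ZZ"),
                           stringsAsFactors = FALSE)
carriers_toy <- data.frame(carrier = c("AA", "UA"),
                           name = c("American", "United"),
                           stringsAsFactors = FALSE)

# Rows with no matching carrier: these would silently vanish in an inner join
unmatched <- anti_join(flights_toy, carriers_toy, by = "carrier")

# Duplicate keys in the lookup table would multiply rows after the join
duplicate_keys <- carriers_toy %>%
  count(carrier) %>%
  filter(n > 1)
```

If `unmatched` has rows or `duplicate_keys` is not empty, the join needs attention before the extra data is trusted.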

Exploration

  • Analyse the heck out of that data!
  • Create extra “features”
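
Extra features are often simple derivations of columns you already have. The sketch below assumes a table shaped like the flights data; the rows are made up and the 15-minute threshold is an illustrative assumption.

```r
library(dplyr)

# Two made-up rows shaped like the flights table
flights_toy <- data.frame(
  year = c(2013, 2013), month = c(5, 6), day = c(30, 1),
  dep_delay = c(-4, 45)
)

flights_features <- flights_toy %>%
  mutate(
    flight_date    = as.Date(ISOdate(year, month, day)),
    # Day of week as a number: 0 = Sunday ... 6 = Saturday
    weekday        = as.POSIXlt(flight_date)$wday,
    is_weekend     = weekday %in% c(0, 6),
    # Flag late departures; the 15-minute cut-off is an assumption
    late_departure = dep_delay > 15
  )
```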

Visualisation

Getting started

Tips

  • Data dictionaries
  • Code everything
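
A data dictionary can itself be started in code, so it stays in step with the data. This is a rough sketch using a made-up `customers` table; the descriptions still have to be written by someone who knows the data.

```r
# A made-up table standing in for your real data
customers <- data.frame(
  customer_id = 1:4,
  signup_date = as.Date(c("2018-01-05", "2018-02-11", NA, "2018-03-20")),
  spend       = c(120.5, NA, 80, 43.2)
)

# One row per column: name, type, and how much is missing
data_dictionary <- data.frame(
  column      = names(customers),
  type        = unname(vapply(customers, function(x) class(x)[1], character(1))),
  n_missing   = unname(vapply(customers, function(x) sum(is.na(x)), integer(1))),
  description = "",  # to be filled in by hand
  stringsAsFactors = FALSE
)
```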

Next steps

Example code

library(DBI)
library(odbc)

driver   = "ODBC Driver 13 for SQL Server"
server = "lockedata2.westeurope.cloudapp.azure.com"
database = "datasci"
uid = "lockedata"
pwd = "zll+.?=g8JA11111"


dbConn<-dbConnect(odbc(),
          driver=driver, server=server,
          database=database, uid=uid,
          pwd=pwd)

Example code

library(tidyverse)
library(dbplyr)
 
flights<-tbl(dbConn,"flights")
carriers<-tbl(dbConn,"flights_carriers") 

flights %>% 
  inner_join(carriers, by = "carrier")
## # Source:   lazy query [?? x 20]
## # Database: Microsoft SQL Server 14.00.3015[dbo@lockedata2/datasci]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     5    30     1434           1435        -1     1545
##  2  2013     5    30     1441           1445        -4     1546
##  3  2013     5    30     1448           1455        -7     1607
##  4  2013     5    30     1455           1459        -4     1614
##  5  2013     5    30     1455           1459        -4     1609
##  6  2013     5    30     1521           1530        -9     1735
##  7  2013     5    30     1529           1530        -1     1832
##  8  2013     5    30     1551           1600        -9     1652
##  9  2013     5    30     1604           1610        -6     1749
## 10  2013     5    30     1604           1608        -4     1727
## # ... with more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, name <chr>

Example code

library(skimr)
flights %>%
  collect() %>%   # pull the table into local memory before summarising
  skim()
## Skim summary statistics
##  n obs: 336776 
##  n variables: 19 
## 
## -- Variable type:character -----------------------------------------------------
##  variable missing complete      n min max empty n_unique
##   carrier       0   336776 336776   2   2     0       16
##      dest       0   336776 336776   3   3     0      105
##    origin       0   336776 336776   3   3     0        3
##   tailnum    2512   334264 336776   5   6     0     4043
## 
## -- Variable type:integer -------------------------------------------------------
##        variable missing complete      n    mean      sd   p0  p25  p50
##        arr_time    8713   328063 336776 1502.05  533.26    1 1104 1535
##             day       0   336776 336776   15.71    8.77    1    8   16
##        dep_time    8255   328521 336776 1349.11  488.28    1  907 1401
##          flight       0   336776 336776 1971.92 1632.47    1  553 1496
##           month       0   336776 336776    6.55    3.41    1    4    7
##  sched_arr_time       0   336776 336776 1536.38  497.46    1 1124 1556
##  sched_dep_time       0   336776 336776 1344.25  467.34  106  906 1359
##            year       0   336776 336776 2013       0    2013 2013 2013
##   p75 p100     hist
##  1940 2400 ▁▁▃▇▆▆▇▆
##    23   31 ▇▇▇▇▆▇▇▇
##  1744 2400 ▁▁▇▆▆▇▆▂
##  3465 8500 ▇▅▂▃▂▁▁▁
##    10   12 ▇▅▇▃▅▇▅▇
##  1945 2359 ▁▁▂▇▆▇▇▆
##  1729 2359 ▁▃▇▆▆▇▇▂
##  2013 2013 ▁▁▁▇▁▁▁▁
## 
## -- Variable type:numeric -------------------------------------------------------
##   variable missing complete      n    mean     sd  p0 p25 p50  p75 p100
##   air_time    9430   327346 336776  150.69  93.69  20  82 129  192  695
##  arr_delay    9430   327346 336776    6.9   44.63 -86 -17  -5   14 1272
##  dep_delay    8255   328521 336776   12.64  40.21 -43  -5  -2   11 1301
##   distance       0   336776 336776 1039.91 733.23  17 502 872 1389 4983
##       hour       0   336776 336776   13.18   4.66   1   9  13   17   23
##     minute       0   336776 336776   26.23  19.3    0   8  29   44   59
##      hist
##  ▇▇▂▃▁▁▁▁
##  ▇▁▁▁▁▁▁▁
##  ▇▁▁▁▁▁▁▁
##  ▆▇▂▂▁▁▁▁
##  ▁▃▇▆▅▇▇▂
##  ▇▂▃▃▅▂▃▅
## 
## -- Variable type:POSIXct -------------------------------------------------------
##   variable missing complete      n        min        max     median
##  time_hour       0   336776 336776 2013-01-01 2013-12-31 2013-07-03
##  n_unique
##      6936

Sampling

Sampling basics

  • [OPTIONAL] Dataset for missing data
  • Dataset for building your model
  • Dataset for testing your model

Considerations

  • Balanced or unbalanced
  • Bootstrapping
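
Both considerations can be sketched in a few lines of base R: check how balanced the outcome is, then bootstrap a statistic by resampling with replacement. The delay values and the "late" threshold below are made up for illustration.

```r
set.seed(42)  # arbitrary seed so the resampling can be repeated

# Made-up arrival delays in minutes
delays <- c(-5, 12, 3, 40, -2, 7, 95, 0, 15, -8)

# Balance check: what share of flights count as late (over 15 minutes)?
late <- delays > 15
late_share <- mean(late)

# Bootstrap: resample with replacement and recompute the mean each time
boot_means <- replicate(1000, mean(sample(delays, replace = TRUE)))

# Percentile interval for the mean delay
boot_ci <- quantile(boot_means, c(0.025, 0.975))
```

A small `late_share` is a hint that the outcome is unbalanced and that the sampling approach may need adjusting.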

Getting started

Tips

  • Make samples reproducible
  • Don’t double-dip!
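
Both tips can be checked in code. In this minimal sketch, setting a seed makes the split repeatable and a quick intersection test confirms no row sits in both train and test. `make_split` is a hypothetical helper and the seed value is arbitrary.

```r
n <- 100  # pretend we have 100 rows

make_split <- function(seed) {
  set.seed(seed)
  train_rows <- sample(n, size = 70)  # 70% for training
  list(train = train_rows, test = setdiff(seq_len(n), train_rows))
}

split_a <- make_split(2018)
split_b <- make_split(2018)

# Same seed, same split
same_split <- identical(split_a, split_b)

# Rows appearing in both sets would mean double-dipping
overlap <- length(intersect(split_a$train, split_a$test))
```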

Next steps

Example code

library(modelr)
flights %>% 
  collect() %>%   # resample_partition() needs a local data frame, not a lazy database table
  resample_partition(c("train"=0.7,"test"=0.3))  ->
  samples

samples %>% 
  map(nrow)
## $train
## [1] 235743
## 
## $test
## [1] 101033

Modelling

Models

  • Supervised vs unsupervised
  • Parametric vs non-parametric

Models

  • Regression
  • Trees
  • Others

Candidate models

  • Simple model
  • Complex model
  • Different model types
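
Keeping candidates in a named list makes it easy to score them all the same way. This is a sketch on the built-in `mtcars` data rather than the flights data; the formulas and split size are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed for a repeatable split

# mtcars ships with R; it stands in for real modelling data
train_rows <- sample(nrow(mtcars), 22)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# A simple and a more complex candidate; other model types can be added here
candidates <- list(
  simple  = lm(mpg ~ wt, data = train),
  complex = lm(mpg ~ wt + hp + disp + qsec, data = train)
)

# Root mean squared error on the held-back rows
rmse <- function(model) {
  sqrt(mean((test$mpg - predict(model, newdata = test))^2))
}
test_rmse <- vapply(candidates, rmse, numeric(1))
```

The complex model does not automatically win on held-back data, which is the reason for having a simple candidate alongside it.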

Getting started

Tips

  • Models are cattle not pets

Next steps

Example code

samples %>% 
  pluck("train") %>% 
  lm(arr_delay ~ as.factor(month) + as.factor(day) + hour , data=.) ->
  initial_lm

initial_lm
## 
## Call:
## lm(formula = arr_delay ~ as.factor(month) + as.factor(day) + 
##     hour, data = .)
## 
## Coefficients:
##        (Intercept)   as.factor(month)2   as.factor(month)3  
##           -15.4549             -0.7358             -0.1639  
##  as.factor(month)4   as.factor(month)5   as.factor(month)6  
##             5.0051             -2.4762             10.6554  
##  as.factor(month)7   as.factor(month)8   as.factor(month)9  
##            10.9202              0.4878             -9.8382  
## as.factor(month)10  as.factor(month)11  as.factor(month)12  
##            -6.1675             -5.6043              8.8604  
##    as.factor(day)2     as.factor(day)3     as.factor(day)4  
##            -1.2522             -3.5498             -9.4108  
##    as.factor(day)5     as.factor(day)6     as.factor(day)7  
##            -7.5698             -9.2946              1.6011  
##    as.factor(day)8     as.factor(day)9    as.factor(day)10  
##            11.8283              0.7988              7.2486  
##   as.factor(day)11    as.factor(day)12    as.factor(day)13  
##             2.6242              3.7184              1.6765  
##   as.factor(day)14    as.factor(day)15    as.factor(day)16  
##            -4.2501             -9.5330             -4.0620  
##   as.factor(day)17    as.factor(day)18    as.factor(day)19  
##             2.5303              1.8933              2.4178  
##   as.factor(day)20    as.factor(day)21    as.factor(day)22  
##            -6.4650             -5.3038              9.6782  
##   as.factor(day)23    as.factor(day)24    as.factor(day)25  
##             9.0755              2.9176              2.3482  
##   as.factor(day)26    as.factor(day)27    as.factor(day)28  
##            -4.3912             -3.6001              0.7508  
##   as.factor(day)29    as.factor(day)30    as.factor(day)31  
##            -8.2250             -7.2595             -5.0375  
##               hour  
##             1.7013

Evaluation

Critical Success Factors

  • False positives vs false negatives
  • Ranking
  • Aligns with experts
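
False positives and false negatives rarely cost the same, so it helps to count them separately. A minimal sketch with made-up outcomes and hypothetical costs:

```r
# Made-up outcomes: was the flight actually late, and did we predict late?
actual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Cross-tabulate predictions against reality
confusion <- table(predicted = predicted, actual = actual)

false_positives <- sum(predicted & !actual)  # predicted late, was on time
false_negatives <- sum(!predicted & actual)  # predicted on time, was late

# Hypothetical cost per mistake; real values come from the business
total_cost <- 10 * false_positives + 50 * false_negatives
```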

Data diving

  • Segments
  • Structural weaknesses
  • Test data

Getting started

Tips

  • Don’t rely on a single metric
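
Several metrics can be gathered side by side with little effort. The sketch below uses the built-in `mtcars` data and in-sample residuals purely for illustration; in practice these would be computed on test data.

```r
model  <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars ships with R
errors <- residuals(model)

metrics <- c(
  r_squared = summary(model)$r.squared,  # share of variance explained
  rmse      = sqrt(mean(errors^2)),      # penalises large misses more
  mae       = mean(abs(errors)),         # typical size of a miss
  worst     = max(abs(errors))           # single largest miss
)
```

A model can look fine on one of these and poor on another, for example a reasonable `mae` alongside a very large `worst`.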

Example code

library(broom)
initial_lm %>% 
  glance()
##    r.squared adj.r.squared    sigma statistic p.value df   logLik     AIC
## 1 0.06915259      0.068982 43.04348  405.3825       0 43 -1187638 2375364
##       BIC  deviance df.residual
## 1 2375819 424618625      229184

Operationalising

Features

  • ETL for new data and calculations
  • What data quality stuff had to be done?

Model

  • How will you store the model?
  • Does it need versioning?
  • When will it need to be updated and how?
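
One simple answer to the storage and versioning questions is to save the fitted model as a file with the version in its name. This is a sketch only; the naming scheme, version string, and use of `tempdir()` are assumptions.

```r
model <- lm(mpg ~ wt, data = mtcars)  # stand-in model on built-in data

# Hypothetical naming scheme: model name plus a version number
model_version <- "1.0.0"
model_path <- file.path(tempdir(), paste0("mpg_lm_v", model_version, ".rds"))

saveRDS(model, model_path)

# Later, possibly in another R session
restored <- readRDS(model_path)
new_car  <- data.frame(wt = 3)
restored_prediction <- predict(restored, new_car)
```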

Technology

  • What’s the easiest way of getting it live?
  • What’s the long-term way of getting it live?
  • What’s your “bus factor”?

Getting started

Tips

  • KISS
  • Operationalising a model often takes longer than the modelling exercise (at least initially)

Next steps

  • Check out R in SQL Server
  • Check out Azure ML

Monitoring

Logging

  • Log results
  • Log all the things
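
Logging can start as an append-only file. In this rough sketch, `log_prediction` is a hypothetical helper, and the log location and columns are assumptions; a database table would serve the same purpose.

```r
log_path <- file.path(tempdir(), "prediction_log.csv")  # hypothetical location
if (file.exists(log_path)) file.remove(log_path)        # start clean for the demo

log_prediction <- function(input_id, prediction, model_version) {
  entry <- data.frame(
    logged_at     = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    input_id      = input_id,
    prediction    = prediction,
    model_version = model_version
  )
  first_write <- !file.exists(log_path)
  # Append so earlier entries are never overwritten; header only on first write
  write.table(entry, log_path, sep = ",", row.names = FALSE,
              col.names = first_write, append = !first_write)
  invisible(entry)
}

log_prediction(101, 12.5, "1.0.0")
log_prediction(102, -3.0, "1.0.0")
prediction_log <- read.csv(log_path)
```

Recording the model version with every prediction makes it possible to tell later which model produced which result.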

Metrics

  • Measure the business lever & other KPIs
  • Set tolerances for negative impacts on other metrics
  • J-performance

Holdouts

  • Always have a control group
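
A control group can be set aside with a random draw before the model goes live. The sketch below uses simulated spend figures invented only to make the comparison runnable; the 10% holdout share is an assumption.

```r
set.seed(7)  # arbitrary seed so the assignment can be repeated

customers <- data.frame(customer_id = 1:1000)

# Hold back 10% of customers who never see the model's output
control_ids <- sample(customers$customer_id, 100)
customers$group <- ifelse(customers$customer_id %in% control_ids,
                          "control", "treated")

# Simulated spend, with a made-up uplift of 5 for treated customers
customers$spend <- rnorm(nrow(customers), mean = 100, sd = 20) +
  ifelse(customers$group == "treated", 5, 0)

# Compare the business metric between the two groups
group_means <- tapply(customers$spend, customers$group, mean)
uplift <- group_means[["treated"]] - group_means[["control"]]
```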

Getting started

Tips

  • Plan for monitoring, don’t make it an afterthought

Conclusion

Process

Tips

  • Pick something only somewhat important and valuable to begin
  • Find many levers
  • Iterate
  • Prototype
  • Data dictionaries
  • Code everything
  • Make samples reproducible
  • Don’t double-dip!

Tips

  • Models are cattle not pets
  • Don’t rely on a single metric
  • KISS
  • Operationalising a model often takes longer than the modelling exercise (at least initially)
  • Plan for monitoring, don’t make it an afterthought

Follow up