Data Science Fundamentals

Steph Locke

2018-01-25

Agenda

  • Business challenge
  • Process
  • Data & EDA
  • Sampling
  • Modelling
  • Evaluation
  • Operationalising
  • Monitoring

Steph Locke & Locke Data

Business Challenge

Business Challenge/goal

  • Increase customer profitability
  • Increase quantity of customers
  • Reduce overheads

Data science challenge

Find the lever you can push on to change behaviours in a way that helps with the business goal.

Getting started

Tips

  • Pick something only somewhat important and valuable to begin
  • Find many levers

Process

CRISP-DM

Data science routes

Getting started

Tips

  • Iterate
  • Prototype

Next steps

Data & EDA

Data

  • Do you have enough data?
  • What biases are in the data that you might end up reinforcing?
  • Have there been changes over time that alter what the data means?
  • Does it actually measure what you think it’s measuring?

Extra data

  • Where can you get extra information from?
  • Do the join criteria work?
  • Will you be able to get it for production purposes?
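
One way to check whether the join criteria work is to count the keys that fail to match in each direction, and to look for duplicate keys in the lookup table. This is a minimal sketch with dplyr on two small made-up tables; flights_demo and carriers_demo are hypothetical stand-ins, not the database tables used in this talk.

```r
library(dplyr)

# Hypothetical stand-ins for a fact table and a lookup table
flights_demo  <- tibble(flight  = 1:5,
                        carrier = c("AA", "B6", "DL", "ZZ", NA))
carriers_demo <- tibble(carrier = c("AA", "B6", "DL", "EV"),
                        name    = c("American", "JetBlue", "Delta", "ExpressJet"))

# Rows in the fact table that find no partner in the lookup
unmatched_flights <- anti_join(flights_demo, carriers_demo, by = "carrier")

# Lookup rows that are never used by the fact table
unused_carriers <- anti_join(carriers_demo, flights_demo, by = "carrier")

# Duplicate keys in the lookup would multiply rows after a join
duplicate_keys <- carriers_demo %>%
  count(carrier) %>%
  filter(n > 1)

nrow(unmatched_flights)
nrow(unused_carriers)
nrow(duplicate_keys)
```

Any non-zero count is worth investigating before trusting an inner join, because an inner join silently drops the unmatched rows.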

Exploration

  • Analyse the heck out of that data!
  • Create extra “features”

Visualisation

Getting started

Tips

  • Data dictionaries
  • Code everything

Next steps

Example code

library(DBI)
library(odbc)

# Connection details for the SQL Server database
driver   <- "ODBC Driver 13 for SQL Server"
server   <- "lockedata.westeurope.cloudapp.azure.com"
database <- "datasci"
uid      <- "u001"
pwd      <- "HBBFSE"

# Open a DBI connection through the odbc driver
dbConn <- dbConnect(odbc(),
                    driver = driver, server = server,
                    database = database, uid = uid,
                    pwd = pwd)

Example code

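library(tidyverse)
library(dbplyr)

# Lazy references to database tables; no rows are pulled into R yet
flights  <- tbl(dbConn, "flights")
carriers <- tbl(dbConn, "flights_carriers")

# Join on the columns the two tables share; dbplyr translates this to SQL
flights %>% 
  inner_join(carriers)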
## # Source: lazy query [?? x 20]
## # Database: Microsoft SQL Server 14.00.1000[u001@lockedata/datasci]
##     year month   day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     1     1    542     540   2.00   923    850  33.0  AA     1141
##  2  2013     1     1    544     545  -1.00  1004   1022 -18.0  B6      725
##  3  2013     1     1    554     600  -6.00   812    837 -25.0  DL      461
##  4  2013     1     1    555     600  -5.00   913    854  19.0  B6      507
##  5  2013     1     1    557     600  -3.00   709    723 -14.0  EV     5708
##  6  2013     1     1    557     600  -3.00   838    846 - 8.00 B6       79
##  7  2013     1     1    558     600  -2.00   753    745   8.00 AA      301
##  8  2013     1     1    558     600  -2.00   849    851 - 2.00 B6       49
##  9  2013     1     1    558     600  -2.00   853    856 - 3.00 B6       71
## 10  2013     1     1    559     600  -1.00   941    910  31.0  AA      707
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, n <int>

Example code

library(skimr)

# Pull the rows into R, then summarise every column
flights %>%
  collect() %>% 
  skim()
## Skim summary statistics
##  n obs: 336776 
##  n variables: 19 
## 
## Variable type: character 
##  variable missing complete      n min max empty n_unique
##   carrier       0   336776 336776   2   2     0       16
##      dest       0   336776 336776   3   3     0      105
##    origin       0   336776 336776   3   3     0        3
##   tailnum    2512   334264 336776   5   6     0     4043
## 
## Variable type: integer 
##        variable missing complete      n    mean      sd   p0  p25 median
##        arr_time    8713   328063 336776 1502.05  533.26    1 1104   1535
##             day       0   336776 336776   15.71    8.77    1    8     16
##        dep_time    8255   328521 336776 1349.11  488.28    1  907   1401
##          flight       0   336776 336776 1971.92 1632.47    1  553   1496
##           month       0   336776 336776    6.55    3.41    1    4      7
##  sched_arr_time       0   336776 336776 1536.38  497.46    1 1124   1556
##  sched_dep_time       0   336776 336776 1344.25  467.34  106  906   1359
##            year       0   336776 336776 2013       0    2013 2013   2013
##   p75 p100     hist
##  1940 2400 ▁▁▃▇▆▆▇▆
##    23   31 ▇▇▇▇▆▇▇▇
##  1744 2400 ▁▁▇▆▆▇▆▂
##  3465 8500 ▇▅▂▃▂▁▁▁
##    10   12 ▇▅▇▃▅▇▅▇
##  1945 2359 ▁▁▂▇▆▇▇▆
##  1729 2359 ▁▃▇▆▆▇▇▂
##  2013 2013 ▁▁▁▇▁▁▁▁
## 
## Variable type: numeric 
##   variable missing complete      n    mean     sd  p0 p25 median  p75 p100
##   air_time    9430   327346 336776  150.69  93.69  20  82    129  192  695
##  arr_delay    9430   327346 336776    6.9   44.63 -86 -17     -5   14 1272
##  dep_delay    8255   328521 336776   12.64  40.21 -43  -5     -2   11 1301
##   distance       0   336776 336776 1039.91 733.23  17 502    872 1389 4983
##       hour       0   336776 336776   13.18   4.66   1   9     13   17   23
##     minute       0   336776 336776   26.23  19.3    0   8     29   44   59
##      hist
##  ▇▇▂▃▁▁▁▁
##  ▇▁▁▁▁▁▁▁
##  ▇▁▁▁▁▁▁▁
##  ▆▇▂▂▁▁▁▁
##  ▁▃▇▆▅▇▇▂
##  ▇▂▃▃▅▂▃▅
## 
## Variable type: POSIXct 
##   variable missing complete      n        min        max     median
##  time_hour       0   336776 336776 2013-01-01 2013-12-31 2013-07-03
##  n_unique
##      6936

Sampling

Sampling basics

  • [OPTIONAL] Dataset for missing data
  • Dataset for building your model
  • Dataset for testing your model

Considerations

  • Balanced or unbalanced
  • Bootstrapping
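
Bootstrapping resamples the rows with replacement many times, refits the model each time, and looks at how much the estimates move. A minimal sketch with modelr and purrr on the built-in mtcars data; the dataset, the formula, the seed, and the 200 resamples are all illustrative assumptions.

```r
library(modelr)
library(purrr)

set.seed(1)  # arbitrary seed so the resamples are repeatable

# 200 bootstrap resamples of the built-in mtcars data
boot <- bootstrap(mtcars, 200)

# Refit the same model on each resample and keep the slope for wt
slopes <- map_dbl(boot$strap, ~ coef(lm(mpg ~ wt, data = .x))[["wt"]])

# Spread of the slope across resamples (a rough 95% interval)
quantile(slopes, c(0.025, 0.975))
```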

Getting started

Tips

  • Make samples reproducible
  • Don’t double-dip!
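
Both tips can be checked in code: fixing the random seed makes a split repeatable, and comparing row indices confirms that no row is used for both building and testing. A small sketch with modelr on the built-in mtcars data; the seed value is arbitrary and chosen only for illustration.

```r
library(modelr)

set.seed(20180125)  # arbitrary seed, fixed so the split can be reproduced
split_a <- resample_partition(mtcars, c(train = 0.7, test = 0.3))

set.seed(20180125)  # same seed again gives the same split
split_b <- resample_partition(mtcars, c(train = 0.7, test = 0.3))

# TRUE when the two runs picked exactly the same training rows
same_split <- identical(split_a$train$idx, split_b$train$idx)

# Number of rows that appear in both train and test (should be zero)
overlap <- length(intersect(split_a$train$idx, split_a$test$idx))

same_split
overlap
```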

Next steps

Example code

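library(modelr)

# resample_partition() needs an in-memory data frame, so collect() first
flights %>% 
  collect() %>% 
  resample_partition(c("train"=0.7,"test"=0.3))  ->
  samples

# Row counts for each partition
samples %>% 
  map(nrow)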
## $train
## [1] 235743
## 
## $test
## [1] 101033

Modelling

Models

  • Supervised vs unsupervised
  • Parametric vs non-parametric

Models

  • Regression
  • Trees
  • Others

Candidate models

  • Simple model
  • Complex model
  • Different model types
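
One way to work with candidate models is to fit several on the same training data and compare them on the same held-out rows. The sketch below is an illustration on the built-in mtcars data, not the models from this talk; the formulas, the seed, and the choice of rpart for the tree are assumptions.

```r
library(modelr)
library(rpart)  # decision trees; a recommended package shipped with R

set.seed(42)  # arbitrary seed so the split is repeatable
parts <- resample_partition(mtcars, c(train = 0.7, test = 0.3))
train <- as.data.frame(parts$train)
test  <- as.data.frame(parts$test)

# A simple model, a more complex one, and a different model type
candidates <- list(
  simple_lm  = lm(mpg ~ wt, data = train),
  complex_lm = lm(mpg ~ wt + hp + disp + qsec, data = train),
  tree       = rpart(mpg ~ wt + hp + disp + qsec, data = train)
)

# Root mean squared error of each candidate on the held-out test rows
test_rmse <- sapply(candidates, function(m) {
  sqrt(mean((test$mpg - predict(m, newdata = test))^2))
})

sort(test_rmse)
```

Keeping the candidates in a named list makes it cheap to add or drop a model without touching the comparison code.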

Getting started

Tips

  • Models are cattle not pets

Next steps

Example code

# Fit a linear model on the training partition only
samples %>% 
  pluck("train") %>% 
  lm(arr_delay ~ as.factor(month) + as.factor(day) + hour , data=.) ->
  initial_lm

initial_lm
## 
## Call:
## lm(formula = arr_delay ~ as.factor(month) + as.factor(day) + 
##     hour, data = .)
## 
## Coefficients:
##        (Intercept)   as.factor(month)2   as.factor(month)3  
##          -15.28463            -0.81240            -0.11731  
##  as.factor(month)4   as.factor(month)5   as.factor(month)6  
##            4.40661            -2.49110            10.46290  
##  as.factor(month)7   as.factor(month)8   as.factor(month)9  
##           10.53468            -0.09916            -9.92022  
## as.factor(month)10  as.factor(month)11  as.factor(month)12  
##           -6.02781            -5.64951             8.71773  
##    as.factor(day)2     as.factor(day)3     as.factor(day)4  
##           -0.94169            -3.31940            -9.11113  
##    as.factor(day)5     as.factor(day)6     as.factor(day)7  
##           -6.55287            -8.87207             2.02679  
##    as.factor(day)8     as.factor(day)9    as.factor(day)10  
##           11.16844             1.72884             7.07820  
##   as.factor(day)11    as.factor(day)12    as.factor(day)13  
##            3.19050             3.19297             2.20512  
##   as.factor(day)14    as.factor(day)15    as.factor(day)16  
##           -4.11497            -9.07249            -3.58548  
##   as.factor(day)17    as.factor(day)18    as.factor(day)19  
##            1.84259             2.82169             2.48053  
##   as.factor(day)20    as.factor(day)21    as.factor(day)22  
##           -6.24390            -4.83978            10.21620  
##   as.factor(day)23    as.factor(day)24    as.factor(day)25  
##            9.46657             3.37368             2.71522  
##   as.factor(day)26    as.factor(day)27    as.factor(day)28  
##           -4.20106            -3.86400             0.53802  
##   as.factor(day)29    as.factor(day)30    as.factor(day)31  
##           -7.59542            -6.91899            -4.44925  
##               hour  
##            1.67308

Evaluation

Critical Success Factors

  • False positives vs false negatives
  • Ranking
  • Aligns with experts
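
When false positives and false negatives carry different costs, it helps to count them separately and weight them. A base R sketch on made-up labels; the vectors and the cost figures are hypothetical and exist only to show the arithmetic.

```r
# Made-up actual outcomes and model predictions for ten cases
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Cross-tabulate actual against predicted
confusion <- table(actual = actual, predicted = predicted)

false_positives <- sum(!actual & predicted)
false_negatives <- sum(actual & !predicted)

# Hypothetical business costs per error type
cost_fp <- 5
cost_fn <- 50

total_cost <- false_positives * cost_fp + false_negatives * cost_fn

confusion
total_cost
```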

Data diving

  • Segments
  • Structural weaknesses
  • Test data

Getting started

Tips

  • Don’t rely on just a single metric
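
A sketch of looking at more than one number: several overall metrics on the test data, plus the error broken down by segment. It uses modelr and dplyr on the built-in mtcars data; the model, the seed, and the use of cyl as the segment are illustrative assumptions.

```r
library(modelr)
library(dplyr)

set.seed(42)  # arbitrary seed so the split is repeatable
parts <- resample_partition(mtcars, c(train = 0.7, test = 0.3))
train <- as.data.frame(parts$train)
test  <- as.data.frame(parts$test)

model <- lm(mpg ~ wt + hp, data = train)

# Several overall metrics, all computed on the held-out test rows
overall <- c(
  rmse    = rmse(model, test),
  mae     = mae(model, test),
  rsquare = rsquare(model, test)
)

# Error by segment: does the model do worse for some groups?
by_segment <- test %>%
  add_predictions(model) %>%
  group_by(cyl) %>%
  summarise(rows = n(), mean_abs_error = mean(abs(mpg - pred)))

overall
by_segment
```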

Example code

library(broom)

# One-row summary of fit statistics (computed on the training data)
initial_lm %>% 
  glance()
##    r.squared adj.r.squared    sigma statistic p.value df   logLik     AIC
## 1 0.06632623     0.0661551 43.16081  387.5659       0 43 -1188044 2376176
##       BIC  deviance df.residual
## 1 2376631 426858378      229142

Operationalising

Features

  • ETL for new data and calculations
  • What data quality stuff had to be done?

Model

  • How will you store the model?
  • Does it need versioning?
  • When will it need to be updated and how?
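
A very simple way to store a model with a version is to save it alongside some metadata and put the version in the file name. This base R sketch writes to a temporary directory; the version string, the metadata fields, and the file naming scheme are assumptions for illustration.

```r
# Illustrative model on the built-in mtcars data
model <- lm(mpg ~ wt, data = mtcars)

# Bundle the model with metadata you will want later
artifact <- list(
  model      = model,
  version    = "0.1.0",            # hypothetical version label
  trained_on = Sys.Date(),
  r_version  = R.version.string
)

# Put the version in the file name so older models are not overwritten
path <- file.path(tempdir(), paste0("mpg_model_v", artifact$version, ".rds"))
saveRDS(artifact, path)

# Later, or on another machine: reload and predict
restored <- readRDS(path)
same_predictions <- isTRUE(all.equal(predict(restored$model, mtcars),
                                     predict(model, mtcars)))
same_predictions
```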

Technology

  • What’s the easiest way of getting it live?
  • What’s the long-term way of getting it live?
  • What’s your “bus factor”?

Getting started

Tips

  • KISS
  • Operationalising a model often takes longer than the modelling exercise (at least initially)

Next steps

  • Check out R in SQL Server
  • Check out Azure ML

Monitoring

Logging

  • Log results
  • Log all the things
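
Logging results can start as small as appending one row per prediction to a file. The helper below, log_prediction(), is a hypothetical function written for this sketch in base R; the column names and the CSV format are assumptions, and a database table would serve the same purpose.

```r
# Hypothetical helper: append one prediction to a CSV log
log_prediction <- function(log_path, input_id, prediction, model_version) {
  entry <- data.frame(
    logged_at     = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    input_id      = input_id,
    prediction    = prediction,
    model_version = model_version,
    stringsAsFactors = FALSE
  )
  first_write <- !file.exists(log_path)
  # Write the header only once, then append
  write.table(entry, log_path, sep = ",", row.names = FALSE,
              col.names = first_write, append = !first_write)
  invisible(entry)
}

log_path <- tempfile(fileext = ".csv")
log_prediction(log_path, input_id = 1, prediction = 12.5, model_version = "0.1.0")
log_prediction(log_path, input_id = 2, prediction = -3.0, model_version = "0.1.0")

logged <- read.csv(log_path, stringsAsFactors = FALSE)
logged
```

Recording the model version with every prediction makes it possible to tell later which model produced which result.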

Metrics

  • Measure the business lever & other KPIs
  • Set tolerances for negative impacts on other metrics
  • J-performance

Holdouts

  • Always have a control group
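
A control group can be created by randomly holding back a share of customers from the model-driven action and then comparing the business metric between the two groups. A base R sketch follows; the 10% holdout share, the customer table, and the spend values are all simulated for illustration and are not results from this talk.

```r
set.seed(2018)  # arbitrary seed so the assignment is repeatable

customers <- data.frame(customer_id = 1:1000)

# Hold back roughly 10% of customers who never receive the model-driven action
customers$group <- ifelse(runif(nrow(customers)) < 0.10, "control", "treated")

# SIMULATED outcome, for illustration only; in practice use the observed KPI
customers$spend <- rnorm(
  nrow(customers),
  mean = ifelse(customers$group == "treated", 105, 100),
  sd   = 20
)

# Compare the metric between groups
group_means <- tapply(customers$spend, customers$group, mean)
uplift <- group_means[["treated"]] - group_means[["control"]]

table(customers$group)
uplift
```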

Getting started

Tips

  • Plan for monitoring, don’t make it an afterthought

Conclusion

Process

Tips

  • Pick something only somewhat important and valuable to begin
  • Find many levers
  • Iterate
  • Prototype
  • Data dictionaries
  • Code everything
  • Make samples reproducible
  • Don’t double-dip!

Tips

  • Models are cattle not pets
  • Don’t rely on just a single metric
  • KISS
  • Operationalising a model often takes longer than the modelling exercise (at least initially)
  • Plan for monitoring, don’t make it an afterthought

Follow up