With my HIBPwned package, I consume the HaveIBeenPwned API and return back a list object with an element for each email address. Each element holds a data.frame of breach data or a stub response with a single column data.frame containing NA. Elements are named with the email addresses they relate to. I had a list of data.frames and I wanted a consolidated data.frame (well, I always want a data.table).

Enter data.table …

data.table has a very cool, and very fast function named rbindlist(). This takes a list of data.frames and consolidates them into one data.table, which can, of course, be handled as a data.frame if you didn’t want to use data.table for anything else.

Prep

For this post, I need a list with some data.frames in it. For simplicity, I’m going to split the iris dataset into three separate data.frames.

library(data.table)
myList<-list( p1=iris[1:50,]
             , p2=iris[51:100,]
             , p3=iris[101:150,]
 )

Simplest usage

If all your tables have the same columns and in the same order, then this is super simple.

> myList<-list( p1=iris[1:50,] , p2=iris[51:100,] , p3=iris[101:150,] )
> dt<-rbindlist(myList)
> head(dt)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa

Varying columns

If your data structure varies at all, you can use the arguments use.names and fill to combine the data.frames without an error. It will put columns on the RHS of the table as they appear within the list.

Here I remove the first column from the first part of our dataset to illustrate.

> myList<-list( p1=iris[1:50, -1 ] , p2=iris[51:100,] , p3=iris[101:150,] )
> dt<-rbindlist(myList, use.names=TRUE, fill=TRUE)
> head(dt)
   Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1:         3.5          1.4         0.2  setosa           NA
2:         3.0          1.4         0.2  setosa           NA
3:         3.2          1.3         0.2  setosa           NA
4:         3.1          1.5         0.2  setosa           NA
5:         3.6          1.4         0.2  setosa           NA
6:         3.9          1.7         0.4  setosa           NA

Adding the list element ID

I didn’t RTFM very well at first, so when I wanted the name of the element the specific row came from to be present in my data.frame, I (shudder) wrote this horror:

dt<-rbindlist(lapply(1:length(myList)
                     , function(x){ setDT(myList[[x]])[
  , id:=names(myList)[x]]})
  , use.names=TRUE, fill=TRUE)

When I write things like that, I know I’m in the wrong! This hideous inline function, iterating through the elements and updating values in the element, is frankly something I should be slapped for. So with it playing on my mind, I went back to the manual, and lo’ data.table was awesome.

> myList<-list( p1=iris[1:50, -1 ] , p2=iris[51:100,] , p3=iris[101:150,] ) 
> dt<-rbindlist(myList
+   , use.names=TRUE, fill=TRUE, idcol="myList")
> 
> head(dt)
   myList Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1:     p1         3.5          1.4         0.2  setosa           NA
2:     p1         3.0          1.4         0.2  setosa           NA
3:     p1         3.2          1.3         0.2  setosa           NA
4:     p1         3.1          1.5         0.2  setosa           NA
5:     p1         3.6          1.4         0.2  setosa           NA
6:     p1         3.9          1.7         0.4  setosa           NA

This used the names of the list elements but would use the numeric index number if names weren’t available.

Wrap up

So with just three little arguments, data.table’s rbindlist() bundled up my data.frames into a single data.frame whilst handling divergent columns, and providing a reference back to the original list element!

R Quick Tip: Collapse a lists of data.frames with data.table
Tagged on:         

3 thoughts on “R Quick Tip: Collapse a lists of data.frames with data.table

  • 09/04/2016 at 10:19
    Permalink

    dplyr::bind_rows() is also nice

    Reply
    • 09/04/2016 at 15:31
      Permalink

      Good shout 🙂

      This works quite neatly

      library(dplyr)
      myList<-list( p1=iris[1:50,] , p2=iris[51:100,] , p3=iris[101:150,] )
      bind_rows(myList)
      
      myList<-list( p1=iris[1:50, -1 ] , p2=iris[51:100,] , p3=iris[101:150,] )
      bind_rows(myList)
      bind_rows(myList,.id="id")
      
      Reply
      • 09/04/2016 at 15:59
        Permalink

        Did some benchmarking based on this SO post because people like to know about speed. On my small example bind_rows looks to win out but over larger volumes like exemplified in the SO post, rbindlist performs better.

        > library(dplyr)
        > library(data.table)
        > library(microbenchmark)
        > 
        > myList<-list( p1=iris[1:50, -1 ] , p2=iris[51:100,] , p3=iris[101:150,] )
        > 
        > microbenchmark(
        + bind_rows(myList,.id="id"),
        + rbindlist(myList, use.names=TRUE, fill=TRUE, idcol="myList")
        + )
        Unit: microseconds
                                                                       expr     min       lq     mean   median       uq
                                              bind_rows(myList, .id = "id")  89.508 108.9865 149.5429 145.3925 182.4940
         rbindlist(myList, use.names = TRUE, fill = TRUE, idcol = "myList") 256.001 292.1750 401.6661 417.1605 474.8995
              max neval cld
          277.334   100  a 
         1210.437   100   b
        > 
        > listOfDataFrames <- vector(mode = "list", length = 100)
        > 
        > for (i in 1:100) {
        +   listOfDataFrames[[i]] <- data.frame(a=sample(letters, 500, rep=T),
        +                                       b=rnorm(500), c=rnorm(500))
        + }
        > microbenchmark(
        +   do.call("rbind", listOfDataFrames),
        +   plyr::rbind.fill(listOfDataFrames), 
        +   plyr::ldply(listOfDataFrames, data.frame),
        +   data.table::rbindlist(listOfDataFrames),
        +   dplyr::bind_rows(listOfDataFrames)
        + )
        Unit: milliseconds
                                              expr       min        lq      mean    median         uq        max neval cld
                do.call("rbind", listOfDataFrames) 49.242945 54.944516 96.956251 62.222668 142.651031 160.632730   100   c
                plyr::rbind.fill(listOfDataFrames) 12.587143 14.870739 16.334073 15.337754  16.137987 101.429892   100  b 
         plyr::ldply(listOfDataFrames, data.frame) 17.319901 19.629932 21.911172 20.152599  20.951209 107.047521   100  b 
           data.table::rbindlist(listOfDataFrames)  1.835597  2.034321  2.729367  2.196177   2.896003   6.139369   100 a  
                dplyr::bind_rows(listOfDataFrames)  3.598844  3.897279  4.646169  4.188294   5.194904   7.547834   100 a  
        
        Reply

What do you think?