dplyr::slice in data.table [duplicate]

Question:

This question already has an answer here:

  • How to extract the first n rows per group? (/questions/16325641/how-to-extract-the-first-n-rows-per-group)
  • Subset by group with data.table (/questions/16573995/subset-by-group-with-data-table)

What is the idiomatic data.table equivalent of the following?

    library(dplyr)
    df %>% group_by(b) %>% slice(1:10)

I can do

    library(data.table)
    df[, .SD[1:10], by = b]

but that appears much slower. Is there a better way?

    set.seed(0)
    df <- rep(1:500, sample(500:1000, 500, T)) %>%
      data.table(a = runif(length(.)), b = .)

    f1 <- function(df) {
      df %>% group_by(b) %>% slice(1:10)
    }
    f2 <- function(df) {
      df[, .SD[1:10], by = b]
    }

    library(microbenchmark)
    microbenchmark(f1(df), f2(df))
    #Unit: milliseconds
    #   expr      min       lq      mean   median        uq      max neval
    # f1(df) 17.67435 19.50381  22.06026 20.50166  21.42668  78.3318   100
    # f2(df) 69.69554 79.43387 119.67845 88.25585 106.38661 581.3067   100

========== Benchmarks with suggested methods ==========

    set.seed(0)
    df <- rep(1:500, sample(500:1000, 500, T)) %>%
      data.table(a = runif(length(.)), b = .)

    use.slice <- function(df) {
      df %>% group_by(b) %>% slice(1:10)
    }
    IndexSD <- function(df) {
      df[, .SD[1:10], by = b]
    }
    Index.I <- function(df) {
      df[df[, .I[seq_len(10)], by = b]$V1]
    }
    use.head <- function(df) {
      df[, head(.SD, 10), by = b]
    }

    library(microbenchmark)
    microbenchmark(use.slice(df),
                   IndexSD(df),
                   Index.I(df),
                   use.head(df),
                   unit = "relative",
                   times = 100L)
    #Unit: relative
    #          expr       min        lq      mean    median        uq       max neval
    # use.slice(df)  9.804549 10.269234  9.167413  8.900060  8.782862  6.520270   100
    #   IndexSD(df) 38.881793 42.548555 39.044095 38.636523 39.942621 18.981748   100
    #   Index.I(df)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   100
    #  use.head(df)  3.666898  4.033038  3.728299  3.408249  3.545258  3.951565   100
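Before reading too much into the timings, it may be worth confirming that the three data.table variants pick the same rows. The sketch below does that on a small made-up table (`toy` and its values are assumptions for illustration, not the benchmark data). One visible difference: the by-group forms return the grouping column first, while the `.I` form keeps the original column order.

```r
library(data.table)

# Toy data, made up for illustration (not the benchmark data)
toy <- data.table(a = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                  b = c(1L, 1L, 1L, 2L, 2L, 2L))

# First 2 rows per group, three ways
by_sd   <- toy[, .SD[1:2], by = b]
by_head <- toy[, head(.SD, 2), by = b]
by_i    <- toy[toy[, .I[seq_len(2)], by = b]$V1]

# Grouped results put the grouping column first;
# the .I result keeps the original column order
names(by_sd)  # "b" "a"
names(by_i)   # "a" "b"

# Compare the selected values of column a across the three results
same_rows <- identical(by_sd$a, by_head$a) && identical(by_sd$a, by_i$a)
same_rows     # TRUE
```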

Answer1:

We can use .I to extract the row indices and then subset the original table once, which should be faster:

    out <- df[df[, .I[seq_len(10)], by = b]$V1]
    dim(out)
    #[1] 5000 2
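To see what the two steps do, here is a minimal sketch on a toy table (the name `toy` and its values are made up for illustration). Within each group, `.I` holds the row numbers of the original table, so the inner call only collects integers and the outer call does a single subset.

```r
library(data.table)

# Toy data, made up for illustration (not the OP's data)
toy <- data.table(a = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                  b = c(1L, 1L, 1L, 2L, 2L, 2L))

# Inner step: row numbers (in toy) of the first 2 rows of each group.
# The unnamed result column is called V1.
idx <- toy[, .I[seq_len(2)], by = b]
idx$V1
# 1 2 4 5

# Outer step: one subset of the original table by those row numbers
res <- toy[idx$V1]
```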

Checking if there are NAs (as the OP commented)

    any(out[, Reduce(`|`, lapply(.SD, is.na))])
    #[1] FALSE
    dim(df)
    #[1] 374337 2

Benchmarks:

    f3 <- function(df) {
      df[df[, .I[seq_len(10)], by = b]$V1]
    }
    microbenchmark(f1(df), f2(df), f3(df), unit = "relative", times = 10L)
    #Unit: relative
    #   expr       min        lq      mean    median        uq      max neval cld
    # f1(df)  5.727822  5.480741  4.945486  5.672206  4.317531  5.10003    10  b
    # f2(df) 24.572633 23.774534 17.842622 23.070634 16.099822 11.58287    10   c
    # f3(df)  1.000000  1.000000  1.000000  1.000000  1.000000  1.00000    10 a
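On the NA check: it comes back FALSE here because every group in `df` has at least 10 rows. If a group can be shorter than the requested count, `seq_len(n)` indexes past the end of `.I` and yields NA indices, which turn into all-NA rows. A sketch of one way around that, on made-up data (`small` is a hypothetical table for illustration), is to take `head(.I, n)` instead:

```r
library(data.table)

# Toy data, made up for illustration: group 2 has only one row
small <- data.table(a = c(0.1, 0.2, 0.3, 0.4),
                    b = c(1L, 1L, 1L, 2L))

# seq_len(2) asks for two rows in every group, so group 2 yields an NA index
with_na <- small[small[, .I[seq_len(2)], by = b]$V1]

# head() never returns more elements than exist, so no NA index is produced
no_na <- small[small[, head(.I, 2), by = b]$V1]

nrow(with_na)  # 4, the last row is all NA
nrow(no_na)    # 3
```

Whether the `head(.I, n)` form is as fast as `.I[seq_len(n)]` on the benchmark data above was not measured here.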
