reshape alternating columns in less time and using less memory

How can I do this reshape faster, and so that it takes up less memory? My aim is to reshape a data frame that is 500,000 rows by 500 columns with 4 GB of RAM.

Here's a function that will make some reproducible data:

    make_example <- function(ndoc, ntop) {
      # doc numbers
      V1 <- seq_len(ndoc)
      # filenames
      V2 <- vector("list", ndoc)
      for (i in 1:ndoc) {
        V2[i] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
      }
      # topic proportions
      tvals <- data.frame(matrix(runif(ndoc * ntop), ncol = ntop))
      # topic number
      tnumvals <- data.frame(matrix(sample(1:ntop, size = ndoc * ntop, replace = TRUE), ncol = ntop))
      # now make topic props and topic numbers alternating columns (rather slow!)
      alternating <- data.frame(c(matrix(c(tnumvals, tvals), 2, byrow = TRUE)))
      # make colnames for topic number and topic props
      ntopx <- sapply(1:ntop, function(j) paste0("ntop_", j))
      ptopx <- sapply(1:ntop, function(j) paste0("ptop_", j))
      tops <- c(rbind(ntopx, ptopx))
      # make data frame
      dat <- data.frame(V1 = V1, V2 = unlist(V2), alternating)
      names(dat) <- c("docnum", "filename", tops)
      # give df as result
      return(dat)
    }

Make some reproducible data:

    set.seed(007)
    dat <- make_example(500000, 500)
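For scale, the size of that example can be estimated from the column types alone. This is only a back-of-the-envelope sketch: it assumes the topic-number columns are stored as 4-byte integers and the proportion columns as 8-byte doubles, and it ignores the filename column and R's per-object overhead. The variable names are illustrative.

```r
# Rough size of the wide data frame from make_example(500000, 500).
# Assumption: ntop integer columns plus ntop double columns;
# docnum, filename and R's own overhead are ignored.
ndoc <- 500000
ntop <- 500

bytes_int <- 4  # an R integer takes 4 bytes
bytes_dbl <- 8  # an R double takes 8 bytes

wide_bytes <- ndoc * ntop * (bytes_int + bytes_dbl)
wide_gib   <- wide_bytes / 2^30

wide_gib  # roughly 2.8 GiB before any reshaping
```

A long-format copy holds the same values plus repeated id columns, so it needs at least as much again while both objects exist.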

Here's my current method (thanks to https://stackoverflow.com/a/8058714/1036500):

    library(reshape2)
    NTOPICS = (ncol(dat) - 2 )/2
    nam <- c('num', 'text', paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))

    system.time( dat_l2 <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long', sep = ""))
    system.time( dat.final2 <- dcast(dat_l2, dat_l2[,2] ~ dat_l2[,3], value.var = "proportion" ) )

Some timings, just for the reshape since that's the slowest step:

make_example(5000,100) = 82 sec

make_example(50000,200) = 2855 sec (crashed on attempting the second step)

make_example(500000,500) = not yet possible...

What other methods are there that are faster and less memory intensive for this reshape (data.table, this)?

Answer1:

I doubt very much that this will succeed with so little RAM when passing a 500,000 x 500 data frame. I wonder whether you could do even simple operations in that limited space. Buy more RAM. Furthermore, reshape2 is slow, so use stats::reshape for big stuff, and give it a hint about what the separator is.

    > set.seed(007)
    > dat <- make_example(5, 3)
    > dat
      docnum filename ntop_1     ptop_1 ntop_2    ptop_2 ntop_3    ptop_3
    1      1    y8214      3 0.06564574      1 0.6799935      2 0.8470244
    2      2    e6x39      2 0.62703876      1 0.2637199      3 0.4980761
    3      3    34c19      3 0.49047504      3 0.1857143      3 0.7905856
    4      4    1H0y6      2 0.97102441      3 0.1851432      2 0.8384639
    5      5    P6zqy      3 0.36222085      3 0.3792967      3 0.4569039
    > reshape(dat, direction="long", varying=3:8, sep="_")
        docnum filename time ntop       ptop id
    1.1      1    y8214    1    3 0.06564574  1
    2.1      2    e6x39    1    2 0.62703876  2
    3.1      3    34c19    1    3 0.49047504  3
    4.1      4    1H0y6    1    2 0.97102441  4
    5.1      5    P6zqy    1    3 0.36222085  5
    1.2      1    y8214    2    1 0.67999346  1
    2.2      2    e6x39    2    1 0.26371993  2
    3.2      3    34c19    2    3 0.18571426  3
    4.2      4    1H0y6    2    3 0.18514322  4
    5.2      5    P6zqy    2    3 0.37929675  5
    1.3      1    y8214    3    2 0.84702439  1
    2.3      2    e6x39    3    3 0.49807613  2
    3.3      3    34c19    3    3 0.79058557  3
    4.3      4    1H0y6    3    2 0.83846387  4
    5.3      5    P6zqy    3    3 0.45690386  5
    > system.time( dat <- make_example(5000,100) )
       user  system elapsed
      2.925   0.131   3.043
    > system.time( dat2 <- reshape(dat, direction="long", varying=3:202, sep="_"))
       user  system elapsed
     16.766   8.608  25.272

I'd say that around 1/5 of my 32 GB of memory got used during that process, which was 250 times smaller than your goal, so I'm not surprised that your machine hung. (It should not have "crashed". The authors of R would prefer that you give accurate descriptions of behavior, and I suspect the R process stopped responding when it paged into virtual memory.) I have performance issues that I need to work around with a dataset that is 7 million records x 100 columns when using 32 GB.
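If the long format itself is what exhausts memory, one possible route is to skip it and fill the document-by-topic matrix directly with base R matrix indexing. This is not the method used above, only a sketch of a different technique. It assumes the `ntop_*`/`ptop_*` column naming from `make_example`, and that each topic number appears at most once per document (with duplicates, the later value overwrites the earlier one). The tiny `dat` below is a made-up stand-in, and the approach is untested at the full 500,000-row size.

```r
# Made-up stand-in for `dat`: 4 documents, 2 (topic number, proportion) pairs each.
dat <- data.frame(
  docnum   = 1:4,
  filename = c("a", "b", "c", "d"),
  ntop_1 = c(3L, 1L, 2L, 3L), ptop_1 = c(0.5, 0.2, 0.9, 0.1),
  ntop_2 = c(1L, 2L, 3L, 2L), ptop_2 = c(0.4, 0.7, 0.05, 0.6)
)

# Locate the paired columns by name (naming scheme assumed from make_example).
num_cols  <- grep("^ntop_", names(dat))
prop_cols <- grep("^ptop_", names(dat))

tnum  <- as.matrix(dat[num_cols])   # documents x slots: topic numbers
tprop <- as.matrix(dat[prop_cols])  # documents x slots: proportions

# Preallocate the wide result: one row per document, one column per topic.
ntopics <- max(tnum)
wide <- matrix(NA_real_, nrow = nrow(dat), ncol = ntopics,
               dimnames = list(dat$docnum, paste0("topic", seq_len(ntopics))))

# Two-column index matrix: (row = document, column = topic number).
idx <- cbind(as.vector(row(tnum)), as.vector(tnum))
wide[idx] <- as.vector(tprop)

wide
```

Topics that a document never mentions stay `NA`. The two `as.matrix` calls still copy the data once, so this reduces the memory footprint rather than removing it.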
