understanding difference in results between dplyr group_by vs tapply

Question:

I was expecting to see the same results from these two runs, but they are different. It makes me question whether I really understand how the dplyr code is working (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results are different, or how to obtain matching results?

    library(dplyr)

    # Note: %.% was dplyr's original chaining operator; later versions use %>%.
    x <- iris
    x <- x %.%
      group_by(Species, Sepal.Width) %.%
      summarise(freq = n()) %.%
      summarise(mean_by_group = mean(Sepal.Width))
    print(x)

    x <- iris
    x <- tapply(x$Sepal.Width, x$Species, mean)
    print(x)

Update: I don't think this is the most efficient way to do this, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr

    library(dplyr)

    x <- iris
    x <- x %.%
      group_by(Species, Sepal.Width) %.%
      summarise(freq = n()) %.%
      # weight each distinct width by how often it occurs
      mutate(mean_by_group = sum(Sepal.Width * freq) / sum(freq))
    print(x)
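A minimal base-R sketch of why the weighting seems to matter here, assuming the built-in iris data. Grouping by both Species and Sepal.Width leaves one row per distinct width, so a plain mean over those rows ignores how often each width occurs. The names distinct_mean, weighted_mean and plain_mean are illustrative only, not from the original code.

```r
# Mean of the *distinct* Sepal.Width values per species: this is what
# averaging after grouping by (Species, Sepal.Width) effectively computes.
distinct_mean <- tapply(iris$Sepal.Width, iris$Species,
                        function(w) mean(unique(w)))

# Frequency-weighted mean of the distinct values: each width counts as
# many times as it occurs, which is the ordinary per-species mean.
weighted_mean <- tapply(iris$Sepal.Width, iris$Species, function(w) {
  freq <- table(w)                                  # count of each distinct width
  sum(as.numeric(names(freq)) * freq) / sum(freq)
})

# The reference result from the tapply approach.
plain_mean <- tapply(iris$Sepal.Width, iris$Species, mean)

# The weighted version agrees with tapply; the distinct-value mean does not.
print(all.equal(as.vector(weighted_mean), as.vector(plain_mean)))
print(isTRUE(all.equal(as.vector(distinct_mean), as.vector(plain_mean))))
```

Comparing with all.equal() rather than == avoids spurious mismatches from floating-point rounding.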

Update: for some reason I thought I had to group all variables I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, which is closer to the examples in the package.

    x <- iris %.%
      group_by(Species) %.%
      summarise(Sepal.Width = mean(Sepal.Width))
    print(x)

Answer1:

Maybe this...

dplyr:

    require(dplyr)
    iris %>%
      group_by(Species) %>%
      summarise(mean_width = mean(Sepal.Width))
    # Source: local data frame [3 x 2]
    #
    #      Species mean_width
    # 1     setosa      3.428
    # 2 versicolor      2.770
    # 3  virginica      2.974

tapply:

    tapply(iris$Sepal.Width, iris$Species, mean)
    #     setosa versicolor  virginica
    #      3.428      2.770      2.974

NOTE: tapply() simplifies output by default whereas summarise() does not:

    typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))
    # [1] "double"

With simplify=FALSE it returns a list instead:

    typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))
    # [1] "list"

So to actually get the same type of output from tapply() you would need:

    tbl_df(
      data.frame(
        mean_width = tapply(iris$Sepal.Width, iris$Species, mean)))
    # Source: local data frame [3 x 1]
    #
    #            mean_width
    # setosa          3.428
    # versicolor      2.770
    # virginica       2.974

And this still isn't the same, because the species names are stored as an attribute (the row names) here and not as a column of the data frame.
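One possible way to close that gap, sketched in base R: move the group labels out of the names attribute into a real column. The variable names means and res are illustrative, and this is only one of several ways to build such a frame.

```r
# Per-species means as a named 1-d array, as in the tapply call above.
means <- tapply(iris$Sepal.Width, iris$Species, mean)

# Build a data frame with Species as an ordinary column instead of row names.
res <- data.frame(
  Species    = factor(names(means), levels = levels(iris$Species)),
  mean_width = as.vector(means),   # drop the names/dim attributes
  row.names  = NULL
)
print(res)
```

The result has the same two columns as the summarise() output, although it is a plain data.frame rather than a dplyr table.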
