Error using select function in R [duplicate]

Question:

This question already has an answer here:

<a href="/questions/22209706/count-5-highest-values-of-a-variable" dir="ltr" rel="nofollow">Count 5 highest values of a variable</a>

I want to get the song that each user plays most frequently. The three fields I want in the csv file are userId, songId and playCount, but the select function gives an error:

write.csv(group_by(mydata,userId) %.% summarise(one=max(playCount)) %.% select(userId,songId,playCount), file="FavouriteSongs.csv")
Error in eval(expr, envir, enclos) : object 'songId' not found

An example of the data looks like this:

userId songId playCount
A      568r   85
A      711g   18
C      34n    18
E      454j   65
D      663a   72
B      35d    84
A      34c    72
A      982s   65
E      433f   11
A      565t   7

Thanks in advance

Answer1:

In your chained sequence of dplyr operations, the summarise call will produce two columns: the grouping variable and the result of the summary function.

df %.%
  group_by(userId) %.%
  summarise(one = max(playCount))
# Source: local data frame [5 x 2]
#
#   userId one
# 1      A  85
# 2      B  84
# 3      C  18
# 4      D  72
# 5      E  65

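When you then try to select the songId variable from the data frame generated by summarise, it is not found: summarise has already dropped every column except the grouping variable and the new summary column (so playCount is gone as well, and only its maximum survives as one).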

df %.%
  group_by(userId) %.%
  summarise(one = max(playCount)) %.%
  select(userId, songId, playCount)
# Error in eval(expr, envir, enclos) : object 'songId' not found

A more suitable dplyr function in this case is filter. Here we select rows where the condition playCount == max(playCount) is TRUE <em>within</em> each group.

df %.%
  group_by(userId) %.%
  filter(playCount == max(playCount))
# Source: local data frame [5 x 3]
# Groups: userId
#
#   userId songId playCount
# 1      A   568r        85
# 2      C    34n        18
# 3      E   454j        65
# 4      D   663a        72
# 5      B    35d        84
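For readers on a recent dplyr release, where the `%.%` operator is no longer available, the same grouped filter can be sketched with `%>%`. This is only an illustration under a few assumptions: the data frame is rebuilt by hand from the sample in the question, the name `favourites` is made up for this example, and `ungroup()` and `row.names = FALSE` are additions that the original code did not have.

```r
library(dplyr)

# Sample data copied from the question
mydata <- data.frame(
  userId    = c("A", "A", "C", "E", "D", "B", "A", "A", "E", "A"),
  songId    = c("568r", "711g", "34n", "454j", "663a", "35d",
                "34c", "982s", "433f", "565t"),
  playCount = c(85, 18, 18, 65, 72, 84, 72, 65, 11, 7),
  stringsAsFactors = FALSE
)

# Keep, within each user, the row(s) whose playCount equals that user's maximum.
# All three columns survive because filter() drops rows, not columns.
favourites <- mydata %>%
  group_by(userId) %>%
  filter(playCount == max(playCount)) %>%
  ungroup()

# Write the result; row.names = FALSE avoids an extra index column in the file
write.csv(favourites, file = "FavouriteSongs.csv", row.names = FALSE)
```

Note that if a user has two songs tied for the highest playCount, this keeps both rows.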

You can find several nice <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html" rel="nofollow"><strong>dplyr examples here</strong></a>.

Answer2:

I do not generally down-vote, but this question is basic, shows no research effort, largely duplicates existing questions, and has a solution that is easily found elsewhere.

There are several ways to achieve this.

Let d be your data.frame. To retrieve the row with the most played song overall:

d[d$playCount == max(d$playCount), ]

For the most played song per user, try this:

d <- data.frame(userId = rep(seq(1:5),2) , songId = letters[1:10], playCount = c(10:19))
> d
   userId songId playCount
1       1      a        10
2       2      b        11
3       3      c        12
4       4      d        13
5       5      e        14
6       1      f        15
7       2      g        16
8       3      h        17
9       4      i        18
10      5      j        19

d2<- d[order(-d$playCount), ]
dout <- d2[!duplicated(d2$userId), ]
> dout
   userId songId playCount
10      5      j        19
9       4      i        18
8       3      h        17
7       2      g        16
6       1      f        15
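One practical difference between the two styles is how ties are handled. The sketch below uses a tiny made-up data frame (the values and the names `ties_kept` and `one_per_user` are assumptions for illustration) and base R only: a per-group comparison via `ave()` keeps every row tied for the maximum, while the order-then-`duplicated()` approach keeps exactly one row per user.

```r
# Hypothetical data: user A has two songs tied at playCount 5
d <- data.frame(
  userId    = c("A", "A", "B"),
  songId    = c("x", "y", "z"),
  playCount = c(5, 5, 3),
  stringsAsFactors = FALSE
)

# Per-group maximum comparison: ave() returns each row's group maximum,
# so both tied rows for user A are kept
ties_kept <- d[d$playCount == ave(d$playCount, d$userId, FUN = max), ]

# Sort by descending playCount, then keep the first row seen for each user:
# only one of the tied rows for user A survives
d2 <- d[order(-d$playCount), ]
one_per_user <- d2[!duplicated(d2$userId), ]
```

Which behaviour is preferable depends on whether the output file should contain exactly one favourite per user or all joint favourites.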

I really don't understand the down-vote. The approach is correct and fast, almost as fast as dplyr. Try it with a 1,000,000-row data frame:

df <- data.frame(userId = rep(seq(1:5),100000) , songId = rep(letters[1:10], 100000), playCount = runif(1000000,10,20))

Using @Henrik's dplyr approach:

system.time(df %.% group_by(userId) %.% filter( playCount == max(playCount)))
Source: local data frame [5 x 3]
Groups: userId

  userId songId playCount
1      2      b  19.99995
2      5      j  19.99982
3      1      f  19.99981
4      4      d  19.99995
5      3      h  19.99999
   user  system elapsed
   0.08    0.02    0.09

and using <a href="https://stackoverflow.com/a/5820329/640783" rel="nofollow">Hadley's</a> approach:

df2<- df[order(-df$playCount), ]
dout <- df2[!duplicated(df2$userId), ]
> dout
       userId songId playCount
671528      3      h  19.99999
466824      4      d  19.99995
185512      2      b  19.99995
249190      5      j  19.99982
455746      1      f  19.99981

system.time(dout <- df2[!duplicated(df2$userId), ])
   user  system elapsed
   0.13    0.00    0.12
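The `system.time` call above measures only the `duplicated` step, not the preceding sort. A like-for-like comparison would probably time both steps together. The following is a rough sketch of that, not a benchmark result: `set.seed(1)` and the name `timing` are additions for this example, and the numbers will vary by machine.

```r
set.seed(1)  # assumption: fixed seed so the random data is reproducible

# Same shape as the test data above: 1,000,000 rows, 5 users
df <- data.frame(
  userId    = rep(1:5, 100000),
  songId    = rep(letters[1:10], 100000),
  playCount = runif(1000000, 10, 20)
)

# Time the sort and the de-duplication together
timing <- system.time({
  df2  <- df[order(-df$playCount), ]
  dout <- df2[!duplicated(df2$userId), ]
})

print(timing)
```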

Now I'd suggest you up-vote two brilliant approaches, from Hadley <a href="https://stackoverflow.com/a/5820329/640783" rel="nofollow">here</a> and from Gavin Simpson <a href="https://stackoverflow.com/a/5805340/640783" rel="nofollow">here</a>.
