12249

sum and conditionally count based on a second column

I have gotten frustrated trying to solve this seemingly simple problem. I have a dataset (df) like this:

structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L), Unknown = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), Temp = c(21L, 21L, 21L, 23L, 23L, 21L, 21L, 22L, 21L, 23L, 23L, 22L, 21L, 21L, 22L, 22L, 21L, 21L, 23L, 23L), Obs = structure(c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("mdk", "sde"), class = "factor"), State = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ma", class = "factor"), Zone = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Segment = c(8L, 7L, 4L, 17L, 18L, 7L, 2L, 12L, 1L, 17L, 18L, 12L, 9L, 7L, 13L, 11L, 8L, 9L, 17L, 18L), Subseg = c(1L, 3L, 3L, 2L, 2L, 2L, 4L, 0L, 10L, 4L, 2L, 0L, 1L, 1L, 3L, 1L, 2L, 2L, 1L, 1L), Wdir = structure(c(2L, 2L, 1L, 3L, 3L, 2L, 2L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("na", "ne", "nw"), class = "factor"), Wvel = structure(c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("5", "na"), class = "factor"), Clouds = structure(c(1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 3L), .Label = c("1", "4", "na"), class = "factor"), Temp.1 = structure(c(1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 3L), .Label = c("20", "25", "na"), class = "factor"), Species = structure(c(7L, 21L, 1L, 21L, 16L, 4L, 16L, 6L, 1L, 17L, 5L, 7L, 5L, 1L, 1L, 6L, 7L, 7L, 24L, 5L), .Label = c("ABDU", "ABDU", "ABDU", "ABDU", "ABDU", "CAGO", "CAGO", "CAGO", "CAGO", "CAGO", "GOLD", "GOLD", "GOLD", "GOLD", "GOLD", "MERG", "MERG", "MERG", "MERG", "MERG", "SCOT", "SCOT", "SCOT", "SCOT", "SCOT", "SCOT", "SCOT"), class = "factor"), Count = c(5L, 1L, 150L, 3L, 20L, 8L, 5L, 10L, 5L, 1L, 20L, 10L, 2L, 2L, 80L, 40L, 1L, 1000L, 2L, 20L)), .Names = c("Year", "Unknown", "Temp", "Obs", "State", "Zone", "Segment", "Subseg", "Wdir", "Wvel", "Clouds", "Temp.1", "Species", "Count"), row.names = c(666L, 614L, 2060L, 1738L, 1459L, 536L, 197L, 2467L, 98L, 1794L, 1449L, 2464L, 696L, 483L, 2644L, 2350L, 686L, 844L, 2989L, 2934L), class = "data.frame")

With a header that looks like this:

Year Unknown Temp Obs State Zone Segment Subseg Wdir Wvel 666 2015 1 21 mdk ma 2 8 1 ne 5 614 2015 1 21 mdk ma 2 7 3 ne 5 2060 2015 1 21 sde ma 2 4 3 na na 1738 2015 1 23 mdk ma 2 17 2 nw 5 1459 2015 1 23 mdk ma 2 18 2 nw 5 536 2015 1 21 mdk ma 2 7 2 ne 5 Clouds Temp.1 Species Count 666 1 20 CAGO 5 614 1 20 SCOT 1 2060 na na ABDU 150 1738 1 20 SCOT 3 1459 1 20 MERG 20 536 1 20 ABDU 8

Among other things within dplyr, I want to get a sum of each species as a new column, when I am grouping by segment. This is the final code I have tried with many variations.

df_group = df %>% group_by(Segment) %>% summarise(temp = round(mean(Temp)), WDir = round(mean(Wdir)), ABDU = sum(which(Species=="ABDU"),Count), CAGO = sum(which(Species=="CAGO"),Count), GOLD = sum(which(Species=="GOLD"),Count), MERG = sum(which(Species=="MERG"),Count), SCOT = sum(which(Species=="SCOT"),Count))

And this is what I get (to show correct format):

Segment temp WDir ABDU CAGO GOLD MERG SCOT 1 1 21 2 6 5 5 5 5 2 2 21 2 5 5 5 6 5 3 4 21 1 151 150 150 150 150 4 7 21 2 16 11 11 11 12 5 8 21 2 6 9 6 6 6 6 9 21 2 1003 1004 1002 1002 1002

The format and general idea are what I want, but the numbers are not adding up the way I want them to. I'm sure it is simple but need some help! Thanks.

Answer1:

The problem is that which returns a vector of the positions, but you're not using those to subset. So the sum you are getting is of the positions which are true in addition to the count variable. e.g.

x <- c("a", "b", "b") count <- c(10, 11, 12) sum(which(c("a", "b", "b") == "b"), count) # 38 because it is 2 + 3 + 10 + 11 + 12

I believe what you want is (or at least one way of writing it):

sum(ifelse(x == "b", count, 0)) # 23 because it is equal to 0 + 11 + 12

Translating into dplyr syntax, your example could look like this:

df_group = df %>% group_by(Segment) %>% summarise(temp = round(mean(Temp)), WDir = round(mean(Wdir)), ABDU = sum(ifelse(Species=="ABDU", Count, 0L)), CAGO = sum(ifelse(Species=="CAGO", Count, 0L)), GOLD = sum(ifelse(Species=="GOLD", Count, 0L)), MERG = sum(ifelse(Species=="MERG", Count, 0L)), SCOT = sum(ifelse(Species=="SCOT", Count, 0L)))

Answer2:

Another approach, in case you don't want to type out the sum for all your species:

library(reshape2) library(dplyr) # I had a problem with duplicate factor levels from your dput, # so I re-factored species df$Species = as.factor(as.character(df$Species)) species.counts = select(df, Segment, Species, Count) %>% dcast(formula = Segment ~ Species, value.var = "Count", fun.aggregate = sum) > head(species.counts) Segment ABDU CAGO MERG SCOT 1 1 5 0 0 0 2 2 0 0 5 0 3 4 150 0 0 0 4 7 10 0 0 1 5 8 0 6 0 0 6 9 2 1000 0 0 df %>% group_by(Segment) %>% summarise(temp = round(mean(Temp))) %>% left_join(species.counts) Source: local data frame [11 x 6] Segment temp ABDU CAGO MERG SCOT 1 1 21 5 0 0 0 2 2 21 0 0 5 0 3 4 21 150 0 0 0 4 7 21 10 0 0 1 5 8 21 0 6 0 0 6 9 21 2 1000 0 0

I also couldn't do the wind direction average, because your dput data only has that as a factor with the directions, not like the head() you showed, but the technique generalizes.

Recommend

  • Microsoft Azure- Vault deletion error
  • A right way to work with [closed]
  • Parse data from OpenWeatherMap, using retrofit2 and rxJava2
  • Does AWS offer inter-region / cross region VPC Peering?
  • mutate_each_ non-standard evaluation
  • Create new variable in R data frame by conditional lookup
  • Replacing atom with corresponding value from a list in Prolog
  • sum and conditionally count based on a second column
  • R do something after a warning (like tryCatch a warning, then edit an object)
  • Microsoft Access required for C# database access on end-user machines?
  • How to extract the legend labels from a ggplot2 object?
  • How can I maintain a color scheme across ggplots, while dropping unused levels in each plot?
  • Store files in Cloud using Java EE
  • ANOVA on multiple responses, by multiple groups NOT part of formula
  • Insert Pandas dataframe into Cassandra Table
  • RegistryKey.GetSubKeyNames returns names that are not in that subkey!
  • Get used tables from sql query [duplicate]
  • ggplot2: make the points on the line a darker color than the line color
  • How to access meteor package name inside package?
  • How do I remove all but some records based on a threshold?
  • as3-flash: any way to access all the instances placed in different frames from document class?
  • Using Sax parsing to edit and write XML in VB6
  • wxPython: displaying multiple widgets in same frame
  • Installed module is empty
  • Django: Count of Group Elements
  • R - Combining Columns to String Based on Logical Match
  • Sending data from AppleScript to FileMaker records
  • MySQL WHERE-condition in procedure ignored
  • Unanticipated behavior
  • VB.net deserialize, JSON Conversion from type 'Dictionary(Of String,Object)' to type '
  • using conditional logic : check if record exists; if it does, update it, if not, create it
  • KeystoneJS: Relationships in Admin UI not updating
  • trying to dynamically update Highchart column chart but series undefined
  • Why can't I rebase on to an ancestor of source changesets if on a different branch?
  • embed rChart in Markdown
  • Can't mass-assign protected attributes when import data from csv file
  • How to get NHibernate ISession to cache entity not retrieved by primary key
  • How can I use `wmic` in a Windows PE script?
  • Unable to use reactive element in my shiny app