
remove factors with criteria

I'm working with the KDD 2010 data (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In R, how can I remove the rows whose factor level has a low total number of instances?

I've tried the following. First, I create a table for the student name factor:

studenttable <- table(data$Anon.Student.Id)

which returns a table:

l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
      9890       7989       7665       7242       6928       6651

Then I can get a table that tells me whether there are more than 1000 data points for a given factor level:

biginstances <- studenttable>1000

Then I tried subsetting the data with this query:

bigdata <- subset(data, (biginstances[Anon.Student.Id]))

But I get weird subsets that still have the same number of factor levels as the full set. I'm simply interested in removing the rows whose factor level isn't well represented in the dataset.

Answer1:

There are probably more efficient ways to do this, but this should get you what you want. I didn't use the names you used, but you should be able to follow the logic just fine (hopefully!).

# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# Tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >= 3]
# Only look at the data that we care about
dat[dat$id %in% idx, ]
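Applied to the column name from the question, the same idea might look like the sketch below. The data frame `students`, its `score` column, and the cutoff `min_rows` are made up for illustration (the question uses 1000 on the real KDD data); only `Anon.Student.Id` comes from the question.

```r
# Toy stand-in for the KDD data; the ids and counts here are invented.
students <- data.frame(
  Anon.Student.Id = factor(rep(c("s1", "s2", "s3"), times = c(5, 2, 4))),
  score = seq_len(11)
)

min_rows <- 3  # hypothetical cutoff; the question uses 1000

# Count rows per student, then keep the ids that reach the cutoff
counts <- table(students$Anon.Student.Id)
keep_ids <- names(counts)[counts >= min_rows]

# Keep only rows whose id is in the well-represented set
big_students <- students[students$Anon.Student.Id %in% keep_ids, ]
```

Because the filter compares id values with `%in%`, it does not depend on the order of the table.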

Answer2:

@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.

biginstances <- studenttable>1000

This will create a logical vector whose length is equal to the number of unique student IDs. studenttable contained a count for each unique value of data$Anon.Student.Id. When you try to use that logical vector in subset:

bigdata <- subset(data, (biginstances[Anon.Student.Id]))

its length is almost surely much less than the number of rows in data. And since the subsetting criterion in subset is meant to identify rows of data, R's recycling rules take over and you get 'weird' looking subsets.
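A related detail that may be worth checking: when a vector is indexed by a factor, R uses the factor's integer codes rather than its labels. If the table has been reordered (the counts printed in the question look sorted), a lookup like biginstances[Anon.Student.Id] can then pick the wrong entries. The sketch below uses made-up data and the hypothetical names `dat`, `tab`, `big`, `by_code` and `by_name` to show the difference between indexing by code and indexing by name.

```r
# Made-up data: "a" occurs once, "b" twice, "c" four times
dat <- data.frame(id = factor(rep(c("a", "b", "c"), times = c(1, 2, 4))))

# Sorting by count reorders the table to c, b, a,
# which no longer matches the factor's level order a, b, c
tab <- sort(table(dat$id), decreasing = TRUE)
big <- tab >= 2

# Indexing by the factor uses its integer codes (a = 1, b = 2, c = 3),
# so each row picks up the entry at that position in the sorted table
by_code <- as.vector(big[dat$id])

# Indexing by the labels looks each id up by name instead
by_name <- as.vector(big[as.character(dat$id)])

# Keeps only the "b" and "c" rows
dat[by_name, , drop = FALSE]
```

Converting the factor with as.character before the lookup, or filtering with %in% on the names, avoids the dependence on table order.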

I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:

> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c

So you'll want to use droplevels if you want to ensure that those levels are really 'gone'. Also, see options(stringsAsFactors = FALSE).
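droplevels also accepts a whole data frame, which is convenient after row subsetting. A minimal sketch with made-up data (the names `dat` and `kept` are hypothetical):

```r
# Made-up data frame with a factor column
dat <- data.frame(
  id = factor(rep(c("a", "b", "c"), times = c(1, 3, 3))),
  y  = 1:7
)

# Subsetting alone removes the rows but keeps "a" as an empty level
kept <- dat[dat$id != "a", ]
levels_before <- levels(kept$id)

# droplevels() on the data frame drops unused levels from its factor columns
kept <- droplevels(kept)
levels_after <- levels(kept$id)
```

Here `levels_before` still lists all three levels, while `levels_after` contains only the levels that actually occur in the remaining rows.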

Answer3:

Another approach involves a join between your dataset and the table of counts. I'll use plyr here, but it can be done with base functions (such as merge and as.data.frame.table).

require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE), var2 = 1:100)

R> table(Data$var1)

 A  B  C  D  E
19 20 21 22 18

## count the rows in each category
mytable <- count(Data, vars = "var1")
## mytable <- as.data.frame(table(Data$var1))

R> str(mytable)
'data.frame': 5 obs. of 2 variables:
 $ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ freq: int 19 20 21 22 18

## attach each category's count to every row
Data <- join(Data, mytable)
## Data <- merge(Data, mytable)

R> str(Data)
'data.frame': 100 obs. of 3 variables:
 $ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
 $ var2: int 1 2 3 4 5 6 7 8 9 10 ...
 $ freq: int 21 20 21 18 21 18 18 22 21 19 ...

## keep the categories with more than 20 rows and drop the unused levels
mysubset <- droplevels(subset(Data, freq > 20))

R> table(mysubset$var1)

 C  D
21 22
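For reference, a base R version of the same join might look like the sketch below. It assumes the default column names Var1 and Freq that as.data.frame gives a one-way table, and renames them so that merge can match on var1; the name `merged` is introduced here for illustration. The counts depend on the R version's random number generator, so they may not match the output shown above.

```r
set.seed(123)
Data <- data.frame(
  var1 = factor(sample(LETTERS[1:5], size = 100, replace = TRUE)),
  var2 = 1:100
)

# as.data.frame() on a table names its columns Var1 and Freq by default,
# so rename them to line up with the column used in the join
mytable <- as.data.frame(table(Data$var1))
names(mytable) <- c("var1", "freq")

# merge() joins on the shared column, here var1
merged <- merge(Data, mytable, by = "var1")

# Keep categories with more than 20 rows and drop the unused levels
mysubset <- droplevels(subset(merged, freq > 20))
```

Unlike the plyr join, merge sorts the result by the key column by default, so the original row order is not preserved.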

Hope this helps.

Answer4:

This is how I managed to do it. I sorted the table of factor levels and their associated counts.

studenttable <- sort(studenttable, decreasing=TRUE)

Now that it's in order, we can use index ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data.

sum(studenttable > 1000)
# 230
sum(studenttable < 1000)
# 344
# 344 + 230 = 574

Now we know the first 230 factor levels are the ones we care about. So, we can do

idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx, ]

We can verify it worked by doing

bigstudenttable <- table(bigdata$Anon.Student.Id)

to get a printout and see that all factor levels with fewer than 1000 instances now have a count of 0.
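The hand-counted range 1:230 can be avoided by selecting names with the condition itself, which also makes the sort optional. The sketch below uses made-up ids and a small hypothetical `cutoff` in place of 1000; `data_toy` and its `step` column are invented for illustration. Note that a strict > (like the > and < counts above) leaves out any level whose count equals the cutoff exactly.

```r
# Made-up ids and counts; the real data uses a cutoff of 1000
data_toy <- data.frame(
  Anon.Student.Id = factor(rep(c("s1", "s2", "s3", "s4"), times = c(6, 1, 4, 3))),
  step = seq_len(14)
)
cutoff <- 3  # hypothetical; "s4" has exactly 3 rows and is excluded by >

studenttable <- sort(table(data_toy$Anon.Student.Id), decreasing = TRUE)

# Select names by the condition itself rather than by a hand-counted range
idx <- names(studenttable)[studenttable > cutoff]

# Subset the rows and drop the now-empty levels in one step
bigdata <- droplevels(data_toy[data_toy$Anon.Student.Id %in% idx, ])

# Every remaining level should now exceed the cutoff
all_above <- all(table(bigdata$Anon.Student.Id) > cutoff)
```

With droplevels applied, the verification table no longer shows zero-count levels at all, so the check reduces to confirming that every remaining count is above the cutoff.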
