
# remove factors with criteria

I'm working with the KDD Cup 2010 data (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In R, how can I remove the rows belonging to a factor level that has a low total number of instances?

I've tried the following. First, I create a table for the student name factor:

```studenttable <- table(data$Anon.Student.Id) ```

returns a table

```
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
      9890       7989       7665       7242       6928       6651
```

Then I can get a table that tells me whether there are more than 1000 data points for a given factor level:

```biginstances <- studenttable>1000 ```

Then I tried taking a subset of the data based on this table:

```bigdata <- subset(data, (biginstances[Anon.Student.Id])) ```

But I get weird subsets that still have the same number of factor levels as the full set. I simply want to remove the rows whose factor level isn't well represented in the dataset.

### Answer1:

There are probably more efficient ways to do this, but this should get you what you want. I didn't use the names you used, but you should be able to follow the logic just fine (hopefully!).

```
# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >= 3]
# Only look at the data that we care about
dat[dat$id %in% idx,]
```
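If you need the same filter for several columns or thresholds, the steps above can be wrapped in a small function. This is only a sketch: `keep_frequent` and its arguments are hypothetical names, not part of base R, and the toy data stands in for the real set.

```r
# Hypothetical helper (not a base R function): keep the rows whose value
# in `column` occurs at least `min_count` times in the data frame.
keep_frequent <- function(df, column, min_count) {
  counts <- table(df[[column]])
  keep <- names(counts)[counts >= min_count]
  df[df[[column]] %in% keep, , drop = FALSE]
}

# Toy data: "a" occurs once, "b" twice, ..., "e" five times
dat <- data.frame(id = factor(rep(letters[1:5], 1:5)), y = seq_len(15))
big <- keep_frequent(dat, "id", 3)
table(big$id)
```

Note that the unused levels are still listed (with a count of 0) in the result, because subsetting alone does not change the levels of a factor.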

### Answer2:

@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.

```biginstances <- studenttable>1000 ```

This will create a logical vector whose length is equal to the number of unique student IDs. `studenttable` contained a count for each unique value of `data$Anon.Student.Id`. When you try to use that logical vector in `subset`:

```bigdata <- subset(data, (biginstances[Anon.Student.Id])) ```

its length is almost surely much less than the number of rows in `data`. And since the subsetting criterion in `subset` is meant to identify rows of `data`, R's recycling rules take over and you get 'weird' looking subsets.
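One way to be sure the logical vector has exactly one entry per row is to look up each row's ID by name in `biginstances`. A minimal sketch on made-up data (the toy IDs, the `score` column, and the threshold of 2 are assumptions for illustration):

```r
# Toy data; the column name mirrors the question, the values are made up
data <- data.frame(
  Anon.Student.Id = factor(c("s1", "s2", "s1", "s3", "s1", "s2")),
  score = 1:6
)

studenttable <- table(data$Anon.Student.Id)
biginstances <- studenttable >= 2  # threshold lowered for the toy data

# Look up each row's flag by the ID's name, not by position;
# as.vector() strips the table attributes, leaving a plain logical vector
keep <- as.vector(biginstances[as.character(data$Anon.Student.Id)])
length(keep) == nrow(data)

bigdata <- data[keep, ]
```

Converting the IDs with `as.character` matters here: indexing with a factor uses its integer codes, which only line up with the table if the table is still in level order.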

I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:

```
> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c
```

So you'll want to use `droplevels` if you want to ensure that those levels are really 'gone'. Also, see `options(stringsAsFactors = FALSE)`.
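The same applies to a whole data frame: `droplevels` also has a data frame method that drops unused levels from every factor column. A small sketch with made-up data:

```r
# Toy data frame with a three-level factor column
dat <- data.frame(id = factor(rep(letters[1:3], each = 3)), y = 1:9)

# Remove all rows for level "a"
sub <- dat[dat$id != "a", ]
nlevels(sub$id)  # still 3: the unused level "a" is kept
table(sub$id)    # "a" is listed with a count of 0

# Drop unused levels in all factor columns of the data frame
sub <- droplevels(sub)
levels(sub$id)
```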

### Answer3:

Another approach involves a join between your dataset and the table of counts. I'll use plyr here, but it can also be done with base functions (like `merge` and `as.data.frame.table`).

```
require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
                   var2 = 1:100)

R> table(Data$var1)

 A  B  C  D  E
19 20 21 22 18

## goal: drop the rows whose category occurs 20 times or fewer
mytable <- count(Data, vars = "var1")
## mytable <- as.data.frame(table(Data$var1))
R> str(mytable)
'data.frame':   5 obs. of  2 variables:
 $ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ freq: int  19 20 21 22 18

Data <- join(Data, mytable)
## Data <- merge(Data, mytable)
R> str(Data)
'data.frame':   100 obs. of  3 variables:
 $ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
 $ var2: int  1 2 3 4 5 6 7 8 9 10 ...
 $ freq: int  21 20 21 18 21 18 18 22 21 19 ...

mysubset <- droplevels(subset(Data, freq > 20))
R> table(mysubset$var1)

 C  D
21 22
```
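For reference, here is a rough base-R sketch of the same join idea without plyr. The renaming step is an assumption on my part: `as.data.frame` on a table names its columns `Var1` and `Freq`, so they are renamed to match before merging. Note that `merge` sorts the result by the key, so the row order differs from the original data.

```r
set.seed(123)
Data <- data.frame(
  var1 = factor(sample(LETTERS[1:5], size = 100, replace = TRUE)),
  var2 = 1:100
)

# One row per category with its count
counts <- as.data.frame(table(Data$var1))
names(counts) <- c("var1", "freq")

# Attach the count of its category to every row
merged <- merge(Data, counts, by = "var1")

# Keep the categories with more than 20 rows and drop the unused levels
mysubset <- droplevels(subset(merged, freq > 20))
table(mysubset$var1)
```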

Hope this helps.

### Answer4:

This is how I managed to do it. I sorted the table of factor levels and their associated counts:

```studenttable <- sort(studenttable, decreasing=TRUE) ```

Now that it's in order, we can select ranges of positions sensibly. So I got the number of factor levels that are represented more than 1000 times in the data:

```
sum(studenttable>1000)
# [1] 230
sum(studenttable<1000)
# [1] 344
# 344 + 230 = 574
```

Now we know the first 230 factor levels are the ones we care about, so we can do:

```
idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
```

We can verify it worked by doing:

```bigstudenttable <- table(bigdata$Anon.Student.Id) ```

Printing `bigstudenttable` shows that all the factor levels with fewer than 1000 instances now have a count of 0.
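The hard-coded 230 can be avoided by selecting the names directly from the threshold, and the check can be done in code instead of by eye. A sketch on toy data, where the IDs and the lowered threshold of 2 are assumptions standing in for the real set and 1000:

```r
# Toy stand-in for the real data: s1 x5, s2 x3, s3 x2, s4 x1
data <- data.frame(
  Anon.Student.Id = factor(rep(c("s1", "s2", "s3", "s4"), c(5, 3, 2, 1)))
)
threshold <- 2  # would be 1000 for the real data

studenttable <- sort(table(data$Anon.Student.Id), decreasing = TRUE)

# Pick the IDs by count rather than by position in the sorted table
idx <- names(studenttable)[studenttable > threshold]

# Subset, then drop the levels that no longer occur
bigdata <- droplevels(data[data$Anon.Student.Id %in% idx, , drop = FALSE])

# Verify: every remaining level is above the threshold, none are empty
bigstudenttable <- table(bigdata$Anon.Student.Id)
all(bigstudenttable > threshold)
nlevels(bigdata$Anon.Student.Id) == length(idx)
```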