I'm working with the KDD Cup 2010 data (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In R, how can I remove the rows whose factor level has a low total number of instances?

I've tried the following. First, I create a table for the student name factor:

```
studenttable <- table(data$Anon.Student.Id)
```

which returns a table:

```
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
9890 7989 7665 7242 6928 6651
```

Then I can get a table that tells me whether there are more than 1000 data points for a given factor level:

```
biginstances <- studenttable>1000
```

Then I tried making a subset of the data based on this query:

```
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
```

But I get weird subsets that still have the same number of factor levels as the full set. I'm simply interested in removing the rows whose factor level isn't well represented in the dataset.

### Answer1:

There are probably more efficient ways to do this, but this should get you what you want. I didn't use the names you used, but you should be able to follow the logic just fine (hopefully!).

```
# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >=3]
# Only look at the data that we care about
dat[dat$id %in% idx,]
```
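As a possible variation on the same idea, here is a minimal sketch that computes the group size for every row with `ave()` and then drops the unused levels. It assumes `id` is stored as a factor; the names `n_per_row` and `kept` are made up for illustration.

```r
# Fake data in the same shape as the example above (id stored as a factor)
dat <- data.frame(id = factor(rep(letters[1:5], 1:5)), y = rnorm(15))

# For every row, the number of rows that share that row's id
n_per_row <- ave(seq_along(dat$id), dat$id, FUN = length)

# Keep rows whose id occurs at least 3 times, then drop the now-empty levels
kept <- droplevels(dat[n_per_row >= 3, ])
table(kept$id)
```

This skips the intermediate vector of names, at the cost of being a little less explicit about which ids were kept.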

### Answer2:

@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.

```
biginstances <- studenttable>1000
```

This will create a logical vector whose length is equal to the number of unique student ids. `studenttable` contained a count for each unique value of `data$Anon.Student.Id`. When you try to use that logical vector in `subset`:

```
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
```

its length is almost surely much less than the number of rows in `data`. And since the subsetting criterion in `subset` is meant to identify rows of `data`, R's recycling rules take over and you get 'weird'-looking subsets.
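To make the length mismatch concrete, here is a minimal sketch on made-up data (`dat`, `big`, and `keep` are illustrative names, not from the question). It shows that the comparison yields one value per level, and one way to expand that to one value per row by looking up each row's id by name.

```r
# Toy data: level "a" occurs once, "b" twice, "c" three times
dat <- data.frame(id = factor(c("a", "b", "b", "c", "c", "c")), y = 1:6)
tab <- table(dat$id)

# One TRUE/FALSE per *level*, stored as a plain named logical vector
big <- setNames(as.vector(tab >= 2), names(tab))
length(big)  # 3 levels
nrow(dat)    # 6 rows

# Look up each row's id by name to get one TRUE/FALSE per *row*
keep <- unname(big[as.character(dat$id)])
kept <- dat[keep, ]
```

Indexing with `as.character(dat$id)` is a deliberate choice here: indexing with the factor itself would use its integer codes rather than the level names.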

I would also add that taking subsets to remove rare factor levels will *not change the levels attribute of the factor*. In other words, you'll get a factor back with no *instances* of that level, but all of the original factor levels will remain in the levels attribute. For example:

```
> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c
```

So you'll want to use `droplevels` if you want to ensure that those levels are really 'gone'. Also, see `options(stringsAsFactors = FALSE)`.
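The same applies to whole data frames. A minimal sketch, using a made-up data frame `dat`, of how `droplevels` can be applied after subsetting rows:

```r
# Toy data frame with a three-level factor column
dat <- data.frame(id = factor(rep(letters[1:3], each = 3)), y = 1:9)

# Subsetting removes the rows but keeps the level "a" in the attribute
sub <- dat[dat$id != "a", ]
levels_before <- nlevels(sub$id)  # 3

# droplevels() on a data frame drops unused levels from its factor columns
sub <- droplevels(sub)
levels_after <- nlevels(sub$id)   # 2
```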

### Answer3:

Another approach involves a join between your dataset and the table of counts. I'll use plyr here, but it can also be done with base functions (like `merge` and `as.data.frame.table`).

```
require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
                   var2 = 1:100)
table(Data$var1)
##  A  B  C  D  E
## 19 20 21 22 18

## count the rows in each category
mytable <- count(Data, vars = "var1")
# or: mytable <- as.data.frame(table(Data$var1))
str(mytable)
## 'data.frame': 5 obs. of 2 variables:
##  $ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
##  $ freq: int 19 20 21 22 18

## attach each row's category count to the data
Data <- join(Data, mytable)
# or: Data <- merge(Data, mytable)
str(Data)
## 'data.frame': 100 obs. of 3 variables:
##  $ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
##  $ var2: int 1 2 3 4 5 6 7 8 9 10 ...
##  $ freq: int 21 20 21 18 21 18 18 22 21 19 ...

## keep only the categories with more than 20 rows
mysubset <- droplevels(subset(Data, freq > 20))
table(mysubset$var1)
##  C  D
## 21 22
```
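For reference, a base-R-only sketch of the same join might look like the following. The names `counts` and `merged` are illustrative, and the column is converted to a factor explicitly, which is an assumption about how the data is stored. The sampled counts depend on the R version's random number generator, so no output is shown.

```r
set.seed(123)
Data <- data.frame(var1 = factor(sample(LETTERS[1:5], size = 100, replace = TRUE)),
                   var2 = 1:100)

# Naming the table's dimension and setting responseName gives columns
# called var1 and freq, so merge() can match on var1 directly
counts <- as.data.frame(table(var1 = Data$var1), responseName = "freq")

# merge() attaches the per-category count to every row
merged <- merge(Data, counts, by = "var1")

# Keep only the categories with more than 20 rows and drop the empty levels
mysubset <- droplevels(subset(merged, freq > 20))
```

Note that `merge` sorts the result by the key column, so the original row order is not preserved.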

Hope this helps.

### Answer4:

This is how I managed to do it. I sorted the table of factor levels and their associated counts.

```
studenttable <- sort(studenttable, decreasing=TRUE)
```

Now that it's in order, we can use index ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data.

```
sum(studenttable > 1000)
# [1] 230
sum(studenttable < 1000)
# [1] 344
# 344 + 230 = 574
```

Now we know the first 230 factor levels are the ones we care about. So, we can do

```
idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
```

We can verify it worked by doing

```
bigstudenttable <- table(bigdata$Anon.Student.Id)
```

to get a printout and see that all the factor levels with fewer than 1000 instances now have a count of 0.
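A possible refinement, sketched here on made-up data since the KDD file isn't reproduced in this thread: select the names directly with the logical condition instead of hard-coding the position 230, and call `droplevels` so the zero-count levels disappear from the table. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical stand-in for the real data: two frequent students, one rare
toy <- data.frame(
  Anon.Student.Id = factor(rep(c("s1", "s2", "s3"), c(1500, 1200, 40))),
  y = seq_len(2740)
)
studenttable <- sort(table(toy$Anon.Student.Id), decreasing = TRUE)

# Pick the well-represented ids by condition rather than by position
idx <- names(studenttable)[studenttable > 1000]

# Keep the matching rows and drop the levels that no longer occur
bigdata <- droplevels(toy[toy$Anon.Student.Id %in% idx, ])
bigstudenttable <- table(bigdata$Anon.Student.Id)
```

With the unused levels dropped, `bigstudenttable` lists only the students that were kept rather than showing the others with a count of 0.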