77947

How to change few column names in a data table

Question:

I have a data table with 10 columns.

town tc one two three four five six seven total

Need to generate mean for columns "one" to "total" for which I am using,

DTmean <- DT[,(lapply(.SD,mean)),by = .(town,tc),.SDcols=3:10]

This generates the mean, but then I want the column names to be suffixed with "_mean". How can we do this? Want the first two columns to remain the same as "town" and "tc". I tried the below but then it renames all "one" to "total" to just "_mean"

for (i in 3:10) { setnames(DTmean,i,paste0(names(i),"_mean")) }

Answer1:

If you want to do it the data.table way, you should use setnames as follows:

setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))

or:

cols <- names(DT)[3:10] setnames(DTmean, cols, paste0(cols, '_mean'))

Furthermore, you don't need the .SDcols statement as you are aggregating all the other columns. Using DT[, lapply(.SD,mean), by = .(town,tc)] should thus give you the same result as using DT[, (lapply(.SD,mean)), by = .(town,tc), .SDcols=3:10].

<hr />

On the following example dataset:

set.seed(71) DT <- data.table(town = rep(c('A','B'), each=10), tc = rep(c('C','D'), 10), one = rnorm(20,1,1), two = rnorm(20,2,1), three = rnorm(20,3,1), four = rnorm(20,4,1), five = rnorm(20,5,2), six = rnorm(20,6,2), seven = rnorm(20,7,2), total = rnorm(20,28,3))

using:

DTmean <- DT[, lapply(.SD,mean), by = .(town,tc)] setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))

gives:

> DTmean town tc one_mean two_mean three_mean four_mean five_mean six_mean seven_mean total_mean 1: A C 1.7368898 1.883586 3.358440 4.849896 4.742609 5.089877 6.792513 29.20286 2: A D 0.8906842 1.826135 3.267684 3.760931 6.210145 7.320693 5.571687 26.56142 3: B C 1.4037955 2.474836 2.587920 3.719658 3.446612 6.510183 8.309784 27.80012 4: B D 0.8103511 1.153000 3.360940 3.945082 5.555999 6.198380 8.652779 28.95180 <hr />

In reply to your comment: If you want to calculate both the mean and the sd simultanuously, you could do (adapted from my answer <a href="https://stackoverflow.com/a/34724430/2204410" rel="nofollow">here</a>):

DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))), by = .(town,tc)]

which gives:

town tc one.mean one.sd two.mean two.sd three.mean three.sd four.mean four.sd five.mean five.sd six.mean six.sd seven.mean seven.sd total.mean total.sd 1: A C 0.2981842 0.3556520 1.578174 0.7788545 2.232366 0.9047046 4.896201 1.238877 4.625866 0.7436584 7.607439 1.7262628 7.949366 1.772771 28.94287 3.902602 2: A D 1.2099018 1.0205252 1.686068 1.5497989 2.671027 0.8323733 4.811279 1.404794 7.235969 0.7883873 6.765797 2.7719942 6.657298 1.107843 27.42563 3.380785 3: B C 0.9238309 0.6679821 2.525485 0.8054734 3.138298 1.0111270 3.876207 0.573342 3.843140 2.1991052 4.942155 0.7784024 6.783383 2.595116 28.95243 1.078307 4: B D 0.8843948 0.9384975 1.988908 1.0543981 3.673393 1.3505701 3.957534 1.097837 2.788119 1.9089660 6.463784 0.7642144 6.416487 2.041441 27.88205 3.807119

However, it is highly probable better to store this in long format. To get this you could use data.table's melt function as follows:

cols <- names(DT)[3:10] DT2 <- melt(DT[, as.list(unlist(lapply(.SD, function(x) list(mn = mean(x), sdev = sd(x))))), by = .(town,tc)], id.vars = c('town','tc'), measure.vars = patterns('.mn','.sdev'), value.name = c('mn','sdev'))[, variable := cols[variable]]

or in a much simpler operation:

DT2 <- melt(DT, id.vars = c('town','tc'))[, .(mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)]

which results in:

> DT2 town tc variable mn sdev 1: A C one 0.2981842 0.3556520 2: A D one 1.2099018 1.0205252 3: B C one 0.9238309 0.6679821 4: B D one 0.8843948 0.9384975 5: A C two 1.5781743 0.7788545 6: A D two 1.6860675 1.5497989 7: B C two 2.5254855 0.8054734 8: B D two 1.9889082 1.0543981 9: A C three 2.2323655 0.9047046 10: A D three 2.6710267 0.8323733 11: B C three 3.1382982 1.0111270 12: B D three 3.6733929 1.3505701 ..... <hr />

In response to your last comments, you can detect outliers as follows:

DT3 <- melt(DT, id.vars = c('town','tc')) DT3[, `:=` (mn = mean(value), sdev = sd(value)), by = .(town,tc,variable) ][, outlier := +(value < mn - sdev | value > mn + sdev)]

which gives:

town tc variable value mn sdev outlier 1: A C one 0.5681578 0.2981842 0.355652 0 2: A D one 0.5528128 1.2099018 1.020525 0 3: A C one 0.5214274 0.2981842 0.355652 0 4: A D one 1.4171454 1.2099018 1.020525 0 5: A C one 0.5820994 0.2981842 0.355652 0 --- 156: B D total 23.4462542 27.8820524 3.807119 1 157: B C total 30.5934956 28.9524305 1.078307 1 158: B D total 30.5618759 27.8820524 3.807119 0 159: B C total 27.5940307 28.9524305 1.078307 1 160: B D total 24.8378437 27.8820524 3.807119 0

Recommend