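The aggregate Function in R
Introduction
This function is quite powerful: it first splits the data into groups (by row), then applies a summary function to each group, and finally combines the results into a tidy table. It has three usages depending on the type of data object, for data frames (data.frame), formulas (formula), and time series (ts):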
aggregate(x, by, FUN, ..., simplify = TRUE)
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)
Syntax
aggregate(x, ...)

## S3 method for class 'default':
aggregate(x, ...)

## S3 method for class 'data.frame':
aggregate(x, by, FUN, ..., simplify = TRUE)

## S3 method for class 'formula':
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)

## S3 method for class 'ts':
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)

### For details, see ?aggregate
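To make the difference between the data.frame and formula interfaces concrete, here is a minimal sketch on a toy data frame. The data frame df and its columns group and value are invented for illustration; they are not part of any built-in data set.

```r
# Hypothetical toy data, invented for illustration only
df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(1, 3, 10, 20))

# data.frame method: x holds the columns to summarise,
# by is a (named) list of grouping vectors
res_df <- aggregate(df["value"], by = list(group = df$group), FUN = mean)

# formula method: response ~ grouping variable(s), looked up in `data`
res_formula <- aggregate(value ~ group, data = df, FUN = mean)

print(res_df)
print(res_formula)
```

Both calls should give one row per group with the group mean in the value column; the formula form is usually shorter when only a few columns are involved.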
Example1
We get a basic feel for this function by working with the mtcars data set. mtcars is a data frame of road test data for different car models:
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
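First use attach to add the column names of mtcars to the variable search path, then use aggregate to compute the mean of every column, grouped by cyl (number of cylinders):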
> attach(mtcars)
> aggregate(mtcars, by=list(cyl), FUN=mean)
  Group.1      mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear     carb
1       4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2       6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3       8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
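attach leaves the columns of mtcars on the search path, where they can mask or be masked by other objects. As a hedged alternative, the same grouping can be written without attach by referring to the column explicitly. Naming the list element also replaces the default Group.1 header; the name cyl_group is chosen here only for illustration.

```r
# Same grouping as above, but without attach():
# the grouping vector is taken explicitly from the data frame,
# and the list element name becomes the name of the grouping column.
by_cyl <- aggregate(mtcars, by = list(cyl_group = mtcars$cyl), FUN = mean)

print(by_cyl[, c("cyl_group", "mpg", "hp")])
```

The result should contain one row per cylinder count, with cyl_group in place of Group.1.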
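The by argument can also contain several factors, in which case the result has one row for each combination of factor levels that occurs in the data: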
> aggregate(mtcars, by=list(cyl, gear), FUN=mean)
  Group.1 Group.2    mpg cyl     disp       hp     drat       wt    qsec  vs   am gear     carb
1       4       3 21.500   4 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00    3 1.000000
2       6       3 19.750   6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00    3 1.000000
3       8       3 15.050   8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00    3 3.083333
4       4       4 26.925   4 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75    4 1.500000
5       6       4 19.750   6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50    4 4.000000
6       4       5 28.200   4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00    5 2.000000
7       6       5 19.700   6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00    5 6.000000
8       8       5 15.400   8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00    5 6.000000
A formula is a special kind of R object. Using a formula argument in aggregate lets you summarise only selected variables of a data frame:
> aggregate(cbind(mpg,hp) ~ cyl+gear, FUN=mean)
  cyl gear    mpg       hp
1   4    3 21.500  97.0000
2   6    3 19.750 107.5000
3   8    3 15.050 194.1667
4   4    4 26.925  76.0000
5   6    4 19.750 116.5000
6   4    5 28.200 102.0000
7   6    5 19.700 175.0000
8   8    5 15.400 299.5000
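The formula cbind(mpg,hp) ~ cyl+gear above means: summarise the cbind(mpg,hp) data for each combination of the cyl and gear factors. The call has no data argument, so it relies on the earlier attach(mtcars) to find the variables. For the use of aggregate on time series data, see R's help page for the function.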
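For readers who want a quick look at the time series method before opening the help page, here is a small sketch. The series monthly is made up for illustration (the integers 1 to 24 as two years of monthly values starting in January 2020), so the aggregated values are easy to check by hand.

```r
# Hypothetical monthly series: the values 1..24 over two years
monthly <- ts(1:24, frequency = 12, start = c(2020, 1))

# Sum each block of three months to get a quarterly series
quarterly <- aggregate(monthly, nfrequency = 4, FUN = sum)

# Average each block of twelve months to get a yearly series
yearly <- aggregate(monthly, nfrequency = 1, FUN = mean)

print(quarterly)
print(yearly)
```

Here nfrequency is the number of observations per unit of time in the result, so it has to divide the frequency of the original series.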
Example2
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)

## Compute the averages according to region and the occurrence of more
## than 130 days of frost.
aggregate(state.x77,
          list(Region = state.region,
               Cold = state.x77[,"Frost"] > 130),
          mean)
## (Note that no state in 'South' is THAT cold.)

## example with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
aggregate(x = testDF, by = list(by1, by2), FUN = "mean")

# and if you want to treat NAs as a group
fby1 <- factor(by1, exclude = "")
fby2 <- factor(by2, exclude = "")
aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")

## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
aggregate(weight ~ feed, data = chickwts, mean)
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

## Dot notation:
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)

## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
xtabs(len ~ ., data = ag)

## Compute the average annual approval ratings for American presidents.
aggregate(presidents, nfrequency = 1, FUN = mean)
## Give the summer less weight.
aggregate(presidents, nfrequency = 1, FUN = weighted.mean, w = c(1, 1, 0.5, 1))
Example3
------------------------------------------------------
#load data
data <- ChickWeight
head(data)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

#dimension of the data
dim(data)
[1] 578   4

#how many chickens
unique(data$Chick)
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

#how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4

#how many time points
unique(data$Time)
 [1]  0  2  4  6  8 10 12 14 16 18 20 21

library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
  geom_line() +
  geom_point()
------------------------------------------------------
## S3 method for class 'data.frame'
## aggregate(x, by, FUN, ..., simplify = TRUE)

#find the mean weight depending on diet
aggregate(data$weight, list(diet = data$Diet), mean)
  diet        x
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627

#aggregate on time
aggregate(data$weight, list(time=data$Time), mean)
   time         x
1     0  41.06000
2     2  49.22000
3     4  59.95918
4     6  74.30612
5     8  91.24490
6    10 107.83673
7    12 129.24490
8    14 143.81250
9    16 168.08511
10   18 190.19149
11   20 209.71739
12   21 218.68889

#use a different function
aggregate(data$weight, list(time=data$Time), sd)
   time         x
1     0  1.132272
2     2  3.688316
3     4  4.495179
4     6  9.012038
5     8 16.239780
6    10 23.987277
7    12 34.119600
8    14 38.300412
9    16 46.904079
10   18 57.394757
11   20 66.511708
12   21 71.510273

#we could also aggregate on time and diet
head(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean
               )
     )
  time diet        x
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263

tail(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean
               )
     )
   time diet        x
43   12    4 151.4000
44   14    4 161.8000
45   16    4 182.0000
46   18    4 202.9000
47   20    4 233.8889
48   21    4 238.5556

#to see the weights over time across different diets
ggplot(data) +
  geom_line(aes(x=Time, y=weight, colour=Chick)) +
  facet_wrap(~Diet) +
  guides(col=guide_legend(ncol=3))
------------------------------------------------------
Example4
The aggregate function is more difficult to use, but it is included in the base R installation and does not require the installation of another package.
# Get a count of number of subjects in each category (sex*condition)
cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
cdata
#>   sex condition subject
#> 1   F   aspirin       5
#> 2   M   aspirin       9
#> 3   F   placebo      12
#> 4   M   placebo       4

# Rename "subject" column to "N"
names(cdata)[names(cdata)=="subject"] <- "N"
cdata
#>   sex condition  N
#> 1   F   aspirin  5
#> 2   M   aspirin  9
#> 3   F   placebo 12
#> 4   M   placebo  4

# Sort by sex first
cdata <- cdata[order(cdata$sex),]
cdata
#>   sex condition  N
#> 1   F   aspirin  5
#> 3   F   placebo 12
#> 2   M   aspirin  9
#> 4   M   placebo  4

# We also keep the __before__ and __after__ columns:
# Get the average effect size by sex and condition
cdata.means <- aggregate(data[c("before","after","change")],
                         by = data[c("sex","condition")], FUN=mean)
cdata.means
#>   sex condition   before     after    change
#> 1   F   aspirin 11.06000  7.640000 -3.420000
#> 2   M   aspirin 11.26667  5.855556 -5.411111
#> 3   F   placebo 10.13333  8.075000 -2.058333
#> 4   M   placebo 11.47500 10.500000 -0.975000

# Merge the data frames
cdata <- merge(cdata, cdata.means)
cdata
#>   sex condition  N   before     after    change
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000
#> 2   F   placebo 12 10.13333  8.075000 -2.058333
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111
#> 4   M   placebo  4 11.47500 10.500000 -0.975000

# Get the sample (n-1) standard deviation for "change"
cdata.sd <- aggregate(data["change"],
                      by = data[c("sex","condition")], FUN=sd)
# Rename the column to change.sd
names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
cdata.sd
#>   sex condition change.sd
#> 1   F   aspirin 0.8642916
#> 2   M   aspirin 1.1307569
#> 3   F   placebo 0.5247655
#> 4   M   placebo 0.7804913

# Merge
cdata <- merge(cdata, cdata.sd)
cdata
#>   sex condition  N   before     after    change change.sd
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916
#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569
#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913

# Calculate standard error of the mean
cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
cdata
#>   sex condition  N   before     after    change change.sd change.se
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
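The count, mean, standard deviation and standard error can also be obtained from a single aggregate call by letting FUN return a named vector. The sketch below uses the built-in mtcars data as a stand-in, because the data frame used in this example is not defined in the text. The do.call(data.frame, ...) step is one common idiom, not the only one, for turning the resulting matrix column into ordinary columns.

```r
# One call computing several statistics per group.
# mtcars is used as a stand-in data set for illustration.
stats <- aggregate(mpg ~ cyl, data = mtcars,
                   FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v)))

# When FUN returns a vector, aggregate stores the results as a matrix
# column; do.call(data.frame, ...) expands it into mpg.n, mpg.mean, mpg.sd.
flat <- do.call(data.frame, stats)

# Standard error of the mean, as in the step-by-step version
flat$mpg.se <- flat$mpg.sd / sqrt(flat$mpg.n)

print(flat)
```

This avoids the repeated merge steps, at the cost of less readable column names.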
If you have NA’s in your data and wish to skip them, use na.rm=TRUE:
cdata.means <- aggregate(data[c("before","after","change")],
                         by = data[c("sex","condition")],
                         FUN=mean, na.rm=TRUE)
cdata.means
#>   sex condition   before     after    change
#> 1   F   aspirin 11.06000  7.640000 -3.420000
#> 2   M   aspirin 11.26667  5.855556 -5.411111
#> 3   F   placebo 10.13333  8.075000 -2.058333
#> 4   M   placebo 11.47500 10.500000 -0.975000
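The effect of na.rm=TRUE is easier to see on data that actually contains a missing value. The sketch below uses a made-up data frame toy (columns g and v are hypothetical). The extra argument is simply passed through ... to mean.

```r
# Hypothetical toy data with one missing value in group "a"
toy <- data.frame(g = c("a", "a", "b", "b"),
                  v = c(1, NA, 3, 5))

# Without na.rm, the mean of a group containing NA is NA
with_na <- aggregate(toy["v"], by = list(g = toy$g), FUN = mean)

# na.rm = TRUE is forwarded to mean(), which then skips the NA
without_na <- aggregate(toy["v"], by = list(g = toy$g), FUN = mean, na.rm = TRUE)

# The formula method drops incomplete rows by default (na.action = na.omit)
formula_res <- aggregate(v ~ g, data = toy, FUN = mean)

print(with_na)
print(without_na)
print(formula_res)
```

Note that the formula method's default na.action removes every row with an NA in any of the variables used, which can differ from per-column na.rm when several columns are summarised at once.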