I have a data frame:
Title Date year lai biomass grain_wt wet_yield
1 HartogSowN 2014-07-31 2014 4.4 NA NA NA
2 HartogMild 2014-07-31 2014 3.7 NA NA NA
3 HartogSevere 2014-07-31 2014 2.3 NA NA NA
4 HartogSowN 2014-08-12 2014 6.1 NA NA NA
5 HartogMild 2014-08-12 2014 6.6 NA NA NA
6 HartogSevere 2014-08-12 2014 3.8 NA NA NA
7 HartogSowN 2014-11-10 2014 NA 16116 NA NA
8 HartogMild 2014-11-10 2014 NA 18224 NA NA
9 HartogSevere 2014-11-10 2014 NA 18184 NA NA
10 HartogSowN 2014-11-10 2014 NA NA 0.041 NA
11 HartogMild 2014-11-10 2014 NA NA 0.040 NA
12 HartogSevere 2014-11-10 2014 NA NA 0.038 NA
13 HartogSowN 2014-08-12 2014 NA 4511 NA NA
14 HartogMild 2014-08-12 2014 NA 4525 NA NA
15 HartogSevere 2014-08-12 2014 NA 3167 NA NA
16 HartogSowN 2014-07-31 2014 NA 2837 NA NA
17 HartogMild 2014-07-31 2014 NA 2444 NA NA
18 HartogSevere 2014-07-31 2014 NA 1940 NA NA
19 HartogSowN 2014-11-10 2014 NA NA NA 8457.4
20 HartogMild 2014-11-10 2014 NA NA NA 8662.4
21 HartogSevere 2014-11-10 2014 NA NA NA 8537.8
22 HartogSowN 2014-11-10 2014 NA NA NA NA
23 HartogMild 2014-11-10 2014 NA NA NA NA
24 HartogSevere 2014-11-10 2014 NA NA NA NA
structure(list(Title = c("HartogSowN", "HartogMild", "HartogSevere",
"HartogSowN", "HartogMild", "HartogSevere", "HartogSowN",
"HartogMild", "HartogSevere", "HartogSowN", "HartogMild",
"HartogSevere", "HartogSowN", "HartogMild", "HartogSevere",
"HartogSowN", "HartogMild", "HartogSevere", "HartogSowN",
"HartogMild", "HartogSevere", "HartogSowN", "HartogMild",
"HartogSevere"), Date = structure(c(16282, 16282, 16282, 16294,
16294, 16294, 16384, 16384, 16384, 16384, 16384, 16384, 16294, 16294,
16294, 16282, 16282, 16282, 16384, 16384, 16384, 16384, 16384,
16384), class = "Date"), year = c(2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014), lai = c(4.4,
3.7, 2.3, 6.1, 6.6, 3.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), biomass = c(NA, NA, NA, NA, NA, NA,
16116, 18224, 18184, NA, NA, NA, 4511, 4525, 3167, 2837, 2444, 1940,
NA, NA, NA, NA, NA, NA), grain_wt = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0.041, 0.04, 0.038, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), wet_yield = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 8457.4,
8662.4, 8537.8, NA, NA, NA)), .Names = c("Title", "Date", "year", "lai", "biomass", "grain_wt", "wet_yield"), row.names = c(NA, 24L),
class = "data.frame")
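I want to collapse the rows so that all the data for a given Title and Date combination sit on a single row, with the extra rows removed. I found answers to similar questions, but they all involved modifying the original data.
Desired output: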
Title Date year lai biomass grain_wt wet_yield
1 HartogSowN 2014-07-31 2014 4.4 2837 NA NA
2 HartogMild 2014-07-31 2014 3.7 2444 NA NA
3 HartogSevere 2014-07-31 2014 2.3 1940 NA NA
4 HartogSowN 2014-08-12 2014 6.1 4511 NA NA
5 HartogMild 2014-08-12 2014 6.6 4525 NA NA
6 HartogSevere 2014-08-12 2014 3.8 3167 NA NA
7 HartogSowN 2014-11-10 2014 NA 16116 0.041 8457.4
8 HartogMild 2014-11-10 2014 NA 18224 0.040 8662.4
9 HartogSevere 2014-11-10 2014 NA 18184 0.038 8537.8
22 HartogSowN 2014-11-10 2014 NA NA NA NA
23 HartogMild 2014-11-10 2014 NA NA NA NA
24 HartogSevere 2014-11-10 2014 NA NA NA NA
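That is, with the extra rows that held biomass, grain_wt and wet_yield removed.
Update: Thanks Pascal, yes, the days should match, my mistake. I have updated the desired result.
Update 2: Added the full desired output for clarity.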
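Answer 0 (score: 3)
Consider the following base R solution using aggregate(). It uses median as the aggregating function, but any aggregate (mean, min, max, etc.) should work as long as each group has at most one non-missing value per column. The functions differ in how they handle groups where every value is NA.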
# Aggregated data frame: one row per Title/Date/year combination
collapsedf <- aggregate(list(lai=df$lai,
                             biomass=df$biomass,
                             grain_wt=df$grain_wt,
                             wet_yield=df$wet_yield),
                        list(Title=df$Title, Date=df$Date, year=df$year),
                        FUN=median, na.rm=TRUE)
Or, as simplified by @thelatemail:
collapsedf <- aggregate(df[c("lai","biomass","grain_wt","wet_yield")],
                        df[c("Title","Date","year")], FUN=median, na.rm=TRUE)
Output
Title Date year lai biomass grain_wt wet_yield
1 HartogMild 11/10/2014 2014 NA 18224 0.040 8662.4
2 HartogSevere 11/10/2014 2014 NA 18184 0.038 8537.8
3 HartogSowN 11/10/2014 2014 NA 16116 0.041 8457.4
4 HartogMild 7/31/2014 2014 3.7 2444 NA NA
5 HartogSevere 7/31/2014 2014 2.3 1940 NA NA
6 HartogSowN 7/31/2014 2014 4.4 2837 NA NA
7 HartogMild 8/12/2014 2014 6.6 4525 NA NA
8 HartogSevere 8/12/2014 2014 3.8 3167 NA NA
9 HartogSowN 8/12/2014 2014 6.1 4511 NA NA
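As a small illustration of how these aggregation functions differ when a group has no non-missing value for a column (as with lai on 2014-11-10), here is a sketch on toy vectors. The vectors x and y and the list res are made up for illustration and are not part of the original data.

```r
# Toy column for a group where nothing was measured (hypothetical values)
x <- c(NA_real_, NA_real_)

res <- list(
  median = median(x, na.rm = TRUE),                 # NA
  mean   = mean(x, na.rm = TRUE),                   # NaN
  min    = suppressWarnings(min(x, na.rm = TRUE)),  # Inf  (warns if not suppressed)
  max    = suppressWarnings(max(x, na.rm = TRUE))   # -Inf (warns if not suppressed)
)
print(res)

# With exactly one non-missing value per group, all four functions agree
y <- c(NA, 4.4, NA)
agree <- c(median(y, na.rm = TRUE), mean(y, na.rm = TRUE),
           min(y, na.rm = TRUE), max(y, na.rm = TRUE))
print(agree)
```

This is why median gives clean NA cells in the output above, whereas mean would show NaN and min or max would show Inf or -Inf for the same cells.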
Answer 1 (score: 3)
Assuming there is only one valid value per column for each Title/Date combination, you can use aggregate to get the desired result:
aggregate(. ~ Title + Date + year, data=df,
FUN=function(x) x[!is.na(x)][1], na.action=na.pass)
# Title Date year lai biomass grain_wt wet_yield
#1 HartogMild 2014-07-31 2014 3.7 2444 NA NA
#2 HartogSevere 2014-07-31 2014 2.3 1940 NA NA
#3 HartogSowN 2014-07-31 2014 4.4 2837 NA NA
#4 HartogMild 2014-08-12 2014 6.6 4525 NA NA
#5 HartogSevere 2014-08-12 2014 3.8 3167 NA NA
#6 HartogSowN 2014-08-12 2014 6.1 4511 NA NA
#7 HartogMild 2014-11-10 2014 NA 18224 0.040 8662.4
#8 HartogSevere 2014-11-10 2014 NA 18184 0.038 8537.8
#9 HartogSowN 2014-11-10 2014 NA 16116 0.041 8457.4
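This uses Title + Date + year as the grouping variables; the . on the left-hand side of the formula means that all the remaining data columns are aggregated.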
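The function returns just one non-missing value for each column in each group: x[!is.na(x)] keeps the non-missing values and [1] picks the first of them. The [1] is also needed to make sure NA is returned when there is no non-missing data at all. For example, numeric(0)[1] returns NA.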
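The behaviour of that anonymous function can be checked in isolation. In this minimal sketch the name first_non_na is a hypothetical label for the same expression, and the input vectors are toy values:

```r
# Hypothetical helper name; the body is the same expression used in the answer
first_non_na <- function(x) x[!is.na(x)][1]

a <- first_non_na(c(NA, 2837, NA))        # 2837: the only non-missing value
b <- first_non_na(c(NA_real_, NA_real_))  # NA: indexing an empty vector with [1]

# Without [1], the all-NA case would give a zero-length vector
# instead of a single value that fits in one cell
n_empty <- length(c(NA_real_, NA_real_)[!is.na(c(NA_real_, NA_real_))])  # 0

print(a); print(b); print(n_empty)
```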
na.action=na.pass is required because aggregate, when used with a y ~ x formula, by default discards every row that contains an NA value: na.action=na.omit is the default.
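To see the effect of na.action, here is a sketch on a made-up three-row data frame. The names toy, first_non_na, default_res and pass_res are illustrative and do not come from the question.

```r
# Toy data: group A is split across two partial rows, group B is one complete row
toy <- data.frame(
  Title   = c("A", "A", "B"),
  lai     = c(4.4, NA, 3.7),
  biomass = c(NA, 2837, 2444)
)
first_non_na <- function(x) x[!is.na(x)][1]

# Default na.action = na.omit: rows containing any NA are dropped before
# grouping, so group A disappears and only B is left
default_res <- aggregate(. ~ Title, data = toy, FUN = first_non_na)

# na.action = na.pass keeps the partial rows, so A collapses to one full row
pass_res <- aggregate(. ~ Title, data = toy, FUN = first_non_na,
                      na.action = na.pass)

print(default_res)
print(pass_res)
```

In the question's data every row contains at least one NA, so the default would leave nothing to aggregate.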