Merging similar rows in a data frame

Time: 2015-11-09 01:43:41

Tags: r

I have a data frame:

       Title       Date    year lai biomass grain_wt wet_yield
1    HartogSowN 2014-07-31 2014 4.4      NA       NA        NA
2    HartogMild 2014-07-31 2014 3.7      NA       NA        NA
3  HartogSevere 2014-07-31 2014 2.3      NA       NA        NA
4    HartogSowN 2014-08-12 2014 6.1      NA       NA        NA
5    HartogMild 2014-08-12 2014 6.6      NA       NA        NA
6  HartogSevere 2014-08-12 2014 3.8      NA       NA        NA
7    HartogSowN 2014-11-10 2014  NA   16116       NA        NA
8    HartogMild 2014-11-10 2014  NA   18224       NA        NA
9  HartogSevere 2014-11-10 2014  NA   18184       NA        NA
10   HartogSowN 2014-11-10 2014  NA      NA    0.041        NA
11   HartogMild 2014-11-10 2014  NA      NA    0.040        NA
12 HartogSevere 2014-11-10 2014  NA      NA    0.038        NA
13   HartogSowN 2014-08-12 2014  NA    4511       NA        NA
14   HartogMild 2014-08-12 2014  NA    4525       NA        NA
15 HartogSevere 2014-08-12 2014  NA    3167       NA        NA
16   HartogSowN 2014-07-31 2014  NA    2837       NA        NA
17   HartogMild 2014-07-31 2014  NA    2444       NA        NA
18 HartogSevere 2014-07-31 2014  NA    1940       NA        NA
19   HartogSowN 2014-11-10 2014  NA      NA       NA    8457.4
20   HartogMild 2014-11-10 2014  NA      NA       NA    8662.4
21 HartogSevere 2014-11-10 2014  NA      NA       NA    8537.8
22   HartogSowN 2014-11-10 2014  NA      NA       NA        NA
23   HartogMild 2014-11-10 2014  NA      NA       NA        NA
24 HartogSevere 2014-11-10 2014  NA      NA       NA        NA

structure(list(Title = c("HartogSowN", "HartogMild", "HartogSevere", 
"HartogSowN", "HartogMild", "HartogSevere", "HartogSowN",
"HartogMild",  "HartogSevere", "HartogSowN", "HartogMild",
"HartogSevere", "HartogSowN",  "HartogMild", "HartogSevere",
"HartogSowN", "HartogMild", "HartogSevere",  "HartogSowN",
"HartogMild", "HartogSevere", "HartogSowN", "HartogMild", 
"HartogSevere"), Date = structure(c(16282, 16282, 16282, 16294, 
16294, 16294, 16384, 16384, 16384, 16384, 16384, 16384, 16294,  16294,
16294, 16282, 16282, 16282, 16384, 16384, 16384, 16384,  16384,
16384), class = "Date"), year = c(2014, 2014, 2014, 2014,  2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,  2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014), lai = c(4.4, 
3.7, 2.3, 6.1, 6.6, 3.8, NA, NA, NA, NA, NA, NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA, NA, NA), biomass = c(NA, NA, NA,  NA, NA, NA,
16116, 18224, 18184, NA, NA, NA, 4511, 4525, 3167,  2837, 2444, 1940,
NA, NA, NA, NA, NA, NA), grain_wt = c(NA, NA,  NA, NA, NA, NA, NA, NA,
NA, 0.041, 0.04, 0.038, NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA,
NA), wet_yield = c(NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 8457.4, 
8662.4, 8537.8, NA, NA, NA)), .Names = c("Title", "Date", "year",  "lai", "biomass", "grain_wt", "wet_yield"), row.names = c(NA,  24L),
class = "data.frame")

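I want to collapse the rows so that all data for a given Title and Date combination sits on a single row, with the extra rows removed. I have found answers to similar questions, but they all involve some modification of the original data.

Desired output: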

       Title       Date    year lai biomass grain_wt wet_yield
1    HartogSowN 2014-07-31 2014 4.4    2837       NA        NA
2    HartogMild 2014-07-31 2014 3.7    2444       NA        NA
3  HartogSevere 2014-07-31 2014 2.3    1940       NA        NA
4    HartogSowN 2014-08-12 2014 6.1    4511       NA        NA
5    HartogMild 2014-08-12 2014 6.6    4525       NA        NA
6  HartogSevere 2014-08-12 2014 3.8    3167       NA        NA
7    HartogSowN 2014-11-10 2014  NA   16116    0.041    8457.4
8    HartogMild 2014-11-10 2014  NA   18224    0.040    8662.4
9  HartogSevere 2014-11-10 2014  NA   18184    0.038    8537.8
22   HartogSowN 2014-11-10 2014  NA      NA       NA        NA
23   HartogMild 2014-11-10 2014  NA      NA       NA        NA
24 HartogSevere 2014-11-10 2014  NA      NA       NA        NA

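That is, with the extra rows that held biomass, grain_wt and wet_yield removed.

Update: Thanks Pascal, yes, the days should match, my mistake. I have updated the desired result.

Update 2: Added the complete desired output for clarity.

2 Answers:

Answer 0: (score: 3)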

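Consider the following base R solution using aggregate(). The median is used as the summary function below. Because each Title/Date group in this data has at most one non-missing value per column, other summary functions (mean, min, max, etc.) return that same value, although they differ in what they return for a group that is entirely NA.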

# AGGREGATED DF 
collapsedf <- aggregate(list(lai=df$lai,
                             biomass=df$biomass, 
                             grain_wt=df$grain_wt, 
                             wet_yield=df$wet_yield), 
                        list(Title=df$Title, Date=df$Date, year=df$year), 
                        FUN=median, na.rm=TRUE)

Or, as simplified by @thelatemail:

collapsedf <- aggregate(df[c("lai","biomass","grain_wt","wet_yield")], 
                        df[c("Title","Date","year")], FUN=median, na.rm=TRUE)

Output

    Title           Date        year    lai   biomass    grain_wt   wet_yield
1   HartogMild      11/10/2014  2014    NA    18224      0.040      8662.4
2   HartogSevere    11/10/2014  2014    NA    18184      0.038      8537.8
3   HartogSowN      11/10/2014  2014    NA    16116      0.041      8457.4
4   HartogMild       7/31/2014  2014    3.7   2444       NA         NA
5   HartogSevere     7/31/2014  2014    2.3   1940       NA         NA
6   HartogSowN       7/31/2014  2014    4.4   2837       NA         NA
7   HartogMild       8/12/2014  2014    6.6   4525       NA         NA
8   HartogSevere     8/12/2014  2014    3.8   3167       NA         NA
9   HartogSowN       8/12/2014  2014    6.1   4511       NA         NA
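
As a rough illustration of how the choice of FUN matters for a group that has no data at all, the sketch below applies a few summary functions to a standalone all-NA vector rather than to df. The variable names are made up for this example and are not part of the answer.

```r
# A group in which every value is missing (e.g. lai on 2014-11-10)
all_na <- c(NA_real_, NA_real_)

median_result <- median(all_na, na.rm = TRUE)  # NA
mean_result   <- mean(all_na, na.rm = TRUE)    # NaN
sum_result    <- sum(all_na, na.rm = TRUE)     # 0
# max() on an empty vector warns and returns -Inf
max_result    <- suppressWarnings(max(all_na, na.rm = TRUE))

# A group with exactly one non-missing value: the functions agree
one_value <- c(NA, 4.4, NA)
agree <- c(median(one_value, na.rm = TRUE),
           mean(one_value, na.rm = TRUE),
           max(one_value, na.rm = TRUE))

print(list(median = median_result, mean = mean_result,
           sum = sum_result, max = max_result, agree = agree))
```

Only median keeps a plain NA for the empty group here, which is why it reproduces the NA cells in the output above.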

Answer 1: (score: 3)

Assuming there is only one valid value per column for each Title/Date combination, you can get the desired result with aggregate:

aggregate(. ~ Title + Date + year, data=df,
          FUN=function(x) x[!is.na(x)][1], na.action=na.pass)

#         Title       Date year lai biomass grain_wt wet_yield
#1   HartogMild 2014-07-31 2014 3.7    2444       NA        NA
#2 HartogSevere 2014-07-31 2014 2.3    1940       NA        NA
#3   HartogSowN 2014-07-31 2014 4.4    2837       NA        NA
#4   HartogMild 2014-08-12 2014 6.6    4525       NA        NA
#5 HartogSevere 2014-08-12 2014 3.8    3167       NA        NA
#6   HartogSowN 2014-08-12 2014 6.1    4511       NA        NA
#7   HartogMild 2014-11-10 2014  NA   18224    0.040    8662.4
#8 HartogSevere 2014-11-10 2014  NA   18184    0.038    8537.8
#9   HartogSowN 2014-11-10 2014  NA   16116    0.041    8457.4

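This uses Title + Date + year as the grouping variables and processes all remaining data columns (the . on the left-hand side of the formula).

The function returns just the one non-missing value for each column in each group, selected with x[!is.na(x)].

The [1] is needed to make sure NA is returned when a group has no non-missing data, e.g. numeric(0)[1] returns NA.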
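
The behaviour of the anonymous function can be checked in isolation. In this small sketch the function is given a name, first_non_na, purely for illustration (the answer itself uses it inline).

```r
# Same body as the FUN passed to aggregate() above;
# the name first_non_na is hypothetical
first_non_na <- function(x) x[!is.na(x)][1]

with_value <- first_non_na(c(NA, 2837, NA))      # the single non-missing value
no_value   <- first_non_na(c(NA_real_, NA_real_)) # nothing left after filtering

# Without [1] the second case would be a zero-length vector,
# which aggregate() cannot place in a single cell
without_index <- c(NA_real_, NA_real_)[!is.na(c(NA_real_, NA_real_))]

print(with_value)
print(no_value)
print(length(without_index))
```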

na.action=na.pass is required because aggregate, when used with the y ~ x formula interface, by default drops every row that contains an NA value - na.action=na.omit is the default.
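
The effect of na.action can be seen on a small made-up data frame. The sketch below assumes a toy data set (toy) and the hypothetical helper name first_non_na; neither comes from the question.

```r
# Toy data: group A has its values spread over two rows, group B is complete
toy <- data.frame(
  Title   = c("A", "A", "B"),
  lai     = c(4.4, NA, 3.7),
  biomass = c(NA, 2837, 2444)
)

first_non_na <- function(x) x[!is.na(x)][1]

# Default (na.action = na.omit): rows containing any NA are removed
# before grouping, so group A disappears entirely
dropped <- aggregate(. ~ Title, data = toy, FUN = first_non_na)

# na.pass keeps those rows, so group A collapses into one complete row
kept <- aggregate(. ~ Title, data = toy, FUN = first_non_na,
                  na.action = na.pass)

print(dropped)
print(kept)
```

With the default, only group B survives. With na.pass, both groups are returned and group A's two partial rows are merged into one.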