Appending and merging two datasets of unequal length in R

Time: 2014-12-11 19:32:24

Tags: r merge

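I am trying to add two variables to my dataset from another dataset of a different length. I have a coral reef survey dataset in which the start and end time of each dive, per site and survey zone, is missing.

I also have a table that contains the start and end time of each dive for every site and zone:

This table repeats wpt (the site) because two zones are surveyed at each site, so each combination of wpt and zone should appear in exactly one row. My own dataset has many more repeats of wpt, because I have several observations within the same site and zone. I need to match the unique rows of mergingdata against my fishdata and bring the start and end times from mergingdata across. So I want to match and merge by "wpt" and by "zone".

This is what I tried: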

merge<- merge(fishdata, mergingdata, by="wpt", all=TRUE, sort=FALSE)

But this only merges by wpt, and my output gets an extra column named zone.y. Is there a way to merge on the unique combination of two variables, "wpt" and "zone"?

Thanks!

1 answer:

Answer 0: (score: 2)

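The documentation for merge (see help(merge)) says:

"By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y."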

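Since both id columns (wpt and zone) are present in both data.frames, merge will use those common columns to combine the data. So simply omitting the by argument in your code should work.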

merge<- merge(fishdata, mergingdata, all=TRUE, sort=FALSE)
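To make the default concrete, here is a minimal sketch on made-up toy data (the `fish` and `times` data frames are hypothetical stand-ins, not the asker's real tables). It only illustrates that, when `by` is omitted, `merge` keys on the column names the two data frames share.

```r
# Hypothetical toy data standing in for fishdata and mergingdata
fish <- data.frame(wpt = c(2L, 2L, 3L),
                   zone = c("DO", "DO", "LT"),
                   species = c("TP", "IP", "TP"))
times <- data.frame(wpt = c(2L, 2L, 3L, 3L),
                    zone = c("DO", "LT", "DO", "LT"),
                    start.time = c(10.34, 10.57, 10.00, 10.24))

# Column names present in both data frames: these are the default keys
default_keys <- intersect(names(fish), names(times))
print(default_keys)

# Omitting `by` and passing the shared names explicitly give the same result
merged_default  <- merge(fish, times)
merged_explicit <- merge(fish, times, by = default_keys)
print(identical(merged_default, merged_explicit))
```

Relying on the default is convenient, but it silently changes the join keys if another shared column name is added later, so naming the keys explicitly is usually the safer habit.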

However, you can also name the identifier columns explicitly with the by, by.x, and by.y arguments, like this:

merge<- merge(fishdata, mergingdata, by=c("wpt","zone"), all=TRUE, sort=FALSE)
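The difference between keying on one column and keying on both can be seen in a small sketch. The data below is invented for illustration (`fish` and `times` are hypothetical names); it reproduces the zone.y symptom from the question and shows that it goes away with a two-column key.

```r
# Hypothetical toy data: one site (wpt = 2) with two zones in the times table
fish <- data.frame(wpt = c(2L, 2L),
                   zone = c("DO", "DO"),
                   species = c("TP", "IP"))
times <- data.frame(wpt = c(2L, 2L),
                    zone = c("DO", "LT"),
                    start.time = c(10.34, 10.57))

# Keying on wpt alone: every fish row pairs with every times row for that site,
# and the non-key column `zone` is duplicated as zone.x / zone.y
by_wpt <- merge(fish, times, by = "wpt")
print(names(by_wpt))
print(nrow(by_wpt))   # 4 rows: 2 fish rows x 2 times rows

# Keying on the wpt/zone combination: each fish row matches only its own zone
by_both <- merge(fish, times, by = c("wpt", "zone"))
print(names(by_both))
print(nrow(by_both))  # 2 rows
```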

Edit

Looking at the edit to your post, I see that your data has the following structure:

fishdata <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "23.11.2014", class = "factor"), 
    entry = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "shore", class = "factor"), 
    wpt = c(2L, 2L, 2L, 2L, 2L, 2L), zone = structure(c(1L, 1L, 
    1L, 1L, 1L, 1L), .Label = "DO", class = "factor"), transect = c(1L, 
    1L, 1L, 1L, 1L, 1L), gps = c(NA, NA, NA, NA, NA, NA), surveyor = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = "ev", class = "factor"), depth_code = c(NA, 
    NA, NA, NA, NA, NA), phase = structure(c(2L, 2L, 1L, 1L, 
    1L, 1L), .Label = c("S_PRIN", "S_STOP"), class = "factor"), 
    species = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("IP", 
    "TP"), class = "factor"), family = c(NA, NA, NA, NA, NA, 
    NA)), .Names = c("date", "entry", "wpt", "zone", "transect", 
"gps", "surveyor", "depth_code", "phase", "species", "family"
), class = "data.frame", row.names = c(NA, -6L))

mergingdata <- structure(list(start.time = c(10.34, 10.57, 10, 10.24, 9.15, 
9.39), end.time = c(10.5, 11.1, 10.2, 10.4, 9.3, 9.5), wpt = c(2L, 
2L, 3L, 3L, 4L, 4L), zone = structure(c(1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("DO", "LT"), class = "factor")), .Names = c("start.time", 
"end.time", "wpt", "zone"), class = "data.frame", row.names = c(NA, 
-6L))

Assuming that structure is correct...

> fishdata
        date entry wpt zone transect gps surveyor depth_code  phase species family
1 23.11.2014 shore   2   DO        1  NA       ev         NA S_STOP      TP     NA
2 23.11.2014 shore   2   DO        1  NA       ev         NA S_STOP      IP     NA
3 23.11.2014 shore   2   DO        1  NA       ev         NA S_PRIN      TP     NA
4 23.11.2014 shore   2   DO        1  NA       ev         NA S_PRIN      TP     NA
5 23.11.2014 shore   2   DO        1  NA       ev         NA S_PRIN      IP     NA
6 23.11.2014 shore   2   DO        1  NA       ev         NA S_PRIN      IP     NA
> mergingdata
  start.time end.time wpt zone
1      10.34     10.5   2   DO
2      10.57     11.1   2   LT
3      10.00     10.2   3   DO
4      10.24     10.4   3   LT
5       9.15      9.3   4   DO
6       9.39      9.5   4   LT

I did the merge as follows:

> merge(x = fishdata, y = mergingdata, all.x = TRUE)
  wpt zone       date entry transect gps surveyor depth_code  phase species family start.time end.time
1   2   DO 23.11.2014 shore        1  NA       ev         NA S_STOP      TP     NA      10.34     10.5
2   2   DO 23.11.2014 shore        1  NA       ev         NA S_STOP      IP     NA      10.34     10.5
3   2   DO 23.11.2014 shore        1  NA       ev         NA S_PRIN      TP     NA      10.34     10.5
4   2   DO 23.11.2014 shore        1  NA       ev         NA S_PRIN      TP     NA      10.34     10.5
5   2   DO 23.11.2014 shore        1  NA       ev         NA S_PRIN      IP     NA      10.34     10.5
6   2   DO 23.11.2014 shore        1  NA       ev         NA S_PRIN      IP     NA      10.34     10.5

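Note that I use all.x = TRUE, because what we want is to keep every row of the x object (fishdata) and attach the extra columns of the y object (mergingdata) to it. Rows of mergingdata that have no match in fishdata are dropped, whereas all = TRUE would keep them as extra rows filled with NA. All of this is done using the columns common to both objects as the key.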
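As a rough illustration of that last point, the sketch below compares all.x = TRUE with all = TRUE on invented data (`fish` and `times` are hypothetical stand-ins for fishdata and mergingdata). It also adds a guard, which is my own assumption about what is worth checking, that each wpt/zone pair occurs only once in the lookup table, since duplicated keys there would multiply the observation rows.

```r
# Hypothetical toy data: observations at one site/zone, times for two sites
fish <- data.frame(wpt = c(2L, 2L, 2L),
                   zone = c("DO", "DO", "DO"),
                   species = c("TP", "IP", "TP"))
times <- data.frame(wpt = c(2L, 2L, 3L, 3L),
                    zone = c("DO", "LT", "DO", "LT"),
                    start.time = c(10.34, 10.57, 10.00, 10.24),
                    end.time = c(10.5, 11.1, 10.2, 10.4))

keys <- c("wpt", "zone")

# Guard: the lookup table must have one row per wpt/zone combination
stopifnot(anyDuplicated(times[keys]) == 0)

# Left join: keeps exactly the rows of `fish`, adding start.time and end.time
left <- merge(fish, times, by = keys, all.x = TRUE)
print(nrow(left))   # 3

# Full join: also keeps the unmatched rows of `times`, with NA in fish columns
full <- merge(fish, times, by = keys, all = TRUE)
print(nrow(full))   # 6
print(sum(is.na(full$species)))  # 3 rows came only from `times`
```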