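I am trying to add two variables to my dataset from another dataset of a different length. I have a coral reef survey dataset that is missing the start and end time of each dive for each site and survey zone.
In addition, I have a table that contains the start and end time of each dive for each site and zone:
This table repeats wpt (the site) because 2 zones are surveyed at each site, which means that each row in this table should be unique as a wpt/zone combination. In my own dataset wpt repeats more often, because I have several observations within the same site and zone. I need to match the unique rows of mergingdata and merge them into my fishdata, so that each observation gets the start and end time from mergingdata. So I want to match and merge by "wpt" and by "zone".
This is what I tried: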
merge<- merge(fishdata, mergingdata, by="wpt", all=TRUE, sort=FALSE)
But this merges by wpt only, and my output gets an extra column called zone.y. Is there a way to merge by the unique combination of the 2 variables "wpt" and "zone"?
Thanks!
Answer 0 (score: 2)
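The documentation for merge, help(merge), says:
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
Since both data.frames contain the two id columns (wpt and zone), merge will use these common columns to combine the data. So omitting the by argument in your code should work, as long as wpt and zone are the only column names the two data.frames share.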
merged <- merge(fishdata, mergingdata, all=TRUE, sort=FALSE)
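Before relying on the default, it can help to check which column names the two data frames actually share, since merge() uses all of them as keys when no by is given. A minimal sketch, using small made-up stand-ins for fishdata and mergingdata (the values are illustrative only, not your real data):

```r
# Toy stand-ins for the two data frames (illustrative values only)
fishdata <- data.frame(
  wpt     = c(2L, 2L, 2L),
  zone    = c("DO", "DO", "LT"),
  species = c("TP", "IP", "TP"),
  stringsAsFactors = FALSE
)
mergingdata <- data.frame(
  start.time = c(10.34, 10.57),
  end.time   = c(10.5, 11.1),
  wpt        = c(2L, 2L),
  zone       = c("DO", "LT"),
  stringsAsFactors = FALSE
)

# Column names present in both data frames; these become the merge keys
common_cols <- intersect(names(fishdata), names(mergingdata))
print(common_cols)

# With no `by`, merge() joins on exactly those shared columns
merged_default <- merge(fishdata, mergingdata, all = TRUE, sort = FALSE)
print(merged_default)
```

If common_cols contains anything other than the intended keys (for example a shared date column), passing by explicitly is the safer choice.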
However, you can also specify the identifier columns explicitly with the by, by.x and by.y arguments, like this:
merged <- merge(fishdata, mergingdata, by=c("wpt","zone"), all=TRUE, sort=FALSE)
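The by.x and by.y arguments matter when the key columns are named differently in the two tables. That is not the case in your data, so the following is only a sketch under that assumption: divetimes and its site and area columns are hypothetical names made up for illustration.

```r
# Toy observation table (illustrative values only)
fishdata <- data.frame(
  wpt     = c(2L, 2L, 3L),
  zone    = c("DO", "DO", "LT"),
  species = c("TP", "IP", "TP"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table whose key columns carry different names
divetimes <- data.frame(
  site       = c(2L, 2L, 3L, 3L),
  area       = c("DO", "LT", "DO", "LT"),
  start.time = c(10.34, 10.57, 10, 10.24),
  end.time   = c(10.5, 11.1, 10.2, 10.4),
  stringsAsFactors = FALSE
)

# Pair the keys positionally: wpt <-> site, zone <-> area
merged <- merge(fishdata, divetimes,
                by.x = c("wpt", "zone"),
                by.y = c("site", "area"),
                all.x = TRUE, sort = FALSE)
print(merged)
```

The result keeps the key names from the first table (wpt and zone), so no duplicate site or area columns appear.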
Edit
Looking at the edit to your post, I see that your data has the following structure:
fishdata <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "23.11.2014", class = "factor"),
entry = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "shore", class = "factor"),
wpt = c(2L, 2L, 2L, 2L, 2L, 2L), zone = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = "DO", class = "factor"), transect = c(1L,
1L, 1L, 1L, 1L, 1L), gps = c(NA, NA, NA, NA, NA, NA), surveyor = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ev", class = "factor"), depth_code = c(NA,
NA, NA, NA, NA, NA), phase = structure(c(2L, 2L, 1L, 1L,
1L, 1L), .Label = c("S_PRIN", "S_STOP"), class = "factor"),
species = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("IP",
"TP"), class = "factor"), family = c(NA, NA, NA, NA, NA,
NA)), .Names = c("date", "entry", "wpt", "zone", "transect",
"gps", "surveyor", "depth_code", "phase", "species", "family"
), class = "data.frame", row.names = c(NA, -6L))
mergingdata <- structure(list(start.time = c(10.34, 10.57, 10, 10.24, 9.15,
9.39), end.time = c(10.5, 11.1, 10.2, 10.4, 9.3, 9.5), wpt = c(2L,
2L, 3L, 3L, 4L, 4L), zone = structure(c(1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("DO", "LT"), class = "factor")), .Names = c("start.time",
"end.time", "wpt", "zone"), class = "data.frame", row.names = c(NA,
-6L))
Assuming that dataset structure is right...
> fishdata
date entry wpt zone transect gps surveyor depth_code phase species family
1 23.11.2014 shore 2 DO 1 NA ev NA S_STOP TP NA
2 23.11.2014 shore 2 DO 1 NA ev NA S_STOP IP NA
3 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
4 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
5 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
6 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
> mergingdata
start.time end.time wpt zone
1 10.34 10.5 2 DO
2 10.57 11.1 2 LT
3 10.00 10.2 3 DO
4 10.24 10.4 3 LT
5 9.15 9.3 4 DO
6 9.39 9.5 4 LT
I performed the merge as follows:
> merge(x = fishdata, y = mergingdata, all.x = TRUE)
wpt zone date entry transect gps surveyor depth_code phase species family start.time end.time
1 2 DO 23.11.2014 shore 1 NA ev NA S_STOP TP NA 10.34 10.5
2 2 DO 23.11.2014 shore 1 NA ev NA S_STOP IP NA 10.34 10.5
3 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
4 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
5 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
6 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
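Note that I use all.x = TRUE because what we want is to keep all the rows of the x object (fishdata) and attach the extra columns of the y object (mergingdata) to them, using the columns common to both objects as the index.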
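To make the difference between all.x = TRUE and all = TRUE concrete, here is a small sketch. It reuses the mergingdata values from your post together with a cut-down, three-row stand-in for fishdata (the reduced table is my own simplification), and first checks that each wpt/zone pair is unique in the lookup table:

```r
# Cut-down stand-in for fishdata: three observations at wpt 2, zone DO
fishdata <- data.frame(
  wpt     = rep(2L, 3),
  zone    = rep("DO", 3),
  species = c("TP", "IP", "TP"),
  stringsAsFactors = FALSE
)
mergingdata <- data.frame(
  start.time = c(10.34, 10.57, 10, 10.24, 9.15, 9.39),
  end.time   = c(10.5, 11.1, 10.2, 10.4, 9.3, 9.5),
  wpt        = c(2L, 2L, 3L, 3L, 4L, 4L),
  zone       = c("DO", "LT", "DO", "LT", "DO", "LT"),
  stringsAsFactors = FALSE
)

# Sanity check: every wpt/zone pair must appear only once in the lookup
# table, otherwise matching observations would be duplicated by the merge
stopifnot(anyDuplicated(mergingdata[, c("wpt", "zone")]) == 0)

# all.x = TRUE keeps every observation and nothing else
left_join <- merge(fishdata, mergingdata, by = c("wpt", "zone"), all.x = TRUE)

# all = TRUE additionally keeps lookup rows that have no observation
full_join <- merge(fishdata, mergingdata, by = c("wpt", "zone"), all = TRUE)

print(nrow(left_join))  # 3: one row per observation
print(nrow(full_join))  # 8: 3 observations + 5 unmatched wpt/zone pairs
```

In full_join the five unmatched wpt/zone pairs show up with NA in species, which is usually not what you want when the goal is only to add dive times to existing observations.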