Question

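I have two daily time series covering 20 stations, each identified by an id column, and I am trying to merge the two data frames in R. The column names of the data frames are as follows.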

names(data1)
"id"     "year"   "yday"   "date"   "t_1"  "t_2"  "t_3" "r1"   "s1"   "p1" "s2"
names(data2)
"id"     "year"   "yday"   "date"   "t_1"  "t_2"  "t_3" "r1"   "s1"   "p1"

I tried merging them with the following code:

newdata<- merge(data1,data2, all=TRUE)

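This is only a partial solution. For some ids, the dates overlap between the two data frames, and for the same date data1 has NA where data2 has no missing values. I want to merge them so that the duplicated dates for each id are removed, the values available in either copy are kept, and the column names of data1 are retained. For example, this is how it is merged now: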

id      year  yday  date        t_1   t_2    t_3   r1  s1  p1  s2
AA1111  2007  3     03/01/2007  -5.3  -11.6  -8.5  0   0   0   NA
AA1111  2007  3     03/01/2007  -5.3  -11.6  NA    NA  NA  0   32

What I want is:

id      year  yday  date        t_1   t_2    t_3   r1  s1  p1  s2
AA1111  2007  3     03/01/2007  -5.3  -11.6  -8.5  0   0   0   32

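The code above does not give me a satisfactory result. Any guidance on how to get the result I want would be much appreciated (I am still learning R).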

Answer 1

In the absence of representative data, here is an approach based on sample data that I created.

# sample merged data (hope it represents your data fully)
DF = structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("AA1111", 
"AA1112"), class = "factor"), year = c(2007L, 2007L, 2008L, 2008L, 
2008L), yday = c(3L, 3L, 3L, 3L, 3L), date = structure(c(1L, 
1L, 1L, 1L, 1L), .Label = "03/01/2007", class = "factor"), t_1 = c(-5.3, 
-5.3, -10.3, -10.3, NA), t_2 = c(-11.6, -11.6, -11.6, NA, -11.6
), t_3 = c(-8.5, NA, -8.5, -8.5, -8.5), r1 = c(0L, NA, 0L, 0L, 
0L), s1 = c(0L, NA, 0L, 0L, 0L), p1 = c(0L, 0L, 0L, NA, 0L), 
    s2 = c(NA, 32L, NA, 42L, NA)), .Names = c("id", "year", "yday", 
"date", "t_1", "t_2", "t_3", "r1", "s1", "p1", "s2"), class = "data.frame", row.names = c(NA, 
-5L))

#       id year yday       date   t_1   t_2  t_3 r1 s1 p1 s2
# 1 AA1111 2007    3 03/01/2007  -5.3 -11.6 -8.5  0  0  0 NA
# 2 AA1111 2007    3 03/01/2007  -5.3 -11.6   NA NA NA  0 32
# 3 AA1112 2008    3 03/01/2007 -10.3 -11.6 -8.5  0  0  0 NA
# 4 AA1112 2008    3 03/01/2007 -10.3    NA -8.5  0  0 NA 42
# 5 AA1112 2008    3 03/01/2007    NA -11.6 -8.5  0  0  0 NA

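library(data.table)
setDT(DF) # convert to data table (by reference, no copy)
# For each id/year/yday/date group, overwrite the value columns (5 to 11)
# with their non-missing maximum, then drop the now-identical duplicate rows.
# Note that := modifies DF itself in place.
DF_new <- DF[, names(DF)[5:11] := lapply(.SD, max, na.rm=TRUE),
             by=list(id,year,yday,date), .SDcols=5:11][,unique(.SD)]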
DF_new
#        id year yday       date   t_1   t_2  t_3 r1 s1 p1 s2
# 1: AA1111 2007    3 03/01/2007  -5.3 -11.6 -8.5  0  0  0 32
# 2: AA1112 2008    3 03/01/2007 -10.3 -11.6 -8.5  0  0  0 42

setDF(DF_new) # convert back to data frame
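
One caveat about using max here: it only returns the intended value when the duplicated rows agree wherever both are non-missing, and max(x, na.rm = TRUE) returns -Inf with a warning for a group that is entirely NA. Below is a minimal sketch of a variant that takes the first non-missing value per group instead. The helper first_non_na is a hypothetical name introduced for illustration, and the toy table is made up rather than taken from your data.

```r
library(data.table)

# Toy data shaped like the merged result (illustrative values only)
DT <- data.table(
  id   = c("AA1111", "AA1111", "AA1112"),
  date = c("03/01/2007", "03/01/2007", "03/01/2007"),
  t_3  = c(-8.5, NA, NA),
  s2   = c(NA, 32L, NA)
)

# Hypothetical helper: first non-missing value in x,
# or a typed NA if every value in the group is missing
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0L) keep[1L] else x[1L]
}

# Aggregate instead of updating in place: one row per id/date group
collapsed <- DT[, lapply(.SD, first_non_na),
                by = .(id, date), .SDcols = c("t_3", "s2")]
collapsed
```

Because this form aggregates rather than assigning with :=, DT is left untouched and no unique() step is needed afterwards. If duplicated rows can disagree on a non-missing value, you would still need to decide which source should win.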

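.SD stands for Subset of Data and holds the data for each group specified in by. It is itself a data.table. The .SDcols argument specifies which columns .SD should contain. The syntax LHS := RHS evaluates the expression in RHS (here, looping over .SD, which holds the columns specified in .SDcols, and computing max for each) and updates the columns named in LHS by reference (in place).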
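
To make these pieces concrete, here is a small sketch on a made-up table (the names dt, g, x, y and n_per_group are assumptions for illustration only). It shows what .SD contains per group and that := changes the table without any assignment.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"),
                 x = c(1, NA, 3),
                 y = c(NA, 5, 6))

# Inside j, .SD is the per-group subset restricted to the columns in .SDcols
n_per_group <- dt[, .(rows = nrow(.SD), cols = ncol(.SD)),
                  by = g, .SDcols = c("x", "y")]

# LHS := RHS overwrites x and y within each group, by reference:
# dt itself is modified even though the result is not assigned to anything
dt[, c("x", "y") := lapply(.SD, max, na.rm = TRUE),
   by = g, .SDcols = c("x", "y")]

# Rows within a group are now identical, so unique() collapses them
unique(dt)
```

After the := step dt still has three rows, which is why the answer's code finishes with unique(.SD) to keep one row per group.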
