Calculating on subsets of data without first saving the subset as a new data.frame

Time: 2014-04-02 02:33:27

Tags: r dataframe data.table subset plyr

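I have two data.frames that I am using to create a new variable, C (a standardized distance measure). Each data.frame holds the following information: coordinates, season, and variables. For each unique coordinate-season combination in df.a (i.e., each XX, YY), I want to calculate C against every X, Y pair in df.b, by season. To do this, I merged the two data.frames into df.new in preparation for calculating C.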

Here is how I am currently doing this:

# for example, for season = SUM
# V1 and VV1 are the same variable from the different dataframes, SEA = Season, 
# X,Y and XX, YY are coordinates 
df.new.SUM <- subset(df.new, SEA == "SUM") # Summer
attach(df.new.SUM)
df.new.SUM$C_V1 <- (V1-VV1)^2/sd(V1)^2 # almost wouldn't need to subset except that the denominator here should only be for one season
df.new.SUM$C_V2 <- (V2-VV2)^2/sd(V2)^2
df.new.SUM$C <- sqrt(rowSums(df.new.SUM[,c("C_V1","C_V2")]))
# continue for other seasons and then rbind  

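However, this feels clunky. Is there a way to calculate C for each season-coordinate group without subsetting into a new data.frame and then rbinding the seasons back together? How can I work on just one season without subsetting into a new data.frame? Or, better yet, how can I do this for every season in a vectorized way? Which packages should I explore?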

df.a <- structure(list(XX = c(10L, 10L, 11L, 11L, 12L, 12L, 13L, 13L, 
14L, 14L), YY = c(20L, 20L, 21L, 21L, 22L, 22L, 23L, 23L, 15L, 
15L), SEA = c("SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", 
"WIN", "SUM", "WIN"), VV1 = c(10.5, 15, 8, 8.5, 8, 7.5, 11, 13, 
15, 10), VV2 = c(13, 3, 3.5, 6, 3.5, 3, 5, 4, 5, 5)), .Names = c("XX", 
"YY", "SEA", "VV1", "VV2"), row.names = c(NA, -10L), class = "data.frame")
#
df.b <- structure(list(X = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Y = c(1L, 1L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L
), SEA = c("SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", "WIN", 
"SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", 
"WIN", "SUM", "WIN"), V1 = c(10, 12, 10, 9.5, 10, 14.5, 10.5, 
13, 11.5, 14, 12.5, 8.5, 10, 7.5, 11, 7, 11, 8, 11, 14.5), V2 = c(3.5, 
3, 3.5, 2.5, 3, 5, 5.5, 4, 2, 2.5, 3.5, 2, 3.5, 4.5, 5.5, 3.5, 
5, 6, 6, 5)), .Names = c("X", "Y", "SEA", "V1", "V2"), row.names = c(NA, 
-20L), class = "data.frame")
#
df.new <- merge(df.a, df.b, by = c("SEA"), all.x = TRUE, allow.cartesian=TRUE)
#
# EDIT ## solution based on suggestions below
df.out <- data.frame()
seasons <- unique(df.new$SEA)
for (s in seasons){
  data <- subset(df.new, SEA == s)
  data$C <- sqrt(with(data, (V1-VV1)^2/sd(V1)^2 +(V2-VV2)^2/sd(V2)^2 ))
  df.out <- rbind(df.out,data)

}

1 Answer:

Answer 0 (score: 1)

Wrap those steps together, and please don't use attach in the future:

df.new.SUM$C <- sqrt( with(df.new.SUM, (V1-VV1)^2/sd(V1)^2 +(V2-VV2)^2/sd(V2)^2 ) )

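The with function is much safer. However, this may not be what you want. In the cross product produced by merge, the merged dataset contains 50 "combinations" with SEA == "SUM", and those are not what your English description specifies.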
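One way to handle every season at once, without subsetting or rbinding, might be base R's ave(), which returns a group-wise statistic aligned to the original rows. The following is only a sketch: it rebuilds the question's data with data.frame() for brevity, keeps the question's definition of C (including the per-season sd taken over the merged rows), and introduces the helper names sd_V1 and sd_V2, which are not in the original post. It does not settle whether the merged cross product is the comparison actually intended.

```r
# Example data copied from the question (rebuilt with data.frame() for brevity).
df.a <- data.frame(
  XX  = rep(10:14, each = 2),
  YY  = rep(c(20L, 21L, 22L, 23L, 15L), each = 2),
  SEA = rep(c("SUM", "WIN"), times = 5),
  VV1 = c(10.5, 15, 8, 8.5, 8, 7.5, 11, 13, 15, 10),
  VV2 = c(13, 3, 3.5, 6, 3.5, 3, 5, 4, 5, 5),
  stringsAsFactors = FALSE
)
df.b <- data.frame(
  X   = rep(1:2, each = 10),
  Y   = rep(rep(1:5, each = 2), times = 2),
  SEA = rep(c("SUM", "WIN"), times = 10),
  V1  = c(10, 12, 10, 9.5, 10, 14.5, 10.5, 13, 11.5, 14,
          12.5, 8.5, 10, 7.5, 11, 7, 11, 8, 11, 14.5),
  V2  = c(3.5, 3, 3.5, 2.5, 3, 5, 5.5, 4, 2, 2.5,
          3.5, 2, 3.5, 4.5, 5.5, 3.5, 5, 6, 6, 5),
  stringsAsFactors = FALSE
)

# Merging on SEA alone pairs every df.a row with every df.b row of that season.
df.new <- merge(df.a, df.b, by = "SEA")
table(df.new$SEA)  # 5 df.a rows x 10 df.b rows = 50 rows per season

# ave() applies sd within each season and returns one value per row,
# so the denominators line up with df.new without any subsetting.
# sd_V1 and sd_V2 are helper names introduced for this sketch.
sd_V1 <- ave(df.new$V1, df.new$SEA, FUN = sd)
sd_V2 <- ave(df.new$V2, df.new$SEA, FUN = sd)

# Same formula as in the question, evaluated for all seasons in one pass.
df.new$C <- with(df.new,
                 sqrt((V1 - VV1)^2 / sd_V1^2 + (V2 - VV2)^2 / sd_V2^2))
```

The tagged packages (data.table, plyr) offer grouped operations that do the same job; ave() is used here only because it needs nothing beyond base R.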