Using data.table to extract the value from the column whose name matches a value, in a loop over multiple data.frames

Time: 2016-08-16 16:50:22

Tags: r data.table

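I am working with GPS locations that each occur in a given time period, "dateperiod". For each row, I want to take the value in dateperiod, look at the column named after that dateperiod, and extract that row's value (the distance to disturbance). I am also doing this in a loop over multiple disturbance data frames. Dummy datasets: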

Example basic data (data_basic_DT):

structure(list(EndId = 1:9, dateperiod = c(141101L, 141101L, 
141101L, 141101L, 141101L, 141101L, 141101L, 141101L, 141101L
)), .Names = c("EndId", "dateperiod"), row.names = c(NA, -9L), class = "data.frame")

Example disturbance data 1 (low_roads):

structure(list(EndId = 1:9, dateperiod = c(141101L, 141101L, 
141101L, 141101L, 141101L, 141101L, 141101L, 141101L, 141101L
), `151101` = c(710.211, 684.471, 676.831, 762.955, 704.06, 674.685, 
682.495, 686.586, 696.348), `150501` = c(710.211, 684.471, 676.831, 
762.955, 704.06, 674.685, 682.495, 686.586, 696.348), `141101` = c(710.211, 
684.471, 676.831, 762.955, 704.06, 674.685, 682.495, 686.586, 
696.348), `140501` = c(710.211, 684.471, 676.831, 762.955, 704.06, 
674.685, 682.495, 686.586, 696.348), `131101` = c(710.211, 684.471, 
676.831, 762.955, 704.06, 674.685, 682.495, 686.586, 696.348), 
    `130501` = c(710.211, 684.471, 676.831, 762.955, 704.06, 
    674.685, 682.495, 686.586, 696.348), `121101` = c(710.211, 
    684.471, 676.831, 762.955, 704.06, 674.685, 682.495, 686.586, 
    696.348)), .Names = c("EndId", "dateperiod", "151101", "150501", 
"141101", "140501", "131101", "130501", "121101"), row.names = c(NA, 
-9L), class = "data.frame")

Example disturbance data 2 (high_roads):

structure(list(EndId = 1:9, dateperiod = c(141101L, 141101L, 
141101L, 141101L, 141101L, 141101L, 141101L, 141101L, 141101L
), `151101` = c(806.415, 802.56, 502.35, 1234.2, 704.06, 685.23, 
682.495, 1002.3, 696.348), `150501` = c(710.211, 684.471, 676.831, 
762.955, 704.06, 802.56, 502.35, 1234.2, 696.348), `141101` = c(710.211, 
130.25, 453.25, 762.955, 704.06, 674.685, 682.495, 686.586, 696.348
), `140501` = c(710.211, 684.471, 802.56, 502.35, 1234.2, 674.685, 
682.495, 686.586, 696.348), `131101` = c(710.211, 684.471, 676.831, 
762.955, 704.06, 674.685, 502.35, 1234.2, 704.06), `130501` = c(710.211, 
684.471, 676.831, 762.955, 704.06, 674.685, 682.495, 686.586, 
696.348), `121101` = c(502.35, 1234.2, 704.06, 762.955, 704.06, 
674.685, 682.495, 686.586, 696.348)), .Names = c("EndId", "dateperiod", 
"151101", "150501", "141101", "140501", "131101", "130501", "121101"
), row.names = c(NA, -9L), class = c("data.table", "data.frame"))

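So for each EndId, I want it to look at dateperiod, see that in this example it is 141101, look at the column "141101", extract the value, and put it in a new column. This should happen in a loop that goes through low_roads and high_roads.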

Thanks to some help (below), I have this running much faster than before, using this:

library(fastmatch)  # provides fmatch()

disturbancelist <- list(low_roads = low_roads, high_roads = high_roads)  # lists all the disturbance data frames
for (d in disturbancelist) {
  ## Get the name of the current disturbance class
  Class <- d$Class[2]
  ## Merge the basic data with each disturbance data frame to get the right distance values
  mergeex <- merge(data_basic_DT, d, by.x = "EndId", by.y = "EndId", all.y = FALSE)
  mergeexdf <- as.data.frame(mergeex)
  col.names <- names(mergeexdf)
  mergeexdf$distance <- mergeexdf[cbind(1:nrow(mergeexdf), fmatch(mergeexdf$dateperiod, col.names))]
  ## Change the name of the column to the current disturbance class
  names(data_basic_DT)[names(data_basic_DT) == "distance"] <- Class
  print(Class)
}

Now I would like to change this code to work on data.tables so that it runs faster. It works on data.tables outside the loop, but not inside it. Any help is appreciated!

1 answer:

Answer 0: (score: 0)

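If I understand you correctly, this sounds like a question I answered before: R data.frame get value from variable which is selected by another variable, vectorized. Although that question was about data.frames in general, I think it is still a good solution for data.table. Edit: maybe not, based on the responses, but it does work well on data.frames at least.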

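The idea is to use match and the names attribute to get the numeric column index for each row, and then use that index to pull out the values. For a data frame named df, something like this:
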
df$newvar <- df[cbind(1:nrow(df), match(df$dateperiod, names(df)))]

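The first column of the index, 1:nrow(df), essentially takes the place of a for loop over the rows. The second column, match(df$dateperiod, names(df)), identifies for each row the column whose name matches the value held in dateperiod. This works because match operates on the whole column vector df$dateperiod and returns a vector of the same length.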
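To make the indexing concrete, here is a minimal sketch on a made-up three-row data frame. The name toy and all of its values are illustrative assumptions, not data from the question.

```r
# Toy data (hypothetical): each row's dateperiod names the column to read from.
toy <- data.frame(
  id         = 1:3,
  dateperiod = c(141101L, 150501L, 141101L),
  `141101`   = c(10, 20, 30),
  `150501`   = c(11, 21, 31),
  check.names = FALSE  # keep the numeric-looking column names unchanged
)

# Column position for each row. match() compares dateperiod with the
# character column names, so 141101 finds the column named "141101".
col_idx <- match(toy$dateperiod, names(toy))

# A two-column (row, column) matrix selects exactly one cell per row.
toy$newvar <- toy[cbind(seq_len(nrow(toy)), col_idx)]
print(toy$newvar)  # 10 21 30
```

One caveat: indexing a data frame with a matrix first converts it with as.matrix(), so if any column is character the extracted values come back as character and would need as.numeric().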

Hope that helps.
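For the loop in the question, one possible way to apply the same lookup to several disturbance tables is to iterate over the names of the list and use each name as the new column name. The sketch below is plain base R rather than data.table, and it assumes every EndId in the base table also appears in each disturbance table. The helper lookup_by_period and the small tables base, low and high are hypothetical stand-ins for the question's data.

```r
# Hypothetical helper: for each row of `base`, read the cell of `dist` in the
# row with the same EndId and the column named by that row's dateperiod.
lookup_by_period <- function(base, dist) {
  dist <- as.data.frame(dist)  # matrix indexing needs a plain data.frame
  row_idx <- match(base$EndId, dist$EndId)
  col_idx <- match(base$dateperiod, names(dist))
  dist[cbind(row_idx, col_idx)]
}

# Made-up stand-ins for data_basic_DT, low_roads and high_roads.
base <- data.frame(EndId = 1:3, dateperiod = c(141101L, 150501L, 141101L))
low  <- data.frame(EndId = 1:3, `141101` = c(10, 20, 30),
                   `150501` = c(11, 21, 31), check.names = FALSE)
high <- data.frame(EndId = 3:1, `141101` = c(300, 200, 100),
                   `150501` = c(310, 210, 110), check.names = FALSE)

disturbances <- list(low_roads = low, high_roads = high)

# Looping over the names gives direct access to the label for the new column.
for (nm in names(disturbances)) {
  base[[nm]] <- lookup_by_period(base, disturbances[[nm]])
}
print(base)
```

Matching on EndId directly avoids the merge step, and with it the duplicated dateperiod.x and dateperiod.y columns that merge creates when both tables carry a dateperiod column.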