I have two files. The first file has three columns: Site_ID, Time, and ClusterNo.
The second file has four columns: SiteA_ID, SiteB_ID, Time, and ClusterNo.
file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace =TRUE), "Time" = runif(1000)*100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace =TRUE),"SiteB_ID" = sample(74000:74500, 1000, replace =TRUE), "Time" = runif(1000)*100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
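We need to find which clusters of file1 and file2 map to each other, such that the Site_ID in file1 matches a site (A or B) in file2 and the Time in file1 differs from the Time in file2 by no more than 2 units.
The required output is a file with three columns: ClusterNoOfFile1, ClusterNoOfFile2, and CommonSite.
[Note: CommonSite is the site shared by file1 and file2 on which the clusters are mapped.]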
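Answer 0 (score: 1)
Here is one way to do something along the lines of what you want (it is not clear to me what output you expect for a given input). You can modify it to suit your specific needs.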
library(dplyr)
library(tidyr)
# Generate the data (your code)
file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace =TRUE), "Time" = runif(1000)*100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace =TRUE),"SiteB_ID" = sample(74000:74500, 1000, replace =TRUE), "Time" = runif(1000)*100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
# Convert file2 to long format so there is only one site id
file2Long <- gather(file2, Site_Type, Site_ID, -Time, -ClusterNo.)
# Inner join with file1 so you retain all rows with matching site id.
file12 <- inner_join(file1, file2Long, by = 'Site_ID')
# Compute time difference and store whether it is within range
file12$TimeDiff2 <- abs(file12$Time.x - file12$Time.y) <= 2
# Filter the ones that meet the threshold criteria of 2, and retain only
# columns of interest.
file12Diff2 <- filter(file12, TimeDiff2 == TRUE)
file12Diff2 <- select(file12Diff2, ClusterNo..x, ClusterNo..y, Site_ID)
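The output will look like this (.x means file1 and .y means file2; you can rename these columns as needed). Because the data are generated randomly without a seed, your values will differ: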
  ClusterNo..x ClusterNo..y Site_ID
1          400           96   74308
2          298          438   74027
3          397          137   74265
4          420          286   74395
5          280           77   74097
6          176          333   74303
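
If you want the column names asked for in the question (ClusterNoOfFile1, ClusterNoOfFile2, CommonSite), the same approach can be sketched as below. This is only a sketch under some assumptions: it uses `pivot_longer()` (the newer tidyr replacement for `gather()`), the small input values are made up so the result is easy to check by hand, and `max_diff` and `file2_long` are names I introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Tiny hand-made inputs (hypothetical values, not from the question)
file1 <- data.frame(Site_ID    = c(74001, 74002, 74003),
                    Time       = c(10, 50, 90),
                    ClusterNo. = c(1, 2, 3))
file2 <- data.frame(SiteA_ID   = c(74001, 74002, 74009),
                    SiteB_ID   = c(74005, 74006, 74003),
                    Time       = c(11.5, 60, 88.5),
                    ClusterNo. = c(7, 8, 9))

# Maximum allowed time difference (assumed name; the question says 2 units)
max_diff <- 2

# Reshape file2 so SiteA_ID and SiteB_ID end up in a single Site_ID column
file2_long <- pivot_longer(file2,
                           cols      = c(SiteA_ID, SiteB_ID),
                           names_to  = "Site_Type",
                           values_to = "Site_ID")

# Join on the site, keep pairs within the time window, rename the columns
result <- inner_join(file1, file2_long, by = "Site_ID",
                     suffix = c("_file1", "_file2")) %>%
  filter(abs(Time_file1 - Time_file2) <= max_diff) %>%
  transmute(ClusterNoOfFile1 = ClusterNo._file1,
            ClusterNoOfFile2 = ClusterNo._file2,
            CommonSite       = Site_ID) %>%
  distinct()

print(result)
```

With these inputs, sites 74001 and 74003 fall within the 2-unit window, while 74002 does not (50 versus 60). The `distinct()` call drops repeated cluster/site combinations; remove it if you want one row per matching pair of records.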