Question

I have two files. The first file has three columns: Site_ID, Time, and ClusterNo.

The second file has four columns: SiteA_ID, SiteB_ID, Time, and ClusterNo.

file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace = TRUE), "SiteB_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))

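We need to find which clusters (from file1 and file2) map to each other. Two clusters map when the Site_ID in file1 matches either site (A or B) in file2, and the Time in file1 differs from the Time in file2 by no more than 2 units.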

The desired output is a file with three columns: ClusterNoOfFile1, ClusterNoOfFile2, and CommonSite.

[Note: CommonSite is the site shared by file1 and file2 on which the clusters are mapped.]

Answer 1

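Here is one way to get something along the lines of what you want (I am not clear on exactly what output you expect for a given input). You can modify it to suit your specific needs.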

library(dplyr)
library(tidyr)

# Generate the data (your code)
file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace = TRUE), "SiteB_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))

# Convert file2 to long format so there is only one site id
file2Long <- gather(file2, Site_Type, Site_ID, -Time, -ClusterNo.)

# Inner join with file1 so you retain all rows with matching site id.
file12 <- inner_join(file1, file2Long, by = 'Site_ID')

# Compute time difference and store whether it is within range
file12$TimeDiff2 <- abs(file12$Time.x - file12$Time.y) <= 2

# Filter the ones that meet the threshold criteria of 2, and retain only
# columns of interest.
file12Diff2 <- filter(file12, TimeDiff2 == TRUE)
file12Diff2 <- select(file12Diff2, ClusterNo..x, ClusterNo..y, Site_ID)

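The output will look something like this (.x means file1 and .y means file2; you can rename these to whatever you need). Because the data are randomly generated, your values will differ: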

  ClusterNo..x ClusterNo..y Site_ID
1          400           96   74308
2          298          438   74027
3          397          137   74265
4          420          286   74395
5          280           77   74097
6          176          333   74303
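
To make the logic easier to check by hand, here is a minimal sketch of the same approach on a tiny hand-made data set, producing the column names asked for in the question. The input values are made up for illustration, and pivot_longer() (available in tidyr 1.0.0 and later) is used here in place of gather(); neither is part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy inputs, small enough to verify by eye
file1 <- data.frame(Site_ID = c(1, 2, 3),
                    Time = c(10, 20, 30),
                    ClusterNo. = c(101, 102, 103))
file2 <- data.frame(SiteA_ID = c(1, 9, 3),
                    SiteB_ID = c(8, 2, 7),
                    Time = c(11, 25, 30.5),
                    ClusterNo. = c(201, 202, 203))

# Stack SiteA_ID and SiteB_ID into a single Site_ID column
file2_long <- pivot_longer(file2,
                           cols = c(SiteA_ID, SiteB_ID),
                           names_to = "Site_Type",
                           values_to = "Site_ID")

result <- file1 %>%
  # Keep every file1/file2 pair that shares a site
  inner_join(file2_long, by = "Site_ID", suffix = c("_file1", "_file2")) %>%
  # Keep pairs whose times differ by at most 2 units
  filter(abs(Time_file1 - Time_file2) <= 2) %>%
  # Rename to the column names requested in the question
  transmute(ClusterNoOfFile1 = ClusterNo._file1,
            ClusterNoOfFile2 = ClusterNo._file2,
            CommonSite = Site_ID) %>%
  # Drop repeated mappings, if any
  distinct()

print(result)

# To save the result (the file name is a placeholder):
# write.csv(result, "cluster_map.csv", row.names = FALSE)
```

With these inputs only sites 1 and 3 satisfy both conditions; site 2 matches on SiteB_ID but the times differ by 5 units, so it is dropped. On the real data, where a site can appear many times in both files, recent dplyr versions (1.1.0 and later) may warn about a many-to-many join. That is expected here, since every matching pair is wanted.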
