Mapping clusters between two files based on two columns

Time: 2016-01-12 10:15:46

Tags: r data-manipulation

I have two files. The first file has three columns: Site_ID, Time, and ClusterNo.

The second file has four columns: SiteA_ID, SiteB_ID, Time, and ClusterNo.

file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace = TRUE), "SiteB_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))

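We need to find which clusters (from file1 and file2) map to each other. A cluster in file1 maps to a cluster in file2 when file1's Site_ID matches one of file2's sites (A or B) and the Time values in file1 and file2 differ by no more than 2 units.

The desired output is a file with three columns: ClusterNoOfFile1, ClusterNoOfFile2, and CommonSite.

[Note: CommonSite is the site shared by file1 and file2 on which the clusters are mapped.]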

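1 Answer:

Answer 0: (score: 1)

Here is one way to do something along the lines of what you want (it is not clear to me exactly what output you expect for a given input). You can modify it to fit your specific needs.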

library(dplyr)
library(tidyr)

# Generate the data (your code)
file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace = TRUE), "SiteB_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))

# Convert file2 to long format so there is only one site id
file2Long <- gather(file2, Site_Type, Site_ID, -Time, -ClusterNo.)

# Inner join with file1 so you retain all rows with matching site id.
file12 <- inner_join(file1, file2Long, by = 'Site_ID')

# Compute time difference and store whether it is within range
file12$TimeDiff2 <- abs(file12$Time.x - file12$Time.y) <= 2

# Filter the ones that meet the threshold criteria of 2, and retain only
# columns of interest.
file12Diff2 <- filter(file12, TimeDiff2 == TRUE)
file12Diff2 <- select(file12Diff2, ClusterNo..x, ClusterNo..y, Site_ID)

The output will look like this (.x means file1 and .y means file2; you can rename these columns to whatever you need):

  ClusterNo..x ClusterNo..y Site_ID
1          400           96   74308
2          298          438   74027
3          397          137   74265
4          420          286   74395
5          280           77   74097
6          176          333   74303
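
Because the data above are random, the exact rows will differ on every run. As a rough sketch of the same approach on data small enough to check by hand, the example below uses made-up values (the site IDs, times, and cluster numbers are hypothetical, not from the question). It also swaps in tidyr's pivot_longer(), which supersedes gather() in newer tidyr versions, and renames the result columns to the ones the question asks for. The `_file1`/`_file2` suffixes and the call to distinct() are my own assumptions about what is convenient, not something the question specifies.

```r
library(dplyr)
library(tidyr)

# Tiny hand-made inputs (hypothetical values, chosen so the result is easy to verify)
file1 <- data.frame(Site_ID = c(74001, 74002, 74003),
                    Time = c(10.0, 50.0, 80.0),
                    ClusterNo. = c(1, 2, 3))
file2 <- data.frame(SiteA_ID = c(74001, 74009, 74003),
                    SiteB_ID = c(74008, 74002, 74007),
                    Time = c(11.5, 49.0, 90.0),
                    ClusterNo. = c(7, 8, 9))

# Reshape file2 so SiteA_ID and SiteB_ID share a single Site_ID column
file2_long <- pivot_longer(file2,
                           cols = c(SiteA_ID, SiteB_ID),
                           names_to = "Site_Type",
                           values_to = "Site_ID")

matched <- file1 %>%
  # Keep only rows whose site appears in both files
  inner_join(file2_long, by = "Site_ID", suffix = c("_file1", "_file2")) %>%
  # Keep pairs whose times differ by at most 2 units
  filter(abs(Time_file1 - Time_file2) <= 2) %>%
  # Rename to the column names requested in the question
  transmute(ClusterNoOfFile1 = ClusterNo._file1,
            ClusterNoOfFile2 = ClusterNo._file2,
            CommonSite = Site_ID) %>%
  # Drop repeated cluster/site combinations (an assumption about the desired output)
  distinct()

print(matched)
```

Here site 74001 matches through SiteA_ID (times 10.0 and 11.5) and site 74002 matches through SiteB_ID (times 50.0 and 49.0), while site 74003 is dropped because its times are 10 units apart.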