How to get the minimum distance between two columns

Time: 2019-08-15 15:45:07

Tags: r

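I want to find the minimum distance for each patient across two columns, but the same patient name can appear in both the Patient1 and the Patient2 column.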

Patient1    Patient2    Distance
A           B           8
A           C           11
A           D           19
A           E           23
B           F           6
C           G           25

So the output I need is:

Patient Patient_closest_distance Distance
A       B                        8
B       F                        6
C       A                        11

I tried using the list function:

library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]

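However, this only gives me the minimum distance within each column, so C ends up with two results (one per column) instead of a single closest patient determined across both columns. Also, I only get a list of distances, so I can't see which patient is paired with which: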

   Patient1 SNP
1:        A   8

1 answer:

Answer 0 (score: 1)

The code below works.

# Create sample data frame
df <- data.frame(
  Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
  Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
  Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variables (instead of factors)
df$Patient1 <- as.character(df$Patient1)
df$Patient2 <- as.character(df$Patient2)

# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)

# For each ordered pair, keep the minimum distance
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), Distance = min(Distance))
df <- ungroup(df)

# Re-sort so the smallest distances come first
nearest <- df[order(df$Distance), ]
# Keep only the first (closest) row for each patient
nearest <- nearest[!duplicated(nearest$Patient1), ]
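
For reference, the same idea (mirror the pairs, sort by distance, keep the first row per patient) can be sketched in base R on the question's own data. This is only a sketch under assumptions: the name Full_data follows the question, while the names both and nearest and the column name Closest are made up here for illustration.

```r
# The question's sample data (Full_data is the name used in the question)
Full_data <- data.frame(
  Patient1 = c("A", "A", "A", "A", "B", "C"),
  Patient2 = c("B", "C", "D", "E", "F", "G"),
  Distance = c(8, 11, 19, 23, 6, 25),
  stringsAsFactors = FALSE
)

# Stack every pair together with its mirror image, so that each patient
# shows up in the first column regardless of where it was recorded.
# The column names Patient and Closest are illustrative assumptions.
both <- rbind(
  data.frame(Patient = Full_data$Patient1, Closest = Full_data$Patient2,
             Distance = Full_data$Distance, stringsAsFactors = FALSE),
  data.frame(Patient = Full_data$Patient2, Closest = Full_data$Patient1,
             Distance = Full_data$Distance, stringsAsFactors = FALSE)
)

# Sort by patient, then by distance, and keep the first row per patient
both <- both[order(both$Patient, both$Distance), ]
nearest <- both[!duplicated(both$Patient), ]
rownames(nearest) <- NULL

print(nearest)
```

With this data, the rows for A, B and C pair them with B (8), F (6) and A (11), which matches the output asked for in the question. Patients that only ever appear in the Patient2 column (D to G) also get a row; filter them out if they are not needed.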