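How do I find the matches between 2 vectors and print a matrix with the matching entries and their corresponding values from the longer vector?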

Time: 2019-06-04 02:37:56

Tags: r

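I have a long 2-column CSV (about 700,000 rows). One column holds positions, each written as cg followed by 8 digits (e.g. cg12345678), and the other column holds the corresponding r value (a single number between -1 and 1). The other CSV is much smaller (about 20 rows) and contains a single column with some cg positions. I only want to print the r values from the large file that correspond to the cg positions in the small file.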

Here is an abbreviated example of the 2-column CSV:

cg07881041  -0.0192398465425986
*cg03513874 -0.339360471677652
cg25458538  0.0451334622844003
*cg09261072 0.208770797055665
cg02404579  -0.0166889943192668
cg22585117  -0.340873841270817
*cg25552317 -0.0372823043801581

Here is an example of the 1-column CSV:

cg08829765
*cg25552317
*cg09261072
cg14370485
*cg03513874
cg10855276
cg12406992

In this example, I added stars to the 3 positions that match. So I would like to print the following matrix:

Matching cg  corresponding rvalue
cg03513874  -0.339360471677652
cg09261072  0.208770797055665
cg25552317  -0.0372823043801581

1 Answer:

Answer 0: (score: 0)

Here is a dplyr approach:

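library(dplyr)
# left_join() keeps every row of df_1col and fills p_value with NA
# where the cg id has no match in df_2col; filter() then drops those rows.
df_1col %>%
  left_join(df_2col) %>%
  filter(!is.na(p_value))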

#Joining, by = "cg"
#          cg    p_value
#1 cg25552317 -0.0372823
#2 cg09261072  0.2087708
#3 cg03513874 -0.3393605
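
The same lookup can also be sketched in base R without any packages. This is only an illustration under a few assumptions: the toy data is retyped from the question, the column name `r_value` is chosen here (the answer's data calls the same column `p_value`), and the variable names `cg_small`, `matched` and `matched_small_order` are hypothetical.

```r
# Toy data retyped from the question; the column name r_value is an assumption.
df_2col <- data.frame(
  cg = c("cg07881041", "cg03513874", "cg25458538", "cg09261072",
         "cg02404579", "cg22585117", "cg25552317"),
  r_value = c(-0.0192398465425986, -0.339360471677652, 0.0451334622844003,
              0.208770797055665, -0.0166889943192668, -0.340873841270817,
              -0.0372823043801581),
  stringsAsFactors = FALSE
)
cg_small <- c("cg08829765", "cg25552317", "cg09261072", "cg14370485",
              "cg03513874", "cg10855276", "cg12406992")

# Option 1: keep the rows of the large table whose cg id appears in the
# small vector. The result follows the row order of the large table.
matched <- df_2col[df_2col$cg %in% cg_small, ]

# Option 2: match() returns, for each id in the small vector, its row index
# in the large table (NA when absent). The result follows the small vector.
idx <- match(cg_small, df_2col$cg)
matched_small_order <- df_2col[idx[!is.na(idx)], ]

print(matched, row.names = FALSE)
```

Option 1 returns the rows in the order of the large table, which is the order shown in the question's desired output; option 2 returns them in the order of the small list, like the `left_join()` result above. With the real files, `df_2col` and `cg_small` would presumably come from `read.csv()` rather than being typed inline.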

Source data:

df_2col <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "cg  p_value
  cg07881041  -0.0192398465425986
cg03513874 -0.339360471677652
cg25458538  0.0451334622844003
cg09261072 0.208770797055665
cg02404579  -0.0166889943192668
cg22585117  -0.340873841270817
cg25552317 -0.0372823043801581")

df_1col <- data.frame(cg = c("cg08829765","cg25552317",
                         "cg09261072","cg14370485",
                         "cg03513874","cg10855276",
                         "cg12406992"), stringsAsFactors = FALSE)