Question

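I want to compare two data sets. One has many columns (df1, which has two columns in this example) and the other has a single column (df2).

First, I want to check the first column of df1: if any part of an entry matches any part of an entry in df2, the row number in df1 and the row number in df2 should be written out.

For example,

in the first column of df1, row 3 contains a part, Q9Y6Q9, that also appears in row 4 of df2. So the output is 3-4, and likewise for the other matches.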

Answer 1

Perhaps you should normalize your data first. For example, you could do this:

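normalize <- function(x, delim) {
    # Strip parentheses from the values
    x <- gsub(")", "", as.character(x), fixed=TRUE)
    x <- gsub("(", "", x, fixed=TRUE)
    # Split every row into its delimiter-separated parts
    parts <- strsplit(x, delim, fixed=TRUE)
    # Repeat each row number once per part found in that row;
    # lengths() keeps idx and the names aligned even with a trailing delimiter
    idx <- rep(seq_along(x), times=lengths(parts))
    # Result: row numbers, named by the part they came from
    return(setNames(idx, unlist(parts)))
}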
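
To make the shape of the normalized data concrete, here is a minimal sketch on a made-up two-row column. The IDs in toy are hypothetical and not taken from the question's data, and normalize is restated in compact form only so the snippet runs on its own:

```r
# Compact restatement of normalize() so this demo is self-contained.
normalize <- function(x, delim) {
    x <- gsub("[()]", "", as.character(x))        # drop parentheses
    parts <- strsplit(x, delim, fixed = TRUE)     # one character vector per row
    setNames(rep(seq_along(x), times = lengths(parts)), unlist(parts))
}

# Hypothetical toy column: row 1 holds two IDs, row 2 holds one in parentheses.
toy <- c("P12345;Q9Y6Q9", "(O00151)")
idx <- normalize(toy, ";")
print(idx)
# P12345 Q9Y6Q9 O00151
#      1      1      2
```

Each name is one part of an entry and each value is the row it came from, which is what makes the lookup by name in the next step possible.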

This function can be applied to each column of df1 as well as to the lookup table df2:

s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")

With the data normalized this way, the output you are looking for is easy to generate:

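process <- function(s) {
    # Look up every part of s in the lookup table; parts not present give NA
    lookup_try <- lookup[names(s)]
    # Positions in s whose part was found in df2
    found <- which(!is.na(lookup_try))
    # The matching df2 row numbers
    pos <- lookup_try[found]
    # Pair each df1 row number with the df2 row number it matched
    return(paste(s[found], pos, sep="-"))
    # To get only the df2 row numbers, as asked for in the comment,
    # change the last line to "return(as.character(pos))"
}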

process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4"  "3-15" "7-16"
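
The matching step relies on indexing a named vector by name, which yields NA for names that are absent. A small sketch of just that idiom, assuming hypothetical IDs and row numbers (not the question's data):

```r
# Hypothetical lookup table: ID -> row number in df2.
lookup <- c(Q9Y6Q9 = 4L, P12345 = 1L)

# Hypothetical normalized column: ID -> row number in df1.
s <- c(O00151 = 1L, Q9Y6Q9 = 3L, P12345 = 4L)

hits <- lookup[names(s)]          # NA where an ID is absent from the lookup
found <- which(!is.na(hits))      # positions in s that have a match
pairs <- paste(s[found], hits[found], sep = "-")
print(pairs)
# [1] "3-4" "4-1"
```

Note that indexing by name returns only the first element with that name, so an ID that occurs in several rows of df2 is reported once, with its first row.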

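The output is not exactly the same as the one in the question, but that may be due to mistakes in the manual lookup.

To iterate over all columns of df1, you can use lapply: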

res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)

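Now res is a list indexed by the column names of df1 (the output below comes from the variant of process that returns as.character(pos)):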

res[["sample_1"]]
# [1] "4" "1" "4"
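
For reference, the whole pipeline can be sketched end to end on toy data. Everything in the data frames below (IDs, column names sample_a, sample_b and ids) is made up for illustration, and process here takes the lookup as an explicit argument instead of reading a global, which is a small deviation from the version above:

```r
normalize <- function(x, delim) {
    x <- gsub("[()]", "", as.character(x))
    parts <- strsplit(x, delim, fixed = TRUE)
    setNames(rep(seq_along(x), times = lengths(parts)), unlist(parts))
}

# Same logic as process(), but with the lookup passed in explicitly.
process <- function(s, lookup) {
    hits <- lookup[names(s)]
    found <- which(!is.na(hits))
    paste(s[found], hits[found], sep = "-")
}

# Hypothetical toy data; the IDs and column names are invented.
df1 <- data.frame(
    sample_a = c("A1;B2", "C3", "D4;E5"),
    sample_b = c("F6", "B2;G7", "H8"),
    stringsAsFactors = FALSE
)
df2 <- data.frame(ids = c("E5,X0", "B2", "(C3)"), stringsAsFactors = FALSE)

lookup <- normalize(df2$ids, ",")
# lapply over a data frame visits each column and keeps the column names.
res <- lapply(df1, function(col) process(normalize(col, ";"), lookup))
print(res)
# $sample_a
# [1] "1-2" "2-3" "3-1"
#
# $sample_b
# [1] "2-2"
```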
