Question

I want to match two data frames based on a column. My data frames are given below.

df <- structure(list(Read = structure(1:3, .Label = c("CC", "CG", "GC"
), class = "factor"), index = c(6L, 7L, 10L)), .Names = c("Read", 
"index"), row.names = c(NA, -3L), class = "data.frame")

df1 <- structure(list(Ref_base = structure(c(1L, 6L, 4L, 2L, 3L, 4L, 
3L, 5L), .Label = c("AT", "CC", "CG", "GC", "GT", "TG"), class = "factor"), 
    index = c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)), .Names = c("Ref_base", 
"index"), row.names = c(NA, -8L), class = "data.frame")

I use match to find the matches between the two data frames:

match(df$index,df1$index)

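It gives me the correct result, 4 5 3, as the indices of the matches. But I want to lock position 4, which is the index of the first match, and perform the remaining matches after 4, i.e. after the first index. I do not want to search outside that range, that is, before the index of the first match. For example, I am interested in getting the indices 4, 5, 6 returned, including duplicates (if any).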
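To make the behaviour described above concrete, here is a small sketch using plain vectors that hold the same values as df$index and df1$index (the names idx, ref, first_hits and all_tens are only illustrative). match reports the first occurrence of each value, which is why 10 maps to position 3 even though it also appears at position 6.

```r
idx <- c(6L, 7L, 10L)                          # same values as df$index
ref <- c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)   # same values as df1$index

# match() returns the position of the FIRST occurrence of each value
first_hits <- match(idx, ref)
first_hits
#[1] 4 5 3

# 10 occurs twice in ref; the second occurrence is the one wanted here
all_tens <- which(ref == 10L)
all_tens
#[1] 3 6
```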

Answer 1

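The first solution is basically nothing more than a loop. It iterates over all the search elements (here df$index) and returns the matching index in tmp. The variable search_start makes the next search begin right after the most recent match. Because search_start is defined outside the anonymous function passed to sapply, you have to use <<- rather than = or <- to modify it. There is also some code to handle NAs (this was missing from the first version of my answer).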

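match_sapply <- function(a, b) {
  search_start <- 1
  tmp2 <- sapply(a, function(x) {
    # search only the part of b from search_start onwards; the logical
    # index stays correct even when search_start exceeds length(b)
    tmp <- match(x, b[seq_along(b) >= search_start])
    # <<- updates search_start in the enclosing function
    search_start <<- search_start + ifelse(is.na(tmp), 0, tmp)
    tmp
  })
  # tmp2 holds positions relative to each search window; replacing the
  # non-NA elements with their cumulative sum gives absolute indices into b
  `[<-`(tmp2, !is.na(tmp2), cumsum(tmp2[!is.na(tmp2)]))
}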

match_sapply(c(50,df$index,20),df1$index)
#[1] NA  4  5  6 NA
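
Since this solution is essentially a loop, the same idea can also be written with an explicit for loop, which avoids <<- altogether. This is only a sketch of an equivalent formulation; the name match_loop is made up for illustration and is not part of the answer above.

```r
# Sequential matching with an explicit for loop (illustrative sketch).
match_loop <- function(a, b) {
  result <- rep(NA_integer_, length(a))
  search_start <- 1L
  for (i in seq_along(a)) {
    if (search_start > length(b)) break      # nothing left to search
    # look only at the part of b from search_start onwards
    pos <- match(a[i], b[search_start:length(b)])
    if (!is.na(pos)) {
      result[i] <- search_start + pos - 1L   # convert to an absolute index in b
      search_start <- result[i] + 1L         # next search starts after this match
    }
  }
  result
}

match_loop(c(50, 6, 7, 10, 20), c(4, 15, 10, 6, 7, 10, 7, 12))
#[1] NA  4  5  6 NA
```

Here the absolute index is computed directly at each step, so no cumulative sum is needed at the end.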

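Another version uses Recall, which makes it a recursive approach. Recall calls the function from which it was itself called (match_recall in our case), but you can supply different arguments. The arguments of match_recall are: x, the search terms; y, the target vector; n, the recursion level (which also selects the specific element of x); and the start index (playing the same role as search_start in the previous solution). Again, there is some code to handle NAs.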
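
A minimal sketch of what such a recursive function might look like, assuming the four arguments described above. The names match_recall_sketch, start_index, rel and next_start are chosen for illustration and are not necessarily those used in the original answer.

```r
# Recursive sequential matching via Recall (illustrative sketch).
# x: search terms, y: target vector, n: recursion level (index into x),
# start_index: first position of y that may still be searched.
match_recall_sketch <- function(x, y, n = 1L, start_index = 1L) {
  if (n > length(x)) return(integer(0))      # all search terms processed
  pos <- NA_integer_
  if (start_index <= length(y)) {
    # position of x[n] relative to the remaining part of y
    rel <- match(x[n], y[start_index:length(y)])
    if (!is.na(rel)) pos <- start_index + rel - 1L   # absolute index in y
  }
  # on NA keep the start index; otherwise continue after the match
  next_start <- if (is.na(pos)) start_index else pos + 1L
  c(pos, Recall(x, y, n + 1L, next_start))
}

match_recall_sketch(c(50, 6, 7, 10, 20), c(4, 15, 10, 6, 7, 10, 7, 12))
#[1] NA  4  5  6 NA
```

Recursion depth grows with the number of search terms, so for long inputs a loop-based version is the safer choice.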


Using the match function dynamically
