I am trying to count the matching items between strings:
target_str = "a,b,c"
table1 = data.frame(name = c("p1","p2","p3","p4"),
str = c("a,b","a","d,e,f","a,a"))
I want to count the number of matches against target_str. I would like my output table to look like this:
name matches
p1 2 #matches a and b
p2 1 #matches a
p3 0 #no matches
p4 1 #if has duplicate, count only once
I have about 1 million target_str values to count matches for, so speed is very important. Any suggestions are appreciated. Thanks in advance!
Answer 0 (score: 2)
target_str = "a,b,c"
split_str <- strsplit(target_str, split = ",")[[1]]
table1 = data.frame(name = c("p1","p2","p3","p4"),
str = c("a,b","a","d,e,f","a,a"))
data.frame(name = table1$name,
matches = rowSums(sapply(split_str, grepl, x = table1$str)))
# name matches
# 1 p1 2
# 2 p2 1
# 3 p3 0
# 4 p4 1
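One caveat with the grepl approach above: each target is treated as a regular expression and matched as a substring, so a target "a" would also match a token such as "ab". The following is a minimal sketch of exact token matching, assuming tokens are always separated by a single comma with no surrounding spaces. The row "ab" is made-up data added only to show the difference.

```r
# Hypothetical data: "ab" contains "a" as a substring but is a different token.
target_str <- "a,b,c"
strs <- c("a,b", "ab", "d,e,f", "a,a")

tokens <- strsplit(target_str, split = ",", fixed = TRUE)[[1]]

# Pad each string with commas so every token is delimited on both sides,
# then search for ",token," literally (fixed = TRUE disables regex matching).
padded <- paste0(",", strs, ",")
hits <- sapply(tokens, function(tok) grepl(paste0(",", tok, ","), padded, fixed = TRUE))

# One row per string, one column per target token; each token counts at most once.
matches <- rowSums(hits)
```

Because each target token contributes at most one TRUE per row, duplicates such as "a,a" are still counted only once.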
Answer 1 (score: 1)
This should be fairly fast:
# target string split into a character vector:
target_str <- unlist(strsplit("a,b,c", split = ","))
# split each observation's string (as.character guards against a factor column):
stringList <- strsplit(as.character(table1$str), split = ",")
# count distinct matches per row (unique() so that "a,a" counts once), put into data.frame
table1$Counts <- sapply(stringList, function(i) sum(unique(i) %in% target_str))
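Since the question mentions roughly 1 million target strings, it may help to split and de-duplicate the table's strings once and reuse them for every target. The sketch below assumes that reading of the question. count_matches is a hypothetical helper, and the second target "d,x" is made up for illustration.

```r
table1 <- data.frame(name = c("p1", "p2", "p3", "p4"),
                     str = c("a,b", "a", "d,e,f", "a,a"),
                     stringsAsFactors = FALSE)

# Split and de-duplicate each row's tokens once, outside the per-target work.
row_tokens <- lapply(strsplit(table1$str, split = ",", fixed = TRUE), unique)

# Hypothetical helper: for one target string, count the distinct tokens
# of each row that also appear in the target.
count_matches <- function(target) {
  target_tokens <- strsplit(target, split = ",", fixed = TRUE)[[1]]
  vapply(row_tokens, function(tok) sum(tok %in% target_tokens), integer(1))
}

targets <- c("a,b,c", "d,x")  # "d,x" is a made-up second target
result <- sapply(targets, count_matches)
rownames(result) <- table1$name
```

Here result is a matrix with one row per name and one column per target. For a very large number of targets, filling a matrix like this may use a lot of memory, so processing targets in chunks is one option to consider.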
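Answer 2 (score: 1)
This cbinds the counts to the first column, which is kept as a data frame by using drop = FALSE. The counts are summed from successive tests for "in-ness" with grepl: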
cbind( table1[ ,1,drop=FALSE], counts=rowSums(sapply( scan(text=target_str, sep= ",", what=""), function(t) { grepl( t, table1$str)})) )
Read 3 items
name counts
a p1 2
b p2 1
c p3 0