I have a very large character vector:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
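For each string, I now want to find the similar strings within the same vector d.
This is how I am doing it:
1. Compute the edit distance between each string and all the other strings, subject to some rules; for example, if the string contains any digit or has fewer than 5 alphabetic characters, force an exact match.
2. Put the distances, together with the strings, into a data frame Dist.
3. Subset Dist to the rows with distance < 3.
4. Collapse the similar strings and add them to the original data frame as a new column.
I am using the stringr and stringdist packages.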
d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
for (i in 1:M){
  # if string has digits or is of short size (<5) do exact matching
  if (grepl("[[:digit:]]", d[i, "d"]) || str_count(d[i, "d"], "[[:alpha:]]") < 5){
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=0.000001) # maxDist as fraction to force exact matching
  # otherwise do approximate matching
  } else {
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=3)
  }
  # subset similar strings (with edit distance <3)
  subDist <- subset(Dist, dist < 3 )
  # add to original data.frame d
  d[i, "sim"] <- paste(as.character(unlist(subDist$string)), collapse=", ")
}
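Is it possible to vectorize this procedure instead of using a loop? My vector of strings is very large, so memory limits prevent me from computing the full distance matrix with stringdistmatrix on the whole vector. The loop works on the large data, but it is slow.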
Answer 0 (score: 1)
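stringdist has a version that computes all the distances as a matrix (stringdistmatrix), so I think something like this would be an improvement. On my machine it runs about four times as fast when the rep-by-100 line is included: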
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
#d <- rep(d, each=100) #make it a bit longer for timing
d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
# flag every string that needs exact matching (contains a digit or has fewer than 5 letters)
ind_short <- grepl("[[:digit:]]", d$d) | str_count(d$d, "[[:alpha:]]") < 5
short <- stringdistmatrix(d$d[ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[!ind_short], d$d, method="lv", maxDist=3)
d$sim[ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
d$sim[!ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
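The basic strategy is to split the strings into a short group and a long group, use the matrix form of stringdist on each, then collapse the matches with paste and assign the result to d$sim.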
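One possible shortcut for the short group, offered as an assumption rather than something stringdist provides: because those strings are forced to match exactly, the distance computation can be replaced by grouping identical values. `exact_sim` is a hypothetical helper name, and the sketch uses base R only.

```r
# Hypothetical helper (not part of stringdist): for strings that only need
# exact matching, group identical values instead of computing distances.
exact_sim <- function(x) {
  x <- as.character(x)
  groups <- split(x, x)  # one group per distinct string
  collapsed <- vapply(groups, paste, character(1), collapse = ", ")
  unname(collapsed[x])   # look up the collapsed group of each input string
}

d_short <- c("herb", "fun", "casket93", "old", "sim23", "herb")
sim <- exact_sim(d_short)
```

With the data frame from the answer, this could stand in for the `short` matrix via `d$sim[ind_short] <- exact_sim(d$d)[ind_short]`, so no distance matrix is allocated for those rows.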
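Edited to add: following your comment about not being able to process the whole matrix at once, try choosing chunk_length so that stringdistmatrix() can cope with a chunk_length*M matrix. Of course, if you set it to 1 you are back to your original non-vectorized form.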
chunk_length <- 100
# flag every string that needs exact matching (contains a digit or has fewer than 5 letters)
ind_short <- grepl("[[:digit:]]", d$d) | str_count(d$d, "[[:alpha:]]") < 5
d$iter <- rep(1:M,each=chunk_length,length.out=M)
for (i in unique(d$iter))
{
  sel_short <- d$iter == i & ind_short
  sel_long <- d$iter == i & !ind_short
  # skip empty groups; matrix() keeps one row per query string even when only one is selected
  if (any(sel_short)) {
    short <- matrix(stringdistmatrix(d$d[sel_short], d$d, method="lv", maxDist=0.000001), nrow=sum(sel_short))
    d$sim[sel_short] <- apply(short, 1, function(x) paste(as.character(d$d[x<3]), collapse=", "))
  }
  if (any(sel_long)) {
    long <- matrix(stringdistmatrix(d$d[sel_long], d$d, method="lv", maxDist=3), nrow=sum(sel_long))
    d$sim[sel_long] <- apply(long, 1, function(x) paste(as.character(d$d[x<3]), collapse=", "))
  }
}
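As a rough way to pick chunk_length for the loop above, one could derive it from a memory budget. This is only a sketch: it assumes each chunk's distance matrix holds chunk_length x M doubles at 8 bytes each and ignores any other overhead, and `chunk_length_for` and `budget_bytes` are hypothetical names.

```r
# Hypothetical helper: largest chunk_length whose chunk_length x M matrix of
# doubles (8 bytes per entry) fits within budget_bytes; never below 1.
chunk_length_for <- function(M, budget_bytes) {
  max(1L, as.integer(floor(budget_bytes / (8 * M))))
}

# e.g. one million strings and roughly 1 GB set aside for the distance matrix
chunk_length <- chunk_length_for(M = 1e6, budget_bytes = 1e9)
```

In practice it is probably wise to leave generous headroom below the real memory limit, since apply() and paste() allocate on top of the matrix itself.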
Answer 1 (score: 0)
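This isn't really an answer, but I thought it might be worth mentioning that agrep could be useful to you in this project. It does partial (approximate) pattern matching.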
> d <- c("herb", "market", "merchandise", "fun", "casket93",
"old", "herbb", "basket", "bottle", "plastic", "baskket",
"markket", "pasword", "plastik", "oldg", "mahagony",
"mahaagoni", "sim23", "asket", "trump" )
> agr <- sapply(d, function(x) agrep(x, d, value = TRUE))
> head(agr)
$herb
[1] "herb" "herbb"
$market
[1] "market" "markket"
$merchandise
[1] "merchandise"
$fun
[1] "fun"
$casket93
[1] "casket93"
$old
[1] "old" "pasword" "oldg"
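Note that "old" picks up "pasword" in the output above because agrep looks for an approximate match anywhere inside each element, not against the whole string. If whole-string distances are what is wanted, base R's adist may be closer to the question's rule; the following is a small sketch on three of the strings, not a replacement for the stringdist approach.

```r
d3 <- c("old", "pasword", "oldg")

# approximate match of the pattern anywhere inside each element
agrep("old", d3, value = TRUE)

# generalized Levenshtein distance against the whole of each element
whole <- drop(adist("old", d3))

# keep only elements whose whole-string distance is below 3
similar <- d3[whole < 3]
```

Here `similar` keeps "old" and "oldg" and drops "pasword", since turning "old" into "pasword" needs at least four insertions.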