I have a data frame that looks like this:
Word<-c("bat", "cat", "cab", "some", "ban", "bait", "at", "done", "dot", "ran", "cant")
S1<-c("b","c","c","s", "b", "b", "a", "d","d", "r", "c")
S2<-c("a","a","a","o","a","a","t","o","o","a","a")
S3<-c("t","t","b","m", "n", "i", "", "n","t", "n", "n")
S4<-c("","","","e", "", "t", "", "e","", "", "t")
df<-data.frame(Word, S1, S2, S3, S4, stringsAsFactors=FALSE)
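I want to compute the number and the names of similar-sounding words. Two words sound similar if they differ by exactly one sound, through an addition, a substitution, or a deletion. In short, I want something like this: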
Word<-c("bat", "cat", "cab", "some", "ban", "bait", "at", "done", "dot", "ran", "cant")
S1<-c("b","c","c","s", "b", "b", "a", "d","d", "r", "c")
S2<-c("a","a","a","o","a","a","t","o","o","a","a")
S3<-c("t","t","b","m", "n", "i", "", "n","t", "n", "n")
S4<-c("","","","e", "", "t", "", "e","", "", "t")
Number<-c(4,4,1,0,2,1,2,0,0,1,2)
Names<-c("cat, ban, bait, at", "bat, cab, at, cant","cat","","bat, ran","bat","bat, cat","","","ban","can, cat")
df2<-data.frame(Word, S1, S2, S3, S4, Number, Names, stringsAsFactors=FALSE)
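Answer 0 (score: 3)
If I understand correctly, you are looking for the Levenshtein distance between the words. The adist function in the utils package computes Levenshtein distances for you. It returns a matrix whose entry [i, j] is the number of substitutions, insertions, and deletions needed to turn the i-th word into the j-th word.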
dist <- utils::adist(Word)
dist
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,]    0    1    2    4    1    1    1    4    2     2     2
 [2,]    1    0    1    4    2    2    1    4    2     2     1
 [3,]    2    1    0    4    2    3    2    4    3     2     2
 [4,]    4    4    4    0    4    4    4    2    3     4     4
 [5,]    1    2    2    4    0    2    2    3    3     1     2
 [6,]    1    2    3    4    2    0    2    4    3     3     2
 [7,]    1    1    2    4    2    2    0    4    2     2     2
 [8,]    4    4    4    2    3    4    4    0    2     3     3
 [9,]    2    2    3    3    3    3    2    2    0     3     3
[10,]    2    2    2    4    1    3    2    3    3     0     2
[11,]    2    1    2    4    2    2    2    3    3     2     0
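As a quick sanity check of how the distance is counted, here is a minimal sketch that compares one word against a few others from the list. The variable name `d_bat` is only illustrative. A single substitution, insertion, or deletion each counts as 1.

```r
# Compare "bat" against four other words from the example.
# cat  = one substitution (b -> c)
# bait = one insertion (i)
# at   = one deletion (b)
# cab  = two substitutions (b -> c, t -> b)
d_bat <- utils::adist("bat", c("cat", "bait", "at", "cab"))
d_bat
# a 1 x 4 matrix containing 1 1 1 2
```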
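You can then loop over the rows (or columns, since the matrix is symmetric) and return every word whose distance is exactly 1.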
links <- apply(dist, 1, function(d) {
  paste0(Word[d == 1], collapse = ", ")
})
cbind.data.frame(Word, links)
   Word              links
1   bat cat, ban, bait, at
2   cat bat, cab, at, cant
3   cab                cat
4  some
5   ban           bat, ran
6  bait                bat
7    at           bat, cat
8  done
9   dot
10  ran                ban
11 cant                cat
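You have now derived the first and last columns of df2 programmatically. Note that for cant this yields only cat, because can is not among the words in Word. For the counts, you can simply use: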
counts <- apply(dist, 1, function(d){sum(d == 1)})
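Putting the pieces together, the following is a minimal sketch of how the Number and Names columns could be added to the data frame in one go. It computes the distance on the sound columns rather than on the spelling, under the assumption (true for this example, but not in general) that every sound is written as a single character, so that one character in the pasted string equals one sound. The names `sounds`, `d`, and `is_neighbour` are illustrative.

```r
Word <- c("bat", "cat", "cab", "some", "ban", "bait", "at", "done", "dot", "ran", "cant")
S1 <- c("b", "c", "c", "s", "b", "b", "a", "d", "d", "r", "c")
S2 <- c("a", "a", "a", "o", "a", "a", "t", "o", "o", "a", "a")
S3 <- c("t", "t", "b", "m", "n", "i", "", "n", "t", "n", "n")
S4 <- c("", "", "", "e", "", "t", "", "e", "", "", "t")
df <- data.frame(Word, S1, S2, S3, S4, stringsAsFactors = FALSE)

# Assumption: each sound is exactly one character, so concatenating the
# sound columns gives a string in which one character is one sound.
sounds <- do.call(paste0, df[c("S1", "S2", "S3", "S4")])

# Pairwise Levenshtein distances between the sound strings
d <- utils::adist(sounds)

# TRUE where two words differ by exactly one sound
is_neighbour <- d == 1

# Count the neighbours per word and collect their names
df$Number <- rowSums(is_neighbour)
df$Names <- apply(is_neighbour, 1, function(nb) paste(df$Word[nb], collapse = ", "))
df
```

If a sound can span several characters (for example "sh"), this shortcut no longer holds, and each sound would first need to be mapped to a unique single character before calling adist.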