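I asked a question here that turned out to be hard to solve: how can I group based on similarity in strings. I have since come up with an idea that I would like to try.
Here are my idea and my data (the same data as in that question):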
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
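1- Count the number of letters in each string in each row.
2- Run adist between each pair of strings.
If the adist output for a pair is within about 1, the two strings belong to one group; if not, they belong to two different groups.
To do this, I need to know how to run adist on all the strings in the first column of my data.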
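The idea above could be sketched roughly as follows. This is only a sketch under assumptions that are not in the question: the labels are trimmed with trimws() because the leading and trailing spaces look unintended, the grouping uses single-linkage clustering (hclust plus cutree) as one possible way to turn distances into groups, and max_dist is a hypothetical threshold chosen purely for illustration.

```r
# Labels copied from the question's data. trimws() drops the stray leading and
# trailing spaces (an assumption: they would otherwise count as edits).
labels <- trimws(c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
                   "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
                   "USA", "USAargentina ", " USAargentinabrazil"))

# With a single argument, adist() compares every string with every other one
# and returns a square matrix of generalized Levenshtein distances.
d <- adist(labels)
dimnames(d) <- list(labels, labels)

# Hypothetical grouping step: single-linkage clustering, cut at a distance
# threshold. max_dist is an illustrative knob, not a recommended value.
max_dist <- 5
groups <- cutree(hclust(as.dist(d), method = "single"), h = max_dist)
split(labels, groups)
```

A raw edit distance grows with string length, so a fixed threshold may treat short and long labels very differently; that trade-off is left open here.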
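So my questions are:
1- Is there a function that does the opposite of adist?
2- How can I run adist on all combinations, once each, going from the longest string to the shortest? For example: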
adist("Afghanestankabolindia","Afghanestan")
adist("Afghanestankabolindia","Afghanestankabol")
adist("Afghanestankabolindia","indiaAfghanestan")
adist("Afghanestankabolindia","Holandnorway")
adist("Afghanestankabolindia","holand")
adist("Afghanestankabolindia","holandindia")
.
.
.
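The tricky part is that each comparison should happen only once, between the reference and another string. For example, the distance should be computed only between
Afghanestankabolindia and Afghanestan
and not between
Afghanestan and Afghanestankabolindia
which means the reference is always the longest string.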
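As a small check on the "only once" requirement: with adist's default costs (insertion, deletion and substitution each cost 1), the distance should not depend on argument order, so one direction per pair is enough. A minimal sketch, with the variable names a and b chosen here for illustration:

```r
a <- "Afghanestankabolindia"
b <- "Afghanestan"

# b is a prefix of a, so turning one into the other takes 10 insertions
# (or 10 deletions in the other direction).
adist(a, b)  # 10, returned as a 1 x 1 matrix
adist(b, a)  # 10 as well
```

This symmetry only holds while the costs for insertions and deletions are equal; a custom costs argument with different values would break it.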
Answer 0 (score: 0)
I'm not sure what output format you expect, but I think this does what you're asking for:
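ref = as.character(df$label)
# Sort longest first so that V1 is always the longer (reference) string of each pair
all_combs = as.data.frame(t(combn(ref[order(nchar(ref), decreasing = TRUE)], 2)), stringsAsFactors = FALSE)
# Edit distance for each pair
all_combs$val = mapply(adist, all_combs$V1, all_combs$V2)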
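First, we create all combinations, sorting the ref vector so that the first element of each pair is always the longer one (i.e. the reference). Then we use mapply to compute adist for every combination.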
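An alternative that might be worth trying is to let adist build the whole distance matrix in one call and then keep only the upper triangle, which contains each pair exactly once. This is a sketch rather than a drop-in replacement: the labels are typed in by hand in the same order as df$label (spaces included), and the names d, idx and pairs are introduced here for illustration.

```r
# Labels exactly as stored in df$label, including the stray spaces.
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# Longest first, so a lower index always means a longer (or equally long) string.
ref <- labels[order(nchar(labels), decreasing = TRUE)]

# Full pairwise distance matrix in a single call.
d <- adist(ref)

# Upper triangle: row < col, so each pair appears once with the longer string first.
idx <- which(upper.tri(d), arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), ]

pairs <- data.frame(V1 = ref[idx[, "row"]],
                    V2 = ref[idx[, "col"]],
                    val = d[idx],
                    stringsAsFactors = FALSE)
head(pairs)
```

Because the spaces are kept, they count as characters in the distances; wrapping the labels in trimws() first would change some of the values.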
Output:
   V1                    V2                 val
1  Afghanestankabolindia USAargentinabrazil  15
2  Afghanestankabolindia indiaAfghanestan    15
3  Afghanestankabolindia Afghanestankabol     5
4  Afghanestankabolindia Holandnorway        17
5  Afghanestankabolindia USAargentina        17
6  Afghanestankabolindia Afghanestan         10
7  Afghanestankabolindia holandindia         13
8  Afghanestankabolindia holand              16
9  Afghanestankabolindia USA                 21
10 USAargentinabrazil    indiaAfghanestan    16
11 USAargentinabrazil    Afghanestankabol    13
12 USAargentinabrazil    Holandnorway        14
13 USAargentinabrazil    USAargentina         7
14 USAargentinabrazil    Afghanestan         15
15 USAargentinabrazil    holandindia         13
16 USAargentinabrazil    holand              16
17 USAargentinabrazil    USA                 16
18 indiaAfghanestan      Afghanestankabol    10
19 indiaAfghanestan      Holandnorway        14
... ..... ..... ..
Hope this helps!