How to run adist on all pairs

Date: 2017-12-02 17:31:19

Tags: r

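I asked a question here that turned out to be hard to solve: "how can I group based on similarity in strings". I have since come up with an idea that I would like to try.

Here is my idea, along with the data (the same data as in that question):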

df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L, 
    9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway", 
    " USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia", 
    "indiaAfghanestan ", "USA", "USAargentina "), class = "factor"), 
        value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L, 
        8L), .Label = c("1941029507", "2367321518", "2849255881", 
        "2913128511", "2927576083", "4550996370", "457707181.9", 
        "637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label", 
    "value"), class = "data.frame", row.names = c(NA, -10L))

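1. I count the number of letters in the string in each row.
2. I run adist between each pair of strings.

If the adist output is something like 1, the two strings belong to one group; otherwise they belong to two different groups.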

To solve the above, I need to know how to run adist on all the strings in the first column of my data.
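Two details of that first column are worth checking before measuring anything: `label` is stored as a factor, and several labels carry a leading or trailing space (for example " holand" and "Afghanestan "). A minimal sketch, assuming a hypothetical three-label sample shaped like `df$label`, and assuming the surrounding spaces carry no meaning (the question does not say either way):

```r
# Hypothetical sample shaped like df$label: a factor with stray spaces
label <- factor(c("Afghanestankabolindia", "Afghanestan ", " holand"))

# Convert to character first; nchar() does not accept a factor
x <- as.character(label)
len_raw <- nchar(x)          # lengths, counting the stray spaces

# Assumption: the surrounding spaces are not meaningful, so strip them
x_trim <- trimws(x)
len_trim <- nchar(x_trim)    # lengths after trimming

print(data.frame(x, len_raw, x_trim, len_trim))
```

Spaces count as ordinary characters for both `nchar` and `adist`, so trimming (or not) changes both the length ordering and the distances.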

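So my questions are the following:

1. Is there a function that does the opposite of adist?
2. How can I run adist on all combinations, once each, going from the longest string to the shortest? For example: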

adist("Afghanestankabolindia","Afghanestan")
adist("Afghanestankabolindia","Afghanestankabol")
adist("Afghanestankabolindia","indiaAfghanestan")
adist("Afghanestankabolindia","Holandnorway")
adist("Afghanestankabolindia","holand")
adist("Afghanestankabolindia","holandindia")
.
.
.

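The tricky part is that each comparison should happen only once, between the reference and the other string. For example, the distance should be computed only once, between

Afghanestankabolindia and Afghanestan

and not again between

Afghanestan and Afghanestankabolindia 

meaning that the reference is always the longest string.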
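With the default costs (one unit each for insertion, deletion and substitution), `adist` is symmetric, so computing each pair in one direction only loses no information. A small check using two of the labels above, with spaces omitted:

```r
a <- "Afghanestankabolindia"
b <- "Afghanestan"

# adist() returns a matrix, so pull out the single cell
d_ab <- adist(a, b)[1, 1]
d_ba <- adist(b, a)[1, 1]

# Both directions give the same distance under the default costs
print(c(d_ab, d_ba))
```

If custom, unequal `costs` were passed to `adist`, the two directions could differ, and the choice of reference would then matter.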

1 Answer:

Answer 0: (score: 0)

I am not sure what output format you expect, but I think this does what you are asking for:

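ref <- as.character(df$label)
# Sort longest first, so the first element of every pair is the reference
sorted_ref <- ref[order(nchar(ref), decreasing = TRUE)]
all_combs <- as.data.frame(t(combn(sorted_ref, 2)), stringsAsFactors = FALSE)
# Edit distance for each pair
all_combs$val <- mapply(adist, all_combs$V1, all_combs$V2)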

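First we create all the combinations, sorting the ref vector by length so that the first element of each pair is always the longer one (i.e. the reference). Then we use mapply to compute adist for every combination.

Output: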

                      V1                  V2 val
1  Afghanestankabolindia  USAargentinabrazil  15
2  Afghanestankabolindia   indiaAfghanestan   15
3  Afghanestankabolindia    Afghanestankabol   5
4  Afghanestankabolindia        Holandnorway  17
5  Afghanestankabolindia       USAargentina   17
6  Afghanestankabolindia        Afghanestan   10
7  Afghanestankabolindia         holandindia  13
8  Afghanestankabolindia              holand  16
9  Afghanestankabolindia                 USA  21
10    USAargentinabrazil   indiaAfghanestan   16
11    USAargentinabrazil    Afghanestankabol  13
12    USAargentinabrazil        Holandnorway  14
13    USAargentinabrazil       USAargentina    7
14    USAargentinabrazil        Afghanestan   15
15    USAargentinabrazil         holandindia  13
16    USAargentinabrazil              holand  16
17    USAargentinabrazil                 USA  16
18     indiaAfghanestan     Afghanestankabol  10
19     indiaAfghanestan         Holandnorway  14
...               .....                .....  ..
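As an alternative sketch, not part of the original answer: `adist` called with a single character vector returns the full pairwise distance matrix, so the same pairs can be read from the part above the diagonal. The four labels below are a hypothetical sample, and the column names `ref` and `other` are made up for illustration; for the real data, substitute `as.character(df$label)`.

```r
# Hypothetical sample; substitute as.character(df$label) for the real data
labels <- c("Afghanestan", "USA", "Afghanestankabolindia", "Afghanestankabol")

# Longest first, so a smaller index always means a longer string
sorted <- labels[order(nchar(labels), decreasing = TRUE)]

# With one argument, adist() returns the symmetric matrix of all pairwise distances
d <- adist(sorted)

# Positions above the diagonal cover each unordered pair exactly once,
# with the row (the longer string) acting as the reference
idx <- which(upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(
  ref   = sorted[idx[, "row"]],
  other = sorted[idx[, "col"]],
  val   = d[idx],
  stringsAsFactors = FALSE
)
print(pairs)
```

This yields choose(n, 2) rows, the same pairs as the `combn` approach, in a different row order.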

Hope this helps!