Finding and estimating the centroid of words (strings)

Date: 2018-12-20 01:25:12

Tags: r string similarity

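Suppose I have the following data frame of car brands. How can I find the centroid of each brand (word) and assign that centroid to the words most "similar" to it? The second column I want to obtain is the normalized label pal_ok.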

db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"))

        pal1
1       fiat
2       fiat
3       fiat
4     fiat 1
5      fiatt
6       fait
7      fiaat
8    renault
9    renault
10   renault
11  renaultt
12 renault 3
13  renaultc
14   remault

db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"),
               pal_ok  =c("fiat","fiat","fiat","fiat","fiat","fiat","fiat","renault","renault","renault","renault","renault","renault","renault"))

        pal1  pal_ok
1       fiat    fiat
2       fiat    fiat
3       fiat    fiat
4     fiat 1    fiat
5      fiatt    fiat
6       fait    fiat
7      fiaat    fiat
8    renault renault
9    renault renault
10   renault renault
11  renaultt renault
12 renault 3 renault
13  renaultc renault
14   remault renault

1 Answer:

Answer 0 (score: 1)

You can try this with the base function adist and a short dplyr chain:

library(dplyr)

# compute the "centroids", taken here to mean the most frequent spellings
pal <- as.data.frame.table(table(db$pal1)) %>%                    # frequency table of the spellings
       arrange(desc(Freq)) %>%                                    # most frequent first
       top_n(2, Freq)                                             # keep the top 2 (ties included);
                                                                  # choose the number to suit your data

 pal
     Var1 Freq
1    fiat    3
2 renault    3 
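
If "centroid" is meant more literally than "most frequent spelling", one possible variant is to cluster the strings on their edit distances and take each cluster's medoid, i.e. the member with the smallest total distance to the rest of its cluster. The sketch below is only an illustration and is not used by the steps that follow: `words`, `d`, `cl` and `medoid` are made-up names, average linkage is an arbitrary choice, and k = 2 still has to be supplied by hand for this toy data.

```r
# Toy data copied from the question
words <- c("fiat", "fiat", "fiat", "fiat 1", "fiatt", "fait", "fiaat",
           "renault", "renault", "renault", "renaultt", "renault 3",
           "renaultc", "remault")

# Pairwise edit distances between all rows; duplicates are kept on purpose
# so that frequent spellings weigh more when the medoid is chosen
d <- adist(words)

# Hierarchical clustering on those distances.
# Assumption: two brands, hence k = 2; average linkage is an arbitrary choice.
cl <- cutree(hclust(as.dist(d), method = "average"), k = 2)

# Medoid of each cluster: the word whose summed distance to the
# other members of its cluster is smallest
medoid <- tapply(seq_along(words), cl, function(idx) {
  sub <- d[idx, idx, drop = FALSE]
  words[idx][which.min(rowSums(sub))]
})

medoid
```

On this data the medoids should coincide with the two most frequent spellings, so the rest of the answer is unaffected by which definition is used.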

Now we can compute the distance between each "centroid" and every word:

# here the distance 
dist <- data.frame(adist(db$pal1,pal$Var1))

# rename the columns, in this case with only two brands
colnames(dist) <- c('fiat','renault')

 dist
   fiat renault
1     0       5
2     0       5
3     0       5
4     2       6
5     1       5
6     2       5
7     1       5
8     5       0
9     5       0
10    5       0
11    6       1
12    7       2
13    6       1
14    5       1
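
To make the numbers in this table easier to read, here is a small sketch of what adist measures for single pairs. adist computes a generalized Levenshtein distance (insertions, deletions and substitutions), so two swapped letters count as two substitutions rather than one transposition. The variable names `d_swap` and `d_extra` are only illustrative.

```r
# adist() ships with base R (package utils), so no extra library is needed

# "fait" vs "fiat": the swapped letters are counted as two substitutions
d_swap <- adist("fait", "fiat")

# "fiatt" vs "fiat": a single deletion
d_extra <- adist("fiatt", "fiat")

d_swap    # 1 x 1 matrix, matches row 6 of the table above
d_extra   # 1 x 1 matrix, matches row 5 of the table above
```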

Now we can pick the centroid with the smallest distance:

cbind(db,dist) %>%                                               # bind data and distances
mutate(pal_calc = ifelse(fiat<renault,'fiat','renault')) %>%     # choose the closer centroid
select(-c(fiat,renault))                                         # remove the distance columns

        pal1  pal_ok pal_calc
1       fiat    fiat     fiat
2       fiat    fiat     fiat
3       fiat    fiat     fiat
4     fiat 1    fiat     fiat
5      fiatt    fiat     fiat
6       fait    fiat     fiat
7      fiaat    fiat     fiat
8    renault renault  renault
9    renault renault  renault
10   renault renault  renault
11  renaultt renault  renault
12 renault 3 renault  renault
13  renaultc renault  renault
14   remault renault  renault
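
The ifelse step above is written for exactly two brands. A minimal base-R sketch of the same idea for any number of centroids could look like the following; the `centroids` vector is assumed to be already known (for example taken from the frequency table), and `d` is just an illustrative name for the distance matrix.

```r
# Data from the question, kept as character strings
db <- data.frame(
  pal1 = c("fiat", "fiat", "fiat", "fiat 1", "fiatt", "fait", "fiaat",
           "renault", "renault", "renault", "renaultt", "renault 3",
           "renaultc", "remault"),
  stringsAsFactors = FALSE
)

# Assumption: the centroids are already known
centroids <- c("fiat", "renault")

# One row per word, one column per centroid
d <- adist(db$pal1, centroids)

# For each row take the column with the smallest distance;
# which.min() returns the first minimum, so ties go to the earlier centroid
db$pal_calc <- centroids[apply(d, 1, which.min)]

db
```

Adding a brand then only means extending `centroids`; no column names or comparisons need to be rewritten.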