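Suppose I have the following data frame of car brand names. How can I find the centroid of each brand (word) and assign every word to the centroid it is most "similar" to? The second column I want to obtain is the normalized label, pal_ok.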
db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"))
        pal1
1       fiat
2       fiat
3       fiat
4     fiat 1
5      fiatt
6       fait
7      fiaat
8    renault
9    renault
10   renault
11  renaultt
12 renault 3
13  renaultc
14   remault
db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"),
pal_ok =c("fiat","fiat","fiat","fiat","fiat","fiat","fiat","renault","renault","renault","renault","renault","renault","renault"))
        pal1  pal_ok
1       fiat    fiat
2       fiat    fiat
3       fiat    fiat
4     fiat 1    fiat
5      fiatt    fiat
6       fait    fiat
7      fiaat    fiat
8    renault renault
9    renault renault
10   renault renault
11  renaultt renault
12 renault 3 renault
13  renaultc renault
14   remault renault
Answer 0 (score: 1)
You can try this with the base function adist and a short dplyr chain:
library(dplyr)

# take the most frequent spellings as the "centroids" (if that is what you mean)
pal <- as.data.frame.table(table(db$pal1)) %>% # frequency table
  arrange(desc(Freq)) %>%                       # most frequent first
  top_n(2, Freq)                                # keep the top 2; pick this number
                                                # according to your data
pal
     Var1 Freq
1    fiat    3
2 renault    3
Now we can compute the distance between each "centroid" and every word:
# edit distance from every word to every "centroid"
dist <- data.frame(adist(db$pal1, as.character(pal$Var1)))
# rename the columns; in this case there are only two brands
colnames(dist) <- c('fiat','renault')
dist
   fiat renault
1     0       5
2     0       5
3     0       5
4     2       6
5     1       5
6     2       5
7     1       5
8     5       0
9     5       0
10    5       0
11    6       1
12    7       2
13    6       1
14    5       1
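To make these numbers easier to read, here is a minimal sketch (not part of the original answer) that runs adist on a few of the misspellings only. adist computes a generalized Levenshtein distance, so it counts insertions, deletions and substitutions; swapping two adjacent letters is not a single operation, which is why "fait" sits at distance 2 from "fiat" in the table above. The names words, centroids and d are chosen for illustration.

```r
# adist() lives in the utils package, which is attached by default
words <- c("fiat 1", "fiatt", "fait", "remault")
centroids <- c("fiat", "renault")

# one row per word, one column per centroid
d <- adist(words, centroids)
dimnames(d) <- list(words, centroids)
d
```

If swapped letters should count as one edit, a different distance would be needed; base adist does not offer that.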
Now we can pick, for each word, the centroid with the smallest distance:
cbind(db,dist) %>% # bind the data and the distances
  mutate(pal_calc = ifelse(fiat<renault,'fiat','renault')) %>% # choose the closer centroid
  select(-c(fiat,renault)) # remove the helper columns
        pal1  pal_ok pal_calc
1       fiat    fiat     fiat
2       fiat    fiat     fiat
3       fiat    fiat     fiat
4     fiat 1    fiat     fiat
5      fiatt    fiat     fiat
6       fait    fiat     fiat
7      fiaat    fiat     fiat
8    renault renault  renault
9    renault renault  renault
10   renault renault  renault
11  renaultt renault  renault
12 renault 3 renault  renault
13  renaultc renault  renault
14   remault renault  renault
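The ifelse above only works for exactly two brands. As a rough sketch of how the same idea might extend to any number of centroids, the base R version below picks, for each row of the distance matrix, the column with the smallest value. The frequency threshold of 2 and the names freq, centroids and d are assumptions made for illustration; they are not part of the original answer.

```r
# same data as in the question
db <- data.frame(
  pal1 = c("fiat", "fiat", "fiat", "fiat 1", "fiatt", "fait", "fiaat",
           "renault", "renault", "renault", "renaultt", "renault 3",
           "renaultc", "remault"),
  pal_ok = c(rep("fiat", 7), rep("renault", 7)),
  stringsAsFactors = FALSE
)

# assumed rule: every spelling seen at least twice is treated as a centroid
freq <- table(db$pal1)
centroids <- names(freq)[as.vector(freq) >= 2]

# distance matrix: one row per word, one column per centroid
d <- adist(db$pal1, centroids)

# for each row, take the centroid with the smallest distance
# (which.min returns the first one when there is a tie)
db$pal_calc <- centroids[apply(d, 1, which.min)]

# compare with the expected labels
all(db$pal_calc == db$pal_ok)
```

With real data the threshold would need tuning, and ties between centroids are resolved silently in favor of the first column, so it may be worth inspecting rows where the two smallest distances are equal.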