My dataset looks like this:
EstablishmentName Freq
bahria university 20
bahria university islamabad 12
arid agriculture 3
arid agriculture university 15
arid rawalpindi 9
college of e&me, nust 20
college of e & me (nust) 15
college of eme 30
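As shown above, bahria university and bahria university islamabad are almost identical, and the same is true of the other strings. I want to merge each group of near-duplicates into a single entry and sum their frequencies, like this:
Expected output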
EstablishmentName Freq
Bahria University 32
Arid Agriculture 27
College of EME 65
I tried the following solution, but it does not seem to work.
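library(SnowballC)
library(dplyr)

# wordStem() expects single words; each whole name is passed as one word here
df %>%
  mutate(word = wordStem(EstablishmentName)) %>%
  group_by(word) %>%  # group by the stemmed key, not the original name
  summarise(total = sum(Freq))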
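
One possible direction, offered as a sketch under assumptions rather than a definitive fix: stemming works on individual words, so it cannot collapse whole names into one. A hand-written lookup of canonical names, matched against a normalised key, can. The `normalise_name()` and `to_canonical()` helpers and the `canonical` table below are hypothetical, and the prefixes were picked by eye for this sample only.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  EstablishmentName = c(
    "bahria university", "bahria university islamabad",
    "arid agriculture", "arid agriculture university", "arid rawalpindi",
    "college of e&me, nust", "college of e & me (nust)", "college of eme"
  ),
  Freq = c(20, 12, 3, 15, 9, 20, 15, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case and drop everything except letters and digits,
# so "college of e & me (nust)" becomes "collegeofemenust"
normalise_name <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Hypothetical lookup (an assumption, not derived from the data):
# normalised prefix -> canonical name
canonical <- c(
  bahria       = "Bahria University",
  arid         = "Arid Agriculture",
  collegeofeme = "College of EME"
)

# Return the canonical name whose prefix starts the normalised key;
# names with no matching prefix are returned unchanged
to_canonical <- function(name) {
  key <- normalise_name(name)
  hit <- names(canonical)[startsWith(key, names(canonical))]
  if (length(hit) == 0) name else unname(canonical[hit[1]])
}

result <- df %>%
  mutate(Canonical = vapply(EstablishmentName, to_canonical,
                            character(1), USE.NAMES = FALSE)) %>%
  group_by(Canonical) %>%
  summarise(Freq = sum(Freq))

print(result)
```

On the sample data this gives three groups: Arid Agriculture (3 + 15 + 9 = 27), Bahria University (20 + 12 = 32) and College of EME (20 + 15 + 30 = 65). The lookup has to be maintained by hand. For a larger list, an approximate-matching approach such as edit-distance clustering may scale better, at the cost of occasional wrong merges.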