I have a vector of roughly 8,000 strings. Each element of the vector is a company name.
My goal
My goal is to group these company names into clusters so that each cluster contains a set of company names that are similar to one another (for example: ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL, etc. would all belong to the same group/cluster, because they are all Shell-based companies, i.e. they all contain the word "SHELL" in their name).
With a vector of this size, the clustering technique I have adopted seems to take forever to find the groups of similar company names. For smaller vectors, however, the approach works well.
Let me demonstrate my approach with an example vector of company names that is much smaller than the original; on a small vector of strings the method works very well.
The vector looks like this:
string=c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
"GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
My attempt
To tackle this problem, I adopted a hierarchical clustering approach.
# load packages
pacman::p_load(stringdist, dplyr, tm, gplots)
But first, some preparation:
# Function to clean strings
str_clean <- function(strings) {
    require(dplyr)
    require(tm)
    strings %>% tolower() %>% removePunctuation() %>% stripWhitespace() %>%
        trimws()   # base R trimws(): strip leading/trailing whitespace
}
# Clean company names
clean_names = str_clean(string)
n = length(clean_names)
Now compute the pairwise distances between the names, which will be used for clustering:
# Distance methods
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3) #size of q-gram
dist.methods <- list()
# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
    dist = matrix(NA, ncol = n, nrow = n)  # initialize empty matrix
    # row.names(dist) = prods
    for (i in 1:n) {
        for (j in 1:n) {
            dist[i, j] <- stringdist(clean_names[i], clean_names[j],
                                     method = methods[m], q = q[m])
        }
    }
    dist.methods[[m]] <- dist
}
Once the distance computation is done, I pick one of the methods and set an appropriate cut-off:
# hierarchical clustering, cutting the dendrogram at a height of 0.99
clusters <- hclust(as.dist(dist.methods[[3]]))
plot(clusters)
df = as.data.frame(cbind("Companies" = clean_names,
                         "Cluster" = cutree(clusters, h = 0.99)))
The resulting data frame assigns every company name to a cluster, just as I wanted.
df=df %>% group_by(Cluster)
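For illustration, the grouped data frame can then be inspected, e.g. to see how many names fall into each cluster or to list the members of a single cluster (a small dplyr sketch; count() and filter() are just two convenient options):
# number of company names per cluster, largest clusters first
df %>% count(Cluster, sort = TRUE)
# names assigned to one particular cluster, e.g. cluster 1
df %>% filter(Cluster == 1)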
However, as I mentioned, when I use the original vector of 8,000 company names, the distance computation takes far too long: each method requires a full 8,000 x 8,000 matrix, i.e. roughly 64 million calls to stringdist, and I cannot get past this step.
My question
Is there a way to fix this approach so that it also works with a much larger vector of strings?
Or is clustering simply not the right way to solve this problem for larger vectors? If so, what else could I do to achieve my goal?
Any help is much appreciated.
Answer 0 (score: 0)
Get rid of the two inner for loops, which are what is slowing you down, and use stringdistmatrix instead.
Your vector is long but the strings are short; see the benchmark at the bottom.
library(stringdist)
strings <- c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
"GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
stringsBig <- rep(strings, 500)
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3) #size of q-gram
dist.methods <- list()
# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
    dist.methods[[m]] <- stringdistmatrix(stringsBig, method = methods[[m]], q = q[[m]])
}
microbenchmark::microbenchmark(stringdistmatrix(strings),
    for (i in 1:length(strings)) {
        for (j in 1:length(strings)) {
            stringdist(strings[i], strings[j])
        }
    }, times = 100)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# stringdistmatrix(strings) 105.212 131.2805 241.9271 251.2235 279.634 2909.624 100 a
# for loop 36147.878 38165.8480 40411.9772 39527.5500 42170.895 54151.457 100 b
microbenchmark::microbenchmark(stringdistmatrix(stringsBig), times=10)
# Unit: seconds
# expr min lq mean median uq max neval
# stringdistmatrix(stringsBig) 1.5324 1.585354 1.66592 1.655901 1.691157 1.825333 10
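For completeness, a minimal sketch (not part of the benchmarks above) of how the precomputed matrix plugs back into the clustering step from the question: stringdistmatrix() called on a single vector already returns a dist object, so hclust() can take it directly without as.dist(); the cut height of 0.99 is simply carried over from the question.
# cluster the precomputed cosine (q = 3) distances and cut the tree as before
clusters <- hclust(dist.methods[[3]])
df <- data.frame(Companies = stringsBig,
                 Cluster   = cutree(clusters, h = 0.99))
head(df[order(df$Cluster), ])  # peek at the first few grouped names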