Applying text mining to large data sets

Time: 2018-01-13 08:24:16

Tags: r dplyr text-mining stringdist record-linkage

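The R script below calculates the percentage similarity between the text strings in the columns "names1" and "names2". However, I need to perform the same operation on columns with 6k-10k+ items. When the formula below is applied to columns that large, the solution hangs, because the number of line items reaches the millions, which is not practical for a business deliverable. In addition to the "Percent" column, I need to add another 6-7 columns, which will push the size of the solution beyond 1 GB. Please help me with an updated script or another possible solution that achieves the same goal. Thanks a lot.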

library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
names1 <- c("Adam Shaw","Justin Bose","Cydney Clide")
names2 <- c("Adam Shaw","Justin Bose","Cydney Clide")
names1 <- as.character(names1)
names2 <- as.character(names2)
Percent <- paste(round(unlist(lapply(1:length(names1), function(x) { 
levenshteinSim(names1[x], names2[-x])}))*100, 1), "%", sep="")

1 Answer:

Answer 0 (score: 1)

You may find vectorization useful:

#Create a large character Vector:
names1<-as.character(rep(iris$Species,10))

# Use Lapply
system.time(Percent <- paste(round(unlist(lapply(1:length(names1), function(x) { 
  levenshteinSim(names1[x], names1[-x])}))*100, 1), "%", sep=""))

#Create Vectorized Function
fun1<-function(names,x) {
  return(levenshteinSim(names[x],names[-x]))
}

vecFun1<-Vectorize(fun1,vectorize.args = "x")


#Execute Vectorized Function
system.time(percentVec<-vecFun1(names1,c(1:length(names1))))
percentVec<-paste(as.character(round(c(percentVec)*100,1)),"%",sep="")

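Here is the code being executed. The vectorized version takes less than 1/3 of the time (1.62 s versus 5.09 s elapsed):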

> names1<-as.character(rep(iris$Species,10))
> system.time(Percent <- paste(round(unlist(lapply(1:length(names1), function(x) { 
+   levenshteinSim(names1[x], names1[-x])}))*100, 1), "%", sep=""))
   user  system elapsed 
   5.07    0.02    5.09 
> 
> fun1<-function(names,x) {
+   return(levenshteinSim(names[x],names[-x]))
+ }
> 
> vecFun1<-Vectorize(fun1,vectorize.args = "x")
> 
> system.time(percentVec<-vecFun1(names1,c(1:length(names1))))
   user  system elapsed 
   1.62    0.00    1.62 

I also ran your example with the 3-element character vector. Both approaches return the same result:

> names2<-c("Adam Shaw","Justin Bose","Cydney Clide")
> names2 <- as.character(names2)
> system.time(Percent <- paste(round(unlist(lapply(1:length(names2), function(x) { 
+   levenshteinSim(names2[x], names2[-x])}))*100, 1), "%", sep=""))
   user  system elapsed 
      0       0       0 
> 
> fun1<-function(names,x) {
+   return(levenshteinSim(names[x],names[-x]))
+ }
> 
> vecFun1<-Vectorize(fun1,vectorize.args = "x")
> 
> system.time(percentVec<-vecFun1(names2,c(1:length(names2))))
   user  system elapsed 
      0       0       0 
> 
> percentVec<-paste(as.character(round(c(percentVec)*100,1)),"%",sep="")
> 
> Percent
[1] "9.1%"  "16.7%" "9.1%"  "16.7%" "16.7%" "16.7%"
> percentVec
[1] "9.1%"  "16.7%" "9.1%"  "16.7%" "16.7%" "16.7%"
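
As a possible alternative to the per-element calls above, all pairwise distances can be computed in a single call. The following is only a minimal sketch: it assumes that stringdistmatrix() from the stringdist package (already loaded in the question) with method = "lv" is an acceptable stand-in for levenshteinSim(), and it normalizes each distance by the length of the longer string, which is how levenshteinSim() defines similarity. The variable names d, len, sim and pct are illustrative and not part of the original post.

```r
library(stringdist)

names1 <- c("Adam Shaw", "Justin Bose", "Cydney Clide")

# Pairwise Levenshtein distances for all combinations, computed in one call
d <- stringdistmatrix(names1, names1, method = "lv")

# Normalize by the longer string of each pair (same definition as levenshteinSim)
len <- nchar(names1)
sim <- 1 - d / outer(len, len, pmax)

# Drop the self-comparisons on the diagonal. Column-major extraction then gives
# the same order as the lapply() version: element x against all other elements.
pct <- paste0(round(sim[row(sim) != col(sim)] * 100, 1), "%")
pct
```

Note that the full matrix has length(names1)^2 entries, so for 10,000 names it holds 10^8 doubles, roughly 800 MB. For inputs of that size it is probably necessary to process names1 in chunks (for example, a few hundred rows of the matrix at a time) and write each chunk out before computing the next.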