Perform clustering on a text column in a data frame

Time: 2016-03-04 12:05:26

Tags: r nlp cluster-analysis

I have a data frame (df) with two columns named "id" and "text":

id  text
1   TV
2   Tv
3   T.V
4   Radio/TV
5   Car
6   CAR
7   car 

I want to label rows in the "text" column that are of a similar type.

Expected output:

id  text     type
1   TV       tv
2   Tv       tv
3   T.V      tv
4   Radio/TV tv
5   Car      car
6   CAR      car
7   car      car

While researching I found the following. I understand the logic and the code runs, but I can't figure out how to use it to produce my expected output.

# Importing the library
library(tm)

# Importing the data
corpus.tmp<-Corpus(VectorSource(df$text))

#Cleaning up
corpus.tmp<- tm_map(corpus.tmp,removePunctuation)
corpus.tmp<- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp<- tm_map(corpus.tmp, removeWords, stopwords("english"))

# Document Matrix
TDM <- TermDocumentMatrix(corpus.tmp)
inspect(TDM)

tdm_tfxidf<-weightTfIdf(TDM)

# Converting to matrix
m<- as.matrix(tdm_tfxidf)
rownames(m)<- 1:nrow(m)

norm_eucl<- function(m)
  m/apply(m,1,function(x) sum(x^2)^.5)

m_norm<-norm_eucl(m)

# Performing K means clustering
results<-kmeans(m_norm,5,5)
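
For comparison with the k-means attempt above, here is a minimal base-R sketch of how a clustering step can yield one label per row. It swaps in a different technique (edit distance plus hierarchical clustering instead of tf-idf plus k-means). It assumes that only the part after a "/" matters (so "Radio/TV" counts as "tv") and that the number of groups, k = 2, is known in advance. Both are assumptions made for illustration, not part of the original post.

```r
# Sketch only: group the texts by edit distance after normalizing them.
df <- data.frame(id = 1:7,
                 text = c("TV", "Tv", "T.V", "Radio/TV", "Car", "CAR", "car "),
                 stringsAsFactors = FALSE)

# Normalize: lower-case and drop everything except letters, digits, and "/"
clean <- tolower(gsub("[^[:alnum:]/]", "", df$text))

# Assumption: only the last "/"-separated part matters ("radio/tv" -> "tv")
clean <- sub(".*/", "", clean)

# Pairwise Levenshtein distances between the normalized strings
d <- as.dist(adist(clean))

# Hierarchical clustering, cut into k = 2 groups (k is an assumption)
hc <- hclust(d, method = "average")
df$cluster <- cutree(hc, k = 2)

print(df)
```

The cluster column holds integer group ids rather than names such as "tv" or "car", so mapping ids to readable labels would still be a separate step.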

1 answer:

Answer 0 (score: 1)

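If the text column contains the strings car or tv in any letter case, possibly with special characters, you can remove the special characters and check whether each string contains tv or car:

## Your dataframe
df <- data.frame(id = seq(7),
                 text = c("tv","TV","T.v","Radio/TV","Car","car","CAR"))

## Remove special characters
df$text <- gsub("[[:punct:]]", "", df$text)

## Logicals for which df$text contain "tv" or "car"
tv <- grepl("tv", df$text, ignore.case = TRUE)
car <- grepl("car", df$text, ignore.case = TRUE)

## Create df$type column (NA by default) and assign values
df$type <- NA_character_
df$type[tv] <- "tv"
df$type[car] <- "car"

If you have more names to check, you could collect the last two steps into an sapply call. Note that this approach is not fail-safe: grepl matches substrings, so any text that merely contains one of these names inside a longer word would also be labeled.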
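
The sapply idea mentioned above might look like the following sketch. The keywords vector and the "first matching keyword wins" rule are assumptions made for illustration. In the step-by-step version, a later assignment would overwrite an earlier one instead.

```r
df <- data.frame(id = seq(7),
                 text = c("tv", "TV", "T.v", "Radio/TV", "Car", "car", "CAR"),
                 stringsAsFactors = FALSE)

# Hypothetical list of names to look for; extend as needed
keywords <- c("tv", "car")

# Remove special characters, keeping the original text column untouched
cleaned <- gsub("[[:punct:]]", "", df$text)

# One logical column per keyword: TRUE where the cleaned text contains it
hits <- sapply(keywords, function(k) grepl(k, cleaned, ignore.case = TRUE))

# For each row, take the first matching keyword, or NA if none matches
df$type <- apply(hits, 1, function(row) {
  if (any(row)) keywords[which(row)[1]] else NA_character_
})

print(df)
```

This still relies on substring matching, so it shares the limitation described above.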