I have a data frame (df) with two columns named "id" and "text":
id text
1 TV
2 Tv
3 T.V
4 Radio/TV
5 Car
6 CAR
7 car
I want to tag similar rows in the "text" column with a common label.
Expected output:
id text type
1 TV tv
2 Tv tv
3 T.V tv
4 Radio/TV tv
5 Car car
6 CAR car
7 car car
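While researching I found the following. I understand the logic and it runs, but I can't figure out how to get from it to what I want (the expected output):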
# Importing the library
library(tm)
# Importing the data
corpus.tmp<-Corpus(VectorSource(df$text))
#Cleaning up
corpus.tmp<- tm_map(corpus.tmp,removePunctuation)
corpus.tmp<- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp<- tm_map(corpus.tmp, removeWords, stopwords("english"))
# Document Matrix
TDM <- TermDocumentMatrix(corpus.tmp)
inspect(TDM)
tdm_tfxidf<-weightTfIdf(TDM)
# Converting to matrix
m<- as.matrix(tdm_tfxidf)
rownames(m)<- 1:nrow(m)
norm_eucl<- function(m)
m/apply(m,1,function(x) sum(x^2)^.5)
m_norm<-norm_eucl(m)
# Performing K means clustering
results<-kmeans(m_norm,5,5)
Answer 0 (score: 1)
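If the text column contains strings with car or tv in any letter case, possibly with special characters, you can remove the special characters and check whether the string contains tv or car:
## Your dataframe
df <- data.frame(id = seq(7), text = c("tv","TV","T.v","Radio/TV","Car","car","CAR"))
## Remove special characters
df$text <- gsub("[[:punct:]]", "", df$text)
## Logicals for which df$text contain "tv" or "car"
tv <- grepl("tv",df$text,ignore.case = TRUE)
car <- grepl("car",df$text,ignore.case = TRUE)
## Create df$type column (NA where nothing matches) and assign values
df$type <- NA_character_
df$type[tv] <- "tv"
df$type[car] <- "car"
If you have more names to check, you can collect the last two steps into an sapply call, but this approach is not failsafe, for example if a text contains more than one of the names.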
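The sapply idea mentioned above could look roughly like the following. This is only a minimal sketch, not necessarily what the answer had in mind: the names patterns, clean and hits are made up for illustration, and the rule "first matching name wins" is an assumption about how to resolve rows that match more than one name.

```r
## Your dataframe (same example data as above)
df <- data.frame(id = seq(7),
                 text = c("tv", "TV", "T.v", "Radio/TV", "Car", "car", "CAR"),
                 stringsAsFactors = FALSE)

## Names to look for (hypothetical list; extend as needed)
patterns <- c("tv", "car")

## Remove special characters into a helper vector, leaving df$text untouched
clean <- gsub("[[:punct:]]", "", df$text)

## One logical column per name: TRUE where the cleaned text contains it
hits <- sapply(patterns, function(p) grepl(p, clean, ignore.case = TRUE))

## Assumption: the first matching name wins; NA if no name matches
df$type <- apply(hits, 1, function(row) {
  if (any(row)) patterns[which(row)[1]] else NA_character_
})

print(df)
```

With the example data, rows 1 to 4 are labelled tv and rows 5 to 7 are labelled car. Because the match is a plain substring test, a text such as "carpet" would also be labelled car, so the list of names needs to be chosen with that in mind.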