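I want to extract a specific set of words from a text vector, as shown in the table below. The words xyz, xyz10, xyz+ and xyz 10 are product names that appear in the text column. If one of them is present, I want to extract it and assign it to the corresponding product_name column.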
text                         product_name
hi i want to purchase xyz    xyz
is xyz10 not available       xyz10
i have xyz+                  xyz+
what abour xyz 10            xyz 10
Here is the data for reference:
data <- data.frame(text = c("hi i want to purchase xyz", "is xyz10 not available",
                            "i have xyz+", "what abour xyz 10"),
                   stringsAsFactors = FALSE)
product <- c("xyz 10", "xyz", "xyz+", "xyz10")
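Edit: I created a data frame named product_list that contains all the product names, then used str_extract to match them against the Comment column of the main data frame and extract them. However, some product names that contain a space (for example, xyz 10) are not matched in full.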
library(stringr)

df_empty <- data.frame(Comment = c("hi i want to purchase xyz", "is xyz10 not available",
                                   "i have xyz+", "what abour xyz 10"),
                       stringsAsFactors = FALSE)
product_list <- data.frame(product_list = c("xyz 10", "xyz", "xyz+", "xyz10"),
                           stringsAsFactors = FALSE)
df_empty$Product <- ""

for (i in seq_len(nrow(df_empty))) {
  print(i)
  # Replace hyphens with spaces before matching
  df_empty$Comment[i] <- gsub("-", " ", df_empty$Comment[i])
  for (j in seq_len(nrow(product_list))) {
    # Literal, case-insensitive match of the j-th product name
    product_match <- str_extract(df_empty$Comment[i],
                                 fixed(product_list$product_list[j], ignore_case = TRUE))
    # Every later match overwrites an earlier one for the same row
    if (!is.na(product_match)) {
      df_empty$Product[i] <- toupper(product_match)
    }
  }
}
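
One possible way around the partial matches, as a rough sketch only: in the loop above, "xyz" is tried after "xyz 10", it also matches the last comment, and it overwrites the longer name. Assuming the product names should be matched literally and the longest name should win, the names can be sorted by length and combined into a single alternation pattern. The sketch uses base R instead of stringr, and escape_regex is a hypothetical helper written for this example, not a function from any package.

```r
# Sample data copied from the question
df_empty <- data.frame(
  Comment = c("hi i want to purchase xyz", "is xyz10 not available",
              "i have xyz+", "what abour xyz 10"),
  stringsAsFactors = FALSE
)
products <- c("xyz 10", "xyz", "xyz+", "xyz10")

# Hypothetical helper (not part of any package): backslash-escape regex
# metacharacters so that names such as "xyz+" are matched literally
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Longest names first, so "xyz 10" and "xyz10" are tried before "xyz"
sorted_products <- products[order(nchar(products), decreasing = TRUE)]
pattern <- paste(escape_regex(sorted_products), collapse = "|")

# Same hyphen-to-space normalisation as in the loop above
comments <- gsub("-", " ", df_empty$Comment, fixed = TRUE)

# regexpr() returns -1 where nothing matched; regmatches() returns the
# matched substrings for the remaining elements only
hits <- regexpr(pattern, comments, ignore.case = TRUE, perl = TRUE)
df_empty$Product <- NA_character_
df_empty$Product[hits != -1] <- regmatches(comments, hits)

print(df_empty)
```

With the sample data, the Product column comes out as xyz, xyz10, xyz+ and xyz 10, and rows without any product name stay NA. The same pattern should also work with str_extract if it is wrapped in regex(pattern, ignore_case = TRUE) rather than fixed(), since fixed() would treat the "|" separators as literal characters.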