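Similar to this case, I want to count how many times several words and numbers occur in a vector of sentences, using str_count from the stringr package.
However, I noticed that it counts not only whole numbers but also partial numbers. For example: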
df <- c("honda civic 1988 with new lights","toyota auris 4x4 140000 km","nissan skyline 2.0 159000 km")
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
library(stringr)
number_of_keywords_df <- str_count(df, paste(keywords, collapse='|'))
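Here I get number_of_keywords_df as the vector 3, 3, 3, but it should clearly be 3, 2, 2. str_count seems to count the partial strings "1400" and "159" inside the numbers "140000" and "159000". Is there a way to prevent this?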
Answer 0: (score: 1)
Try adding word boundaries around the keywords:
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
keywords <- paste0("\\b", keywords, "\\b")
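In regex terms, \bhonda\b means "match honda as a standalone word". So hondas would not match, because it has an extra letter at the end. For the same reason, \b1400\b does not match inside 140000. The modified keywords can then be collapsed with '|' and passed to str_count as before.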
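To see the effect of the boundaries in isolation, here is a small sketch. The test strings in x are made up for illustration and are not from the question; only str_count from stringr is used.

```r
library(stringr)

# Hypothetical test strings, chosen to show boundary behaviour
x <- c("honda", "hondas", "a honda civic", "140000", "1400 km")

# \b requires a word/non-word transition on each side of the keyword
honda_counts <- str_count(x, "\\bhonda\\b")
number_counts <- str_count(x, "\\b1400\\b")

honda_counts   # "hondas" is not counted: the trailing "s" is a word character
number_counts  # "140000" is not counted: "1400" is followed by another digit
```

Digits count as word characters, which is why the boundary also blocks partial matches inside longer numbers.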
Answer 1: (score: 1)
You can add the word boundaries with sprintf:
number_of_keywords_df <- str_count(df, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))
number_of_keywords_df
which yields
[1] 3 2 2
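As a possible variant (not from either answer), the boundaries can be written once around a non-capturing group instead of around every keyword. This is a sketch that assumes the keywords contain no regex metacharacters; the name pattern is introduced here for illustration.

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# One pair of word boundaries wrapped around the whole alternation
pattern <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")

number_of_keywords_df <- str_count(df, pattern)
number_of_keywords_df
```

Both forms should give the same counts for this data; the grouped form just keeps the pattern shorter when the keyword list is long.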