Question

I'm doing sentiment analysis, but I need to filter the words in each tweet by their number of characters (nchar). What I mean is:

df <- c("the most beauty", "the most ugly", "you are beauty")
library(dplyr)
df %>%
  filter((n char > 3) %in% df)  # pseudocode: keep only words longer than 3 characters

The result I expect is: "most beauty", "ugly", "beauty"

I tried using $str_detect, but it didn't work.

Answer 1

We can use a regular expression to match words of 1 to 3 characters and replace them with an empty string (""):

gsub("\\s*\\b[^ ]{1,3}\\b\\s*", "", df)
#[1] "most beauty" "most ugly"   "beauty"

Note: 'df' is a vector, not a data.frame/tbl_df, so the tidyverse approach from the question will not work on it.
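As an alternative to the regular expression, the same length-based filter can be sketched in base R by splitting each string into words and keeping those whose nchar() exceeds a threshold. The helper name keep_long_words and its min_chars parameter are hypothetical, introduced only for this illustration, and the sketch assumes words are separated by spaces.

```r
df <- c("the most beauty", "the most ugly", "you are beauty")

# Hypothetical helper (not part of the original answer):
# keep only the words with more than `min_chars` characters.
keep_long_words <- function(x, min_chars = 3) {
  # Split each string on one or more spaces (assumes space-separated words)
  words <- strsplit(x, " +")
  # Filter each word vector by length and paste the survivors back together
  vapply(
    words,
    function(w) paste(w[nchar(w) > min_chars], collapse = " "),
    character(1)
  )
}

result <- keep_long_words(df)
print(result)
```

Unlike the gsub() call, this version makes the threshold an explicit argument, which may be easier to adjust than editing the {1,3} quantifier.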

Answer 2

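For sentiment analysis, filtering by a fixed nchar() threshold may be somewhat crude. I suggest using the tidytext library, which lets you tokenize meaningful units of text (such as words) into a tidy data structure.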

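In your case, you can turn each word into a token and reshape the data frame so that each token (word) sits on its own row. You can then easily filter out articles and other irrelevant words. For example: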

library(dplyr)
library(tidytext)

df <- c("the most beauty", "the most ugly", "you are beauty")
text_df <- tibble(line = 1:3, text = df)  # tibble() replaces the deprecated data_frame()
text_df %>%
   unnest_tokens(word, text)

# A tibble: 9 x 2
   line word  
  <int> <chr> 
1     1 the   
2     1 most  
3     1 beauty
4     2 the   
5     2 most  
6     2 ugly  
7     3 you   
8     3 are   
9     3 beauty

Then simply filter the tokens against a vector of unwanted words.

remove_words <- c("the", "a", "you", "are")
text_df %>%
  unnest_tokens(word, text) %>%
  filter(!(word %in% remove_words))

# A tibble: 5 x 2
   line word  
  <int> <chr> 
1     1 most  
2     1 beauty
3     2 most  
4     2 ugly  
5     3 beauty
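The token table also supports the length-based filter asked about in the question. The following is a minimal sketch, assuming the threshold "more than 3 characters" from the question; the column name text in the summary and the variable name filtered are choices made for this illustration.

```r
library(dplyr)
library(tidytext)

df <- c("the most beauty", "the most ugly", "you are beauty")
text_df <- tibble(line = 1:3, text = df)

filtered <- text_df %>%
  unnest_tokens(word, text) %>%         # one row per word
  filter(nchar(word) > 3) %>%           # keep words longer than 3 characters
  group_by(line) %>%
  summarise(text = paste(word, collapse = " "))  # rebuild one string per tweet

print(filtered)
```

Note that a tweet whose words are all 3 characters or shorter would drop out of the result entirely, since no rows remain for its line.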

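With tokenization, you can easily compute a sentiment score for each tweet by aggregating the sentiment scores of all the words in it. An example can be found here: https://www.tidytextmining.com/sentiment.html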
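The aggregation step might look roughly like the sketch below. The lexicon toy_lexicon and its scores are invented purely for illustration and are not a real sentiment lexicon; a real analysis would join against a published lexicon, for instance one returned by tidytext's get_sentiments().

```r
library(dplyr)
library(tidytext)

df <- c("the most beauty", "the most ugly", "you are beauty")
text_df <- tibble(line = 1:3, text = df)

# Hypothetical toy lexicon, for illustration only
toy_lexicon <- tibble(word = c("beauty", "ugly"), score = c(1, -1))

tweet_scores <- text_df %>%
  unnest_tokens(word, text) %>%             # one row per word
  inner_join(toy_lexicon, by = "word") %>%  # keep only words with a score
  group_by(line) %>%
  summarise(sentiment = sum(score))         # total score per tweet

print(tweet_scores)
```

Because inner_join() drops words that are missing from the lexicon, articles and other neutral words fall away without a separate filtering step.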

Filter by n char in values in R
