Question

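I want to use the function str_count to count how often certain strings occur in a column. It works fine for rows that contain only valid values. However, for any group that contains an NA I get NA as the result, and my column contains a lot of NAs.

I have tried to do this inside tidyverse's summarize function, using sum, the %in% operator, and plain comparison. So far, sum combined with str_count has given me the most promising results.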

library(tidyverse)

# Reproducible data frame similar to the one I am working on
# This should resemble long data for two participants, that each have two 
# codes in a column
test <- data.frame(name = c("A1", "A1", "B1", "B1"), code_2 = c("SF08", "SF03", "SF03", NA))

# Here is my analysis that counts the number of matches of a code
analysis <- test %>% 
  group_by(name) %>% 
  summarize(
       total_sf2 = sum(stringr::str_count(code_2, "SF"))
       )

I would expect participant A1 to have two matches (which I get) and participant B1 to have one match instead of the result NA.

Answer 1

Just add na.rm = TRUE to your sum call:

test %>% 
   group_by(name) %>% 
   summarize(
     total_sf2 = sum(stringr::str_count(code_2, "SF"), na.rm=TRUE)
   )

# A tibble: 2 x 2
#  name  total_sf2
#  <fct>     <int>
#1 A1            2
#2 B1            1
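
To see where the NA comes from, here is a minimal sketch using only the two codes of participant B1. str_count returns NA for a missing input, and sum propagates that NA unless na.rm = TRUE is set. The variable names codes and counts are just for illustration.

```r
library(stringr)

# Codes of participant B1 from the example data
codes <- c("SF03", NA)

# str_count() returns NA for a missing input
counts <- str_count(codes, "SF")
counts                       # 1 NA

# sum() returns NA as soon as one element is NA ...
sum(counts)                  # NA

# ... unless missing values are dropped first
sum(counts, na.rm = TRUE)    # 1
```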

Answer 2

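In base R, you can use regexpr inside aggregate. The formula interface of aggregate drops rows containing NA by default (na.action = na.omit), so the result is not affected by the NA. Because regexpr returns the position of the first match, or -1 when there is none, compare the result with 0 before summing:

aggregate(code_2 ~ name, test, function(x) sum(regexpr("SF", x) > 0))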
#   name code_2
# 1   A1      2
# 2   B1      1
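
A caveat worth spelling out: regexpr returns match positions, not counts, so summing its raw output only equals the number of matching rows when every match starts at position 1, as it happens to in the example data. A small sketch with made-up codes (they are not from the question) shows the difference:

```r
# Hypothetical codes: the second matches at position 3, the third not at all
codes <- c("SF08", "XXSF03", "AB01")

# regexpr() returns the start position of the first match, or -1 for no match
positions <- as.integer(regexpr("SF", codes))
positions             # 1 3 -1

# Summing the raw positions is not a count of matching rows
sum(positions)        # 3

# Comparing against 0 first gives the number of rows with a match
sum(positions > 0)    # 2
```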

Answer 3

An option using grepl and data.table:

library(data.table)
setDT(test)[, .(code_2 = sum(grepl("SF", code_2))), name]
#   name code_2
#1:   A1      2
#2:   B1      1
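
Note that grepl and str_count answer slightly different questions: grepl tells you whether a value contains the pattern at all (and returns FALSE for NA), while str_count counts every occurrence. For the example data the two agree, but they would differ if one cell held several codes. A rough sketch with a made-up vector (the first element is hypothetical, not from the question):

```r
library(stringr)

# Hypothetical data: the first cell contains two codes
codes <- c("SF08 SF03", "SF03", NA)

# grepl() counts cells with at least one match; NA counts as FALSE
sum(grepl("SF", codes))                      # 2

# str_count() counts every occurrence; NA has to be dropped explicitly
sum(str_count(codes, "SF"), na.rm = TRUE)    # 3
```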

str_count: problems with NA and multiple occurrences of similar words
