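I am trying to use R to get a grouped count of a string in a data frame, but so far I have not been able to come up with a solution. Below are some sample data and the code I tried, to give you a rough idea of what I am trying to accomplish, with further explanation after that: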
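So I first group the data by season, and then try to count the total number of times the word "Homer" appears in the title of any episode in a given season.
Any advice on where I am going wrong would be greatly appreciated.
Best, Curtis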
Answer 0 (score: 4)
To add a new variable to each row, you need the mutate function. You do not need group_by unless you want to aggregate by group:
simpson %>%
mutate(homer_count = str_count(episode_title, 'Homer'))
# A tibble: 100 x 5
season episode_title imdb_votes us_viewers_in_millions homer_count
<int> <chr> <int> <dbl> <int>
1 1 Simpsons Roasting on an Open Fire 3734 26.7 0
2 1 Bart the Genius 1973 24.5 0
3 1 Homer's Odyssey 1709 27.5 1
4 1 There's No Disgrace Like Home 1701 20.2 0
5 1 Bart the General 1732 27.1 0
6 1 Moaning Lisa 1674 27.4 0
7 1 The Call of the Simpsons 1638 27.6 0
8 1 The Telltale Head 1580 28 0
9 1 Life on the Fast Lane 1578 33.5 0
10 1 Homer's Night Out 1511 30.3 1
# ... with 90 more rows
If you want to count how many times Homer is used in each season, call group_by and then use summarize, which produces a new variable with one row per group:
simpson %>%
group_by(season) %>%
summarize(homer_count = sum(str_count(episode_title, 'Homer')))
# A tibble: 5 x 2
season homer_count
<int> <int>
1 1 2
2 2 2
3 3 4
4 4 2
5 5 7
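As a self-contained illustration of the difference between counting occurrences and counting titles, here is a minimal sketch. The `toy` data frame and the column names `homer_mentions` and `homer_episodes` are made up for this example and are not part of the asker's `simpson` data. It assumes dplyr and stringr are installed.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the asker's `simpson` data frame.
toy <- tibble(
  season = c(1L, 1L, 2L, 2L),
  episode_title = c("Homer's Odyssey", "Bart the Genius",
                    "Homer vs. Homer", "Lisa's Substitute")
)

result <- toy %>%
  group_by(season) %>%
  summarize(
    # Total occurrences of "Homer" across all titles in the season.
    homer_mentions = sum(str_count(episode_title, "Homer")),
    # Number of titles that contain "Homer" at least once.
    homer_episodes = sum(str_detect(episode_title, "Homer"))
  )

print(result)
```

str_count adds up every match, so a title that contains the word twice contributes 2, whereas str_detect contributes at most 1 per title. Which one is appropriate depends on whether "total times the word appears" or "number of episodes mentioning it" is wanted.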
Answer 1 (score: 2)
library(dplyr)
library(stringr) # str_count() comes from stringr, not dplyr
simpson %>%
mutate(counts = str_count(episode_title, "Homer")) %>% # count matches for each row (vectorised function)
group_by(season) %>% # for each season
summarise(sum_counts = sum(counts)) # sum counts
# # A tibble: 5 x 2
# season sum_counts
# <int> <int>
# 1 1 2
# 2 2 2
# 3 3 4
# 4 4 2
# 5 5 7