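I'd like to know how to do the equivalent of "group by" and "count" in the tidyverse. I've looked at a lot of posts but haven't found quite what I'm after; if this has already been answered, I'd appreciate a link.
For example, I'm looking for outliers in my data; I want to know which places received the most "bad" measures: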
place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=T)
df = data.frame(place, measure, rating)
> df
place measure rating
1 AL meas1 good
2 AK meas1 good
3 AZ meas1 good
4 AR meas1 bad
5 CA meas1 bad
6 CO meas1 bad
7 CT meas1 bad
8 DE meas1 good
9 FL meas1 good
10 GA meas1 good
....(etc).....
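I'd like to know how to do this with the tidyverse. This sqldf approach gives me what I want: it tells me which places have the most "bad" ratings and ranks the places by their number of bad ratings:
library(sqldf)
sqldf("SELECT place, rating, COUNT(*) AS Count FROM df GROUP BY place, rating ORDER BY rating, count DESC")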
place rating Count
1 CA bad 3
2 AK bad 2
3 AR bad 1
4 CO bad 1
5 CT bad 1
6 DE bad 1
7 FL bad 1
8 GA bad 1
9 AL good 4
10 AZ good 4
11 HI good 4
....(etc)....
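Is there a way to get a similar result in the tidyverse?
Answer 0 (score: 1)
For an introduction to these basic operations in the tidyverse, I'd recommend first reading the excellent R for Data Science by Wickham and Grolemund: http://r4ds.had.co.nz/
You can do the following in an easy-to-read way with the dplyr and magrittr packages: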
# Load the tidyverse (install it first with install.packages("tidyverse") if needed)
library(tidyverse)
# Create data
place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=T)
df = data.frame(place, measure, rating)
# Do some analysis
df %>%
  group_by(place) %>%
  summarise(mean_score = mean(rating == "good"), n = n()) %>%
  arrange(desc(mean_score))
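Here, we "group by" place, "then" "summarise" each group by the proportion of "good" ratings it received (creating a new variable, mean_score), "then" "arrange" the output in descending order of mean_score.
We also create a new variable n inside summarise that counts the number of ratings each mean is based on (i.e. if we see a place with only 2 ratings, we know the mean may not be representative: see http://www.evanmiller.org/how-not-to-sort-by-average-rating.html for a fuller discussion of this problem).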
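If what you want is the same table the sqldf query produces (one row per place and rating, with a count), the following is a minimal sketch of one way to get it with dplyr. It assumes the example data from the question; the names rating_counts and bad_counts are made up here for illustration, and the exact counts depend on your R version's random number generator.

```r
library(dplyr)

# Recreate the example data from the question
place   <- rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times = 4)
measure <- rep(c('meas1','meas2','meas3','meas4'), each = 11)
set.seed(200)
rating  <- sample(c('good','bad'), size = 44, prob = c(2, 1), replace = TRUE)
df <- data.frame(place, measure, rating)

# Count rows for each place/rating pair (GROUP BY place, rating + COUNT(*)),
# then sort by rating and by descending count (ORDER BY rating, count DESC)
rating_counts <- df %>%
  count(place, rating, name = "Count") %>%
  arrange(rating, desc(Count))

# Hypothetical follow-up: keep only the "bad" rows to rank places by bad ratings
bad_counts <- rating_counts %>%
  filter(rating == "bad")

print(rating_counts)
print(bad_counts)
```

count(place, rating) is roughly shorthand for group_by(place, rating) %>% summarise(n = n()), so either form should work; the name argument only renames the count column to match the SQL alias.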