Get the standard deviation of groups in a data frame

Time: 2015-08-12 22:52:37

Tags: r dataframe dplyr

I have a data frame in the following format:

user <- c(1,1,2,2,2,2,3,3,3)
answer_num <- c(1,2,3,3,4,4,5,5,6)
df <- data.frame(user,answer_num)

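I'm trying to collect statistics on the number of instances of each answer within each user. For example, I can get the average number of instances per answer like this: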

library(dplyr)
df %>% group_by(user) %>% summarise(inst_per_answer = n()/length(unique(answer_num)))

which gives me:

  user inst_per_answer
1    1            1.0
2    2            2.0
3    3            1.5

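How can I get the standard deviation of the number of instances per answer?

Clarification
I'm looking for the standard deviation of the number of instances per answer. For example, user 1 has 1 instance of answer 1 and 1 instance of answer 2, so the standard deviation is 0: sd(c(1,1)). User 3 has 2 instances of answer 5 and 1 instance of answer 6, for an sd of about 0.7: sd(c(2,1)).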
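
The arithmetic for user 3 can be checked by hand. The following is a minimal base R sketch, assuming the sample standard deviation (n - 1 denominator) that sd() uses; the variable names answers_user3, counts and manual_sd are made up for illustration.

```r
# Answers given by user 3 in the example data
answers_user3 <- c(5, 5, 6)

# Number of instances of each distinct answer: 2 for answer 5, 1 for answer 6
counts <- as.vector(table(answers_user3))

# Sample standard deviation written out: sqrt(sum((x - mean(x))^2) / (n - 1))
manual_sd <- sqrt(sum((counts - mean(counts))^2) / (length(counts) - 1))

# Should agree with the built-in sd(), roughly 0.707
manual_sd
sd(counts)
```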

2 answers:

Answer 0 (score: 3)

Try this

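df %>%
  count(user, answer_num) %>%   # one row per (user, answer) pair with its count n
  group_by(user) %>%            # current dplyr returns count() ungrouped, so regroup
  summarise(sd_per_user = sd(n))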

# Source: local data frame [3 x 2]
# 
#   user sd_per_user
# 1    1   0.0000000
# 2    2   0.0000000
# 3    3   0.7071068

Or a shorter data.table version (using @Thelas base R idea)

library(data.table)
setDT(df)[, .(sd_per_user = sd(table(answer_num))), by = user]
#    user sd_per_user
# 1:    1   0.0000000
# 2:    2   0.0000000
# 3:    3   0.7071068
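
The base R idea mentioned above, sd(table(answer_num)) applied per user, can also be run without any packages. This is only a sketch of one way to do it with tapply(), not necessarily what the commenter had in mind; the name sd_per_user is chosen here for illustration.

```r
# Example data from the question
user <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
answer_num <- c(1, 2, 3, 3, 4, 4, 5, 5, 6)
df <- data.frame(user, answer_num)

# For each user, tabulate how often each answer occurs,
# then take the standard deviation of those counts
sd_per_user <- tapply(df$answer_num, df$user, function(x) sd(table(x)))
sd_per_user
```

The result is a named array indexed by user rather than a data frame, so wrap it in data.frame() if a tabular result is needed.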


Answer 1 (score: 1)

For those interested in sqldf, there are two options:

With the default RSQLite backend, using STDEV:

library(sqldf)
sqldf("SELECT user, STDEV(n) AS sd
      FROM (SELECT user, answer_num, count(answer_num) AS n 
      FROM df GROUP BY user,answer_num) 
      GROUP BY user")

With RH2, using STDDEV or STDDEV_SAMP:

library(RH2)
sqldf("SELECT user, STDDEV(n) AS sd
      FROM (SELECT user, answer_num, COUNT(answer_num) AS n 
            FROM df GROUP BY user,answer_num) 
      GROUP BY user")

Output:

  user        sd
1    1 0.0000000
2    2 0.0000000
3    3 0.7071068