在R中我有2个数据集:group1
和group2
。
对于group 1
我有10 game_id
这是游戏的ID,我们有number
这是group1
这个游戏玩过的次数。
所以,如果我们输入
group1
我们得到这个输出
game_id number
1 758565
2 235289
...
10 87084
group2
我们得到
game_id number
1 79310
2 28564
...
10 9048
如果我想测试前{2} group1
group2
和game_id
之间是否存在统计差异,我可以使用Pearson卡方检验。
在R中我只创建矩阵
# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Creating it on matrix-form
m <- rbind(a,b)
所以m
给了我们
a 758565 235289
b 79310 28564
在这里,我可以测试H:&#34; a独立于b&#34;,意味着group1
中的用户与game_id
相比,group2
的用户数超过2。
在R中,我们输入chisq.test(m)
,我们得到一个非常低的p值意味着我们可以拒绝H,这意味着a和b不是独立的。
如何找到game_id
中group1
比group2
中显着更多的Text
?
答案 0 :(得分:2)
我创建了一个只有3个游戏的简单版本。我正在使用卡方检验和比例检验。就个人而言,我更喜欢第二个,因为它可以让您了解您所比较的百分比。运行脚本并确保您了解该过程。
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084))
dt_group1
# game_id number_games
# 1 1 758565
# 2 2 235289
# 3 3 87084
# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games) # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games # just to get an idea about the percentages
dt_group1
# game_id number_games number_rest_games number_all_games Prc
# 1 1 758565 322373 1080938 0.70176550
# 2 2 235289 845649 1080938 0.21767113
# 3 3 87084 993854 1080938 0.08056336
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048))
# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games
# input the game id you want to investigate
input_game_id = 1
# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group1$number_rest_games[dt_group1$game_id==input_game_id]),
c(dt_group2$number_games[dt_group2$game_id==input_game_id], dt_group2$number_rest_games[dt_group2$game_id==input_game_id]))
# perform chi sq test
chisq.test(dt_test)
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group2$number_games[dt_group2$game_id==input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id==input_game_id], dt_group2$number_all_games[dt_group2$game_id==input_game_id])
# perform test of proportions
prop.test(x,y)
# 2-sample test for equality of proportions with continuity correction
#
# data: x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
# 0.02063233 0.02626776
# sample estimates:
# prop 1 prop 2
# 0.7017655 0.6783155
主要的是chisq.test
是一个比较计数/比例的测试,因此您需要为您比较的组提供“成功”和“失败”的数量(列联表作为输入)。 prop.test
是另一个计数/比例测试命令,您需要提供“成功”和“总计”的数量。
现在您对结果感到满意,并且您看到了该过程的工作方式,我将添加一种更有效的方法来执行这些测试。
第一个使用dplyr
和broom
个包:
library(dplyr)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084),
group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048),
group_id = 2) ## adding the id of the group
# combine datasets
dt = rbind(dt_group1, dt_group2)
dt %>%
group_by(group_id) %>% # for each group id
mutate(number_all_games = sum(number_games), # create new columns
number_rest_games = number_all_games - number_games,
Prc = number_games / number_all_games) %>%
group_by(game_id) %>% # for each game
do(tidy(prop.test(.$number_games, .$number_all_games))) %>% # perform the test
ungroup()
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
另一个正在使用data.table
和broom
个包:
library(data.table)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084),
group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048),
group_id = 2) ## adding the id of the group
# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))
# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]
dt[, `:=`(number_rest_games = number_all_games - number_games,
Prc = number_games / number_all_games) , by=group_id]
# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# 1: 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2: 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3: 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
您可以看到每行代表一个游戏,比较在第1组和第2组之间。您可以从相应的列中获取p值,但也可以获得测试/比较的其他信息。