ANOVA test

Question

I have a technical question about the structure of my df. It looks like this:

    Month District   Age Gender Education Disability Religion                          Occupation JobSeekers GMI
1 2020-01      Dan   U17   Male      None       None   Jewish              Unprofessional workers          2   0
2 2020-01      Dan   U17   Male      None       None  Muslims          Sales and costumer service          1   0
3 2020-01      Dan   U17 Female      None       None    Other                           Undefined          1   0
4 2020-01      Dan 18-24   Male      None       None   Jewish         Production and construction          1   0
5 2020-01      Dan 18-24   Male      None       None   Jewish                     Academic degree          1   0
6 2020-01      Dan 18-24   Male      None       None   Jewish Practical engineers and technicians          1   0
  ACU NACU NewSeekers NewFiredSeekers
1   0    2          0               0
2   0    1          0               0
3   0    1          0               0
4   0    1          0               0
5   0    1          0               0
6   0    1          1               1

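I am looking for a way to run a chi-square test of independence on two variables (for example District and JobSeekers), so that I can tell whether the North district is more strongly associated with job seekers than the South district. As far as I can tell, something is off with the data structure: District is a character, and JobSeekers is an integer giving the number of job seekers for each combination of district, gender, occupation and so on. I tried subsetting the data down to district and job seekers like this: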

  Month   District  JobSeekers   GMI   ACU  NACU NewSeekers NewFiredSeekers
  <chr>   <chr>          <int> <int> <int> <int>      <int>           <int>
1 2020-01 Dan            33071  4694  9548 18829       6551            4682
2 2020-01 Jerusalem      21973  7665  3395 10913       3589            2260
3 2020-01 North          47589 22917  4318 20354       6154            3845
4 2020-01 Sharon         25403  6925  4633 13845       4131            2727
5 2020-01 South          37089 18874  2810 15405       4469            2342
6 2020-02 Dan            32660  4554  9615 18491       5529            3689

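But that makes it even harder to work with. I will of course accept any other test that is appropriate here.

Please help, and let me know if more information is needed,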

Moshe

Update

# t test for district vs new seekers

# sorting

dist.newseek <- Cdata %>% 
  group_by(Month,District) %>% 
  summarise(NewSeekers=sum(NewSeekers))

# performing a t test on the mini table we created

t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))

# results

Welch Two Sample t-test

data:  NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -119952.3  200737.3
sample estimates:
  mean in group Dan mean in group South 
74608.25            34215.75 

#wilcoxon test 

# filtering Cdata to New seekers based on month and age

age.newseek <- Cdata %>% 
  group_by(Month,Age) %>% 
  summarise(NewSeekers=sum(NewSeekers))

#performing a wilcoxon test on the subset 

wilcox.test(NewSeekers ~ Age,data=subset(age.newseek,Age %in% c("25-34","45-54")))

# Results

Wilcoxon rank sum exact test

data:  NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0

ANOVA test

# Sorting occupation and month by new seekers

occu.newseek <- Cdata %>% 
  group_by(Month,Occupation) %>% 
  summarise(NewSeekers=sum(NewSeekers))

## Convert Occupation to a factor

occu.newseek$Occupation <- as.factor(occu.newseek$Occupation)

## Get the occupation group means and standard deviations

group.mean.sd <- aggregate(
  x = occu.newseek$NewSeekers, # Specify data column
  by = list(occu.newseek$Occupation), # Specify group indicator
  FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)

## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)

## Run the Tukey Test to compare the groups 
TukeyHSD(anova_one_way)

## Check the mean differences across the groups 

library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()

Plot

Answer 1

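Since JobSeekers is continuous, you cannot run a chi-square test on it. If you want to know whether the North and South districts differ, you can use a Wilcoxon test or a t-test. Which one depends on your data: the Wilcoxon test is rank-based and does not require your data to be normally distributed.

Suppose you have already tallied the number of job seekers per district and per month: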

df = data.frame(Month=rep(c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06"),3),
District=rep(c("Dan","North","South"),each=6),JobSeekers=rpois(18,20))

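The t.test call looks like this. However, if your samples are paired, for example you have 12 monthly values for North and 12 corresponding values for South, you need to set paired = TRUE: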

t.test(JobSeekers ~ District,data=subset(df,District %in% c("North","South")))

    Welch Two Sample t-test

data:  JobSeekers by District
t = 0.27455, df = 9.9435, p-value = 0.7893
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.560951  4.560951
sample estimates:
mean in group North mean in group South 
               21.5                21.0
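For the paired case mentioned above, here is a minimal sketch, assuming each district has exactly one value per month. The seed and the names `north`, `south` and `paired_res` are only illustrative. The two-vector form of t.test is used because, as far as I know, recent R releases no longer accept `paired` in the formula interface.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

months <- c("2020-01", "2020-02", "2020-03", "2020-04", "2020-05", "2020-06")
df <- data.frame(
  Month = rep(months, 3),
  District = rep(c("Dan", "North", "South"), each = 6),
  JobSeekers = rpois(18, 20)
)

# Pull out the two districts and sort both by month so the pairs line up
north <- df[df$District == "North", ]
south <- df[df$District == "South", ]
north <- north[order(north$Month), ]
south <- south[order(south$Month), ]
stopifnot(identical(north$Month, south$Month))

# Paired t-test on the month-by-month differences
paired_res <- t.test(north$JobSeekers, south$JobSeekers, paired = TRUE)
paired_res
```

With six months per district the paired test has 5 degrees of freedom, because it works on the six within-month differences rather than on the two groups separately.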

If you are not sure whether your samples are normally distributed, use the Wilcoxon test:

wilcox.test(JobSeekers ~ District,data=subset(df,District %in% c("North","South")))

    Wilcoxon rank sum test with continuity correction

data:  JobSeekers by District
W = 19.5, p-value = 0.8721
alternative hypothesis: true location shift is not equal to 0
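One possible way to help choose between the two tests is to run a Shapiro-Wilk normality check on each group. This is only a sketch on simulated data of the same shape as above. The name `normality_p` and the seed are assumptions, and with so few observations per group the check has little power, so treat the result as a rough guide.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

df <- data.frame(
  Month = rep(c("2020-01", "2020-02", "2020-03", "2020-04", "2020-05", "2020-06"), 3),
  District = rep(c("Dan", "North", "South"), each = 6),
  JobSeekers = rpois(18, 20)
)

# Keep only the two districts being compared
two <- subset(df, District %in% c("North", "South"))

# Shapiro-Wilk p-value per district; small values suggest non-normality
normality_p <- tapply(two$JobSeekers, two$District,
                      function(x) shapiro.test(x)$p.value)
normality_p
```

If either p-value is small, the rank-based wilcox.test is the safer choice.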

Answer 2

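You can use an ANOVA test to compare multiple groups. If the omnibus ANOVA test returns a statistically significant result, you can then check which districts differ from each other, and in which direction.

You can also consult the UCLA website, which shows which test to use for which kind of data.

As a simple example, let me show how to run an ANOVA test.

Here is your data: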

head(df)

r$> head(df)
    Month  District   Age Gender Education Disability Religion                          Occupation JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
1 2020-01       Dan 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
2 2020-01     North 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
3 2020-01     North 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
4 2020-01     South 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
5 2020-01       Dan 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
6 2020-01 Jerusalem 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1

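Because I needed more data points to run the test, I replicated your data by bootstrapping. I also increased the number of job seekers in the South and North districts. You do not need to do these steps on your own data, but this is how I did it.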

# For the sake of this example, I increased the number of observation by bootstrapping the example data

# Grow the data: 20 times, prepend 5 rows sampled from the current first six rows
for (i in 1:20) df <- rbind(df[sample(6, 5), ], df)
rownames(df) <- 1:nrow(df)
# Assign a random district to every row
df$District <- sample(c("Jerusalem", "North", "Sharon", "South", "Dan"), nrow(df), replace = TRUE)
# Inflate JobSeekers for North (values 1 to 3) and South (values 4 to 6)
df$JobSeekers[df$District == "North"] <- sample(1:3, sum(df$District == "North"), replace = TRUE, prob = c(0.1, 0.5, 0.4))
df$JobSeekers[df$District == "South"] <- sample(4:6, sum(df$District == "South"), replace = TRUE, prob = c(0.1, 0.5, 0.4))

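When analysing categorical variables, it is better to convert character columns to factors. That way you can control the levels of the factor.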

## Make the District as a factor

df$District <- as.factor(df$District)
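To illustrate what controlling the levels means, here is a small sketch on a made-up character vector. The vector `district` and the level order chosen are hypothetical.

```r
# Hypothetical character vector of districts
district <- c("South", "Dan", "North", "Dan", "Jerusalem")

# as.factor() orders the levels alphabetically
f_default <- as.factor(district)
levels(f_default)

# factor() with an explicit levels argument sets the order yourself
f_custom <- factor(district, levels = c("North", "South", "Dan", "Jerusalem"))
levels(f_custom)

# relevel() moves one level to the front, making it the reference group
f_ref <- relevel(f_default, ref = "South")
levels(f_ref)
```

The level order decides the reference group in a model and the order of the groups on a plot axis, which is why it is worth setting deliberately.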

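Next, get the group means and standard deviations to see whether there are any meaningful differences between the groups. As you can see, I modified the South and North districts so that they have the highest means compared with the other districts.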

## Get the group means and standard deviations
group.mean.sd <- aggregate(
  x = df$JobSeekers, # Specify data column
  by = list(df$District), # Specify group indicator
  FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)

r$> group.mean.sd
    Group.1    x.mean      x.sd
1       Dan 1.1000000 0.3077935
2 Jerusalem 1.0000000 0.0000000
3     North 2.3225806 0.5992827
4    Sharon 1.1363636 0.3512501
5     South 5.2380952 0.4364358
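One detail worth knowing: when FUN returns a vector, aggregate stores the results as a single matrix column named x, even though the printout looks like two columns. The sketch below uses a small made-up data frame (`toy` is hypothetical) and shows one way to flatten that into ordinary columns.

```r
# Hypothetical toy data with three districts
toy <- data.frame(
  District = rep(c("Dan", "North", "South"), each = 4),
  JobSeekers = c(1, 1, 2, 1,  2, 3, 2, 3,  5, 6, 5, 4)
)

group.mean.sd <- aggregate(
  x = toy$JobSeekers,
  by = list(District = toy$District),
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# Only two columns: District and a matrix column x holding mean and sd
ncol(group.mean.sd)

# Flatten the matrix column into separate x.mean and x.sd columns
flat <- do.call(data.frame, group.mean.sd)
names(flat)
```

After flattening, `flat$x.mean` and `flat$x.sd` can be used like any other column.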

Finally, you can run the ANOVA test and the Tukey test as follows.

## Run one way ANOVA test
    anova_one_way <- aov(JobSeekers~ District, data = df)
    summary(anova_one_way)

r$> summary(anova_one_way)
             Df Sum Sq Mean Sq F value Pr(>F)
District      4 260.09   65.02   346.1 <2e-16 ***
Residuals   101  18.97    0.19
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
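If you need the F statistic or the p-value as numbers rather than reading them off the printout, they can be pulled from the summary table. This is a sketch on a small made-up data frame. `toy`, `fit`, `f_value` and `p_value` are illustrative names.

```r
# Hypothetical toy data with three districts
toy <- data.frame(
  District = factor(rep(c("Dan", "North", "South"), each = 4)),
  JobSeekers = c(1, 1, 2, 1,  2, 3, 2, 3,  5, 6, 5, 4)
)

fit <- aov(JobSeekers ~ District, data = toy)

# summary() returns a list whose first element is the ANOVA table
tab <- summary(fit)[[1]]

# Row 1 is the District term, row 2 is the residuals
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
c(F = f_value, p = p_value)
```

A small p-value here only says that at least one district differs. The Tukey test is what identifies which pairs.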

## Run the Tukey Test to compare the groups 
    TukeyHSD(anova_one_way)

r$> Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = JobSeekers ~ District, data = df)

$District
                        diff        lwr        upr     p adj
Jerusalem-Dan    -0.10000000 -0.5396190  0.3396190 0.9695592
North-Dan         1.22258065  0.8772809  1.5678804 0.0000000
Sharon-Dan        0.03636364 -0.3356042  0.4083315 0.9987878
South-Dan         4.13809524  3.7619337  4.5142567 0.0000000
North-Jerusalem   1.32258065  0.9132542  1.7319071 0.0000000
Sharon-Jerusalem  0.13636364 -0.2956969  0.5684241 0.9048406
South-Jerusalem   4.23809524  3.8024191  4.6737714 0.0000000
Sharon-North     -1.18621701 -1.5218409 -0.8505932 0.0000000
South-North       2.91551459  2.5752488  3.2557803 0.0000000
South-Sharon      4.10173160  3.7344321  4.4690311 0.0000000
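With many groups the Tukey table gets long, so it can help to keep only the pairs whose adjusted p-value falls below your threshold. This is a sketch on a small made-up data frame. `toy`, `pairs` and `sig` are illustrative names, and 0.05 is just the conventional cutoff.

```r
# Hypothetical toy data with three districts
toy <- data.frame(
  District = factor(rep(c("Dan", "North", "South"), each = 4)),
  JobSeekers = c(1, 1, 2, 1,  2, 3, 2, 3,  5, 6, 5, 4)
)

fit <- aov(JobSeekers ~ District, data = toy)

# TukeyHSD() returns one matrix per factor, with columns diff, lwr, upr, p adj
pairs <- TukeyHSD(fit)$District

# Keep only the comparisons with an adjusted p-value below 0.05
sig <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
sig
```

Each row name has the form "B-A", and diff is the mean of B minus the mean of A.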

Lastly, you can use a box plot to see which district has the most job seekers.

## Check the mean differences across the groups 

library(ggplot2)
ggplot(df, aes(x = District, y = JobSeekers, fill = District)) +
    geom_boxplot() +
    geom_jitter(shape = 15,
        color = "steelblue",
        position = position_jitter(0.21)) +
    theme_classic()

Update

Based on your update, you can use the following syntax to shorten the x-axis labels and change the legend.

library(stringr)
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = str_wrap(Occupation,10))) +
    geom_boxplot() +
    geom_jitter(
        shape = 19,
        color = "black",
        position = position_jitter(0.21)
    ) +
     scale_x_discrete(
        labels =
            c(
                "Academic degree" = "Academic",
                "Practical engineers and technicians" = "Engineering",
                'Production and construction'='Production',
                "Sales and costumer service" = "Sales",
                "Unprofessional workers" = "Unprofessional",
                "Undefined" = "Undefined"
            )
    ) +
    labs(fill = "Occupation") +
        theme_classic()+
        theme(
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.key.height=unit(2, "cm")
            #legend.position = "top",
            
        )

You should get a plot like this.
