Kruskal-Wallis test: create an lapply function for subsets of a data.frame?

Asked: 2018-05-04 17:49:26

Tags: r loops

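I have a set of numeric values (val) grouped by two categories (distance and phase). I want to run a Kruskal-Wallis test within each category, where val is the dependent variable, distance is the factor, and phase splits my data into 3 groups.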

So I need to specify a subset of the data in the Kruskal-Wallis test and then apply the test to each group. However, I can't get my subsetting to work!

The R help describes subset as "an optional vector specifying a subset of observations to be used." But how do I add it correctly to my lapply call?

My dummy data:

# create data
val<-runif(60, min = 0, max = 100)
distance<-floor(runif(60, min=1, max=3))
phase<-rep(c("a", "b", "c"), 20)

df<-data.frame(val, distance, phase)

# get unique groups
ii<-unique(df$phase)

# get basic statistics per group
aggregate(val ~ distance + phase, df, mean)

# run Kruskal test, specify the subset
kruskal.test(df$val ~ df$distance,
             subset = phase == "c")

This works, so my subset seems to be set up correctly as a vector. But how do I use it inside lapply?

# DOES not work!!
lapply(ii, kruskal.test(df$val ~ df$distance,
                        subset = df$phase == as.character(ii))) 

My overall goal is to build a function around kruskal.test and save all the statistics for each group in a single table.

Any help is much appreciated.

2 Answers:

Answer 0 (score: 3)

Typically you would split first and then lapply.

Something like
lapply(split(df, df$phase), function(d) { kruskal.test(val ~ distance, data=d) })

will produce a list, indexed by phase, of the kruskal.test results.
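Since the question asks for all statistics in one table, here is a minimal sketch of one way to flatten that list in base R. The seed, the regenerated dummy data, and the helper name `tidy_kw` are assumptions made for illustration; they are not part of the original answer.

```r
# Self-contained dummy data shaped like the question's (the seed is an assumption)
set.seed(42)
df <- data.frame(
  val      = runif(60, min = 0, max = 100),
  distance = sample(1:2, 60, replace = TRUE),
  phase    = rep(c("a", "b", "c"), 20)
)

# One kruskal.test per phase; split() names the list after the phase values
tests <- lapply(split(df, df$phase),
                function(d) kruskal.test(val ~ distance, data = d))

# Turn one "htest" object into a one-row data frame
# (`tidy_kw` is a hypothetical helper name)
tidy_kw <- function(phase, test) {
  data.frame(
    phase     = phase,
    statistic = unname(test$statistic),  # Kruskal-Wallis chi-squared
    df        = unname(test$parameter),  # degrees of freedom
    p.value   = test$p.value
  )
}

# Stack the per-phase rows into a single table
results <- do.call(rbind, Map(tidy_kw, names(tests), tests))
rownames(results) <- NULL
print(results)
```

Each row of `results` then holds the test statistic, degrees of freedom, and p-value for one phase.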

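Your final expression doesn't work because lapply expects a function, and calling kruskal.test doesn't produce a function; it produces the result of running the test. If you wrap the call in a function definition that takes the index, it will work, though it is less idiomatic.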

lapply(ii, function(i) { kruskal.test(df$val ~ df$distance, subset=df$phase==i )})
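The following sketch illustrates both points under stated assumptions (a fixed seed and regenerated dummy data; `chi_sq` is a hypothetical helper). It also passes `data = df` instead of the `df$` prefixes, which is a stylistic substitution rather than the exact call above. Note that in the question's attempt, `df$phase == as.character(ii)` compares against all three phase labels at once; R recycles the length-3 vector, and because the dummy phases repeat as a, b, c, every comparison is TRUE and no rows are dropped.

```r
# Self-contained dummy data shaped like the question's (the seed is an assumption)
set.seed(7)
df <- data.frame(
  val      = runif(60, min = 0, max = 100),
  distance = sample(1:2, 60, replace = TRUE),
  phase    = rep(c("a", "b", "c"), 20)
)
ii <- as.character(unique(df$phase))

# The call from the question runs immediately and returns a test result,
# so lapply() would receive an "htest" object where it needs a function
res <- kruskal.test(df$val ~ df$distance, subset = df$phase == ii)
is.function(res)
class(res)

# Recycling: the length-3 `ii` lines up with the repeating a, b, c pattern
all(df$phase == ii)

# Wrapping the call in a function fixes it; setNames() labels the output,
# because lapply() over an unnamed character vector returns an unnamed list
by_index <- lapply(setNames(ii, ii), function(i) {
  kruskal.test(val ~ distance, data = df, subset = phase == i)
})

# The split-then-lapply version for comparison
by_split <- lapply(split(df, df$phase),
                   function(d) kruskal.test(val ~ distance, data = d))

# Pull out the test statistics (`chi_sq` is a hypothetical helper name)
chi_sq <- function(tests) vapply(tests, function(x) unname(x$statistic), numeric(1))
all.equal(chi_sq(by_split), chi_sq(by_index))
```

Both versions test the same rows per phase, so the statistics should agree; the split version simply avoids tracking the index by hand.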

Answer 1 (score: 2)

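This is a late answer, but it may help someone who runs into the same problem. I'll implement the answer using the tidyverse and rstatix packages. The rstatix package "provides a simple and intuitive pipe-friendly framework, coherent with the 'tidyverse' design philosophy, for performing basic statistical tests".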

library(rstatix)
library(tidyverse)

df %>% 
  group_by(phase) %>% 
  kruskal_test(val ~ distance)

Output

# A tibble: 3 x 7
  phase .y.       n statistic    df     p method        
* <chr> <chr> <int>     <dbl> <int> <dbl> <chr>         
1 a     val      20    0.230      1 0.631 Kruskal-Wallis
2 b     val      20    0.0229     1 0.88  Kruskal-Wallis
3 c     val      20    0.322      1 0.570 Kruskal-Wallis

These results are the same as those from @user295691's answer.

Data

df = structure(list(val = c(93.8056977232918, 31.0681172646582, 40.5262873973697, 
47.6368983509019, 65.23181500379, 64.4571609096602, 10.3301600087434, 
90.4661140637472, 41.2359046051279, 28.3357713604346, 49.8977075796574, 
10.8744730940089, 5.31001624185592, 71.9248640118167, 99.0267782937735, 
73.7928744405508, 3.31214582547545, 40.2693636715412, 27.6980920461938, 
79.501334275119, 60.5167196830735, 89.9171086261049, 87.4633299885318, 
43.1893823202699, 91.1248738644645, 99.755659350194, 7.25280269980431, 
96.957387868315, 75.0860505970195, 52.3794749286026, 26.6221587313339, 
52.5518182432279, 24.1361060412601, 49.5364486705512, 65.5214034719393, 
38.9469220302999, 0.687191751785576, 19.3090825574473, 19.6511475136504, 
25.5966754630208, 7.33999472577125, 33.9820940745994, 50.3751677693799, 
10.811762069352, 17.2359711956233, 53.958406439051, 64.2723652534187, 
92.7404976682737, 26.824192632921, 30.0975760444999, 52.0105463219807, 
74.4495407678187, 56.0636054025963, 91.891074879095, 14.0827904455364, 
59.3607738381252, 66.5170294465497, 24.1726311156526, 83.0881901318207, 
35.5380675755441), distance = c(2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 
2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 
1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 
1, 2, 1, 1, 2, 2, 2, 2), phase = c("a", "b", "c", "a", "b", "c", 
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", 
"b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", 
"c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", 
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", 
"b", "c")), class = "data.frame", row.names = c(NA, -60L))