shapiro.test on multiple columns at once in dplyr

Time: 2017-10-26 13:26:02

Tags: r dplyr

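I am trying to run a normality test (Shapiro-Wilk) on a data set, and I want the statistic and p-value for all columns at once. I have read all the other pages on this (R: Shapiro test by group won't produce p-values and corrupt data frame warning; Using shapiro.test on multiple columns in a data frame), but I still can't figure it out. Any help would be greatly appreciated!

For example, here is the data set: there is one character vector (NVL) and the rest are numeric, and I want to group by NVL (NV / VL).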

     NVL  Var1  Var2  Var3  Var4  Var5
1.   NV   22.5  26.8   89.2  35.7   100
2.   NV   34.7  67.4   29.8  12.4   100
3.   NV   68.3  34.5   44.5  23.8   100
4.   NV   11.2  55.3   17.5  77.9   100
5.   VL   55.6  77.2   59.7  89.6   100
6.   VL   60.5  88.7   65.4  99.6   100
7.   VL   89.4  87.5   65.9  89.5   100
8.   VL   65.4  74.2   75.4  89.5   100
9.   VL   81.8  78.5   95.4  92.5   100

Here is the code:

library(dplyr)
normalityVar1<-mydata %>%
group_by(NVL) %>%
summarise(statistic = shapiro.test(Var1)$statistic, 
p.value = shapiro.test(Var1)$p.value)

Here is the output:

NVL statistic   p.value
  <chr>     <dbl>     <dbl>
1    VL 0.9125239 0.1985486
2    NV 0.8983501 0.2101248

Now, how do I edit this code so that I get the output for all variables (Var2, 3, 4, 5) at once? I even tried aggregate, among other things, but I am stuck.

aggregate(formula = Var1 ~ NVL,
data = mydata,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)}) 

As you can see, I can only do this for one variable! I know I am close, but I can't figure out the rest. Thanks in advance for your help!

4 answers:

Answer 0 (score: 2)

I suggest reshaping the data into a long ("tidy") format and using tidyverse functions.

Load the packages:

library(dplyr)      # for data manipulation functions
library(tidyr)      # for data manipulation functions
library(data.table) # for function `fread`
library(broom)      # for function `tidy`

Read the data into R:

data <- fread(
"NVL   Var1  Var2   Var3  Var4   Var5
  NV   22.5  26.8   89.2  35.7   100
  NV   34.7  67.4   29.8  12.4   100
  NV   68.3  34.5   44.5  23.8   50
  NV   11.2  55.3   17.5  77.9   100
  VL   55.6  77.2   59.7  89.6   100
  VL   60.5  88.7   65.4  99.6   100
  VL   89.4  87.5   65.9  89.5   100
  VL   65.4  74.2   75.4  89.5   90
  VL   81.8  78.5   95.4  92.5   90")

Do the analysis:

# 1. Gather the variables whose values should be tested into long format.
# 2. Group by the variable name (`variable_name`) and 
#    by all grouping variables (in our case `NVL`).
# 3. Run the test on `value` and tidy the result.
# 4. Ungroup (it is good practice to do this).
# 5. Remove unnecessary information (column `method`).

sw_test_results <- data %>% 
    gather(key = "variable_name", value = "value", Var1:Var5) %>% 
    group_by(variable_name, NVL)  %>% 
    do(tidy(shapiro.test(.$value))) %>% 
    ungroup() %>% 
    select(-method)

sw_test_results

Result:

# A tibble: 10 x 4
   variable_name NVL   statistic p.value
   <chr>         <chr>     <dbl>   <dbl>
 1 Var1          NV        0.931 0.602  
 2 Var1          VL        0.915 0.498  
 3 Var2          NV        0.941 0.660  
 4 Var2          VL        0.874 0.282  
 5 Var3          NV        0.910 0.480  
 6 Var3          VL        0.864 0.245  
 7 Var4          NV        0.900 0.433  
 8 Var4          VL        0.726 0.0176 
 9 Var5          NV        0.630 0.00124
10 Var5          VL        0.684 0.00647

Answer 1 (score: 1)

mydata <- read.table(text="
   NVL  Var1  Var2  Var3  Var4  Var5
1   NV   22.5  26.8   89.2  35.7   100
2   NV   34.7  67.4   29.8  12.4   100
3   NV   68.3  34.5   44.5  23.8   50
4   NV   11.2  55.3   17.5  77.9   100
5   VL   55.6  77.2   59.7  89.6   100
6   VL   60.5  88.7   65.4  99.6   100
7   VL   89.4  87.5   65.9  89.5   100
8   VL   65.4  74.2   75.4  89.5   90
9   VL   81.8  78.5   95.4  92.5   90
", header=T)

library(dplyr)
myfun <- function(x, group) {
  data.frame(x, group) %>%
  group_by(group) %>%
  summarise(
    statistic = ifelse(sd(x)!=0,shapiro.test(x)$statistic,NA), 
    p.value = ifelse(sd(x)!=0,shapiro.test(x)$p.value,NA)
  )
}
(lst <- lapply(mydata[,-1], myfun, group=mydata[,1]))

The output is:

$Var1
# A tibble: 2 x 3
   group statistic   p.value
  <fctr>     <dbl>     <dbl>
1     NV 0.9313476 0.6023421
2     VL 0.9149572 0.4979450

$Var2
# A tibble: 2 x 3
   group statistic   p.value
  <fctr>     <dbl>     <dbl>
1     NV 0.9409576 0.6601747
2     VL 0.8736587 0.2815562

$Var3
# A tibble: 2 x 3
   group statistic   p.value
  <fctr>     <dbl>     <dbl>
1     NV 0.9096322 0.4804557
2     VL 0.8644349 0.2446131

$Var4
# A tibble: 2 x 3
   group statistic    p.value
  <fctr>     <dbl>      <dbl>
1     NV 0.9003135 0.43261822
2     VL 0.7260939 0.01760713

$Var5
# A tibble: 2 x 3
   group statistic     p.value
  <fctr>     <dbl>       <dbl>
1     NV 0.6297763 0.001240726
2     VL 0.6840289 0.006470001

The resulting list lst can be converted to a data.frame object:

do.call(cbind, lst)[,-seq(4,3*(ncol(mydata)-1),3)]

Here is the output:

  Var1.group Var1.statistic Var1.p.value Var2.statistic Var2.p.value Var3.statistic Var3.p.value Var4.statistic Var4.p.value Var5.statistic Var5.p.value
1         NV      0.9313476    0.6023421      0.9409576    0.6601747      0.9096322    0.4804557      0.9003135   0.43261822      0.6297763  0.001240726
2         VL      0.9149572    0.4979450      0.8736587    0.2815562      0.8644349    0.2446131      0.7260939   0.01760713      0.6840289  0.006470001
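
A side note on the sd(x) != 0 guard in myfun: shapiro.test() stops with an error when all values in a sample are identical (and also when there are fewer than 3 non-missing values), which is the situation for Var5 in the question's data, where every row is 100. The following is only a minimal sketch of an alternative guard using tryCatch; the helper name safe_shapiro is hypothetical and is not part of any answer here.

```r
# Hypothetical helper (an assumption for illustration): return NA for both
# values instead of stopping when shapiro.test() cannot be computed.
safe_shapiro <- function(x) {
  tryCatch({
    res <- shapiro.test(x)
    c(statistic = unname(res$statistic), p.value = res$p.value)
  }, error = function(e) {
    c(statistic = NA_real_, p.value = NA_real_)
  })
}

constant_values <- rep(100, 4)            # like Var5 in the question's data
safe_shapiro(constant_values)             # both elements are NA, no error

nv_var1 <- c(22.5, 34.7, 68.3, 11.2)      # Var1 for group NV
safe_shapiro(nv_var1)                     # named vector: statistic, p.value
```

Compared with checking sd(x) != 0, this also covers groups that are too small for the test, at the cost of silently hiding any other error.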

Answer 2 (score: 1)

Just use summarise_all:

mydata <- read.table(text="
   NVL  Var1  Var2  Var3  Var4  Var5
1   NV   22.5  26.8   89.2  35.7   100
2   NV   34.7  67.4   29.8  12.4   100
3   NV   68.3  34.5   44.5  23.8   50
4   NV   11.2  55.3   17.5  77.9   100
5   VL   55.6  77.2   59.7  89.6   100
6   VL   60.5  88.7   65.4  99.6   100
7   VL   89.4  87.5   65.9  89.5   100
8   VL   65.4  74.2   75.4  89.5   90
9   VL   81.8  78.5   95.4  92.5   90
", header=T)


library(dplyr)
normalityVar1<-mydata %>%
  group_by(NVL) %>%
  summarise_all(.funs = funs(statistic = shapiro.test(.)$statistic, 
                             p.value = shapiro.test(.)$p.value))

With the desired output:

normalityVar1
# A tibble: 2 x 11
    NVL Var1_statistic Var2_statistic Var3_statistic Var4_statistic Var5_statistic Var1_p.value Var2_p.value Var3_p.value
  <fctr>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>        <dbl>        <dbl>        <dbl>
1     NV      0.9313476      0.9409576      0.9096322      0.9003135      0.6297763    0.6023421    0.6601747    0.4804557
2     VL      0.9149572      0.8736587      0.8644349      0.7260939      0.6840289    0.4979450    0.2815562    0.2446131
# ... with 2 more variables: Var4_p.value <dbl>, Var5_p.value <dbl>

Note that you get all the statistics first, followed by all the p-values. Reordering the columns should be straightforward if needed.
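
In more recent dplyr versions (1.0.0 and later), funs() is deprecated and across() is the suggested replacement for the summarise_all family. A minimal sketch of the same computation, assuming dplyr >= 1.0.0 and the same mydata as above; the result name normality_all is made up for this example:

```r
library(dplyr)

mydata <- read.table(text = "
   NVL  Var1  Var2  Var3  Var4  Var5
1   NV   22.5  26.8   89.2  35.7   100
2   NV   34.7  67.4   29.8  12.4   100
3   NV   68.3  34.5   44.5  23.8   50
4   NV   11.2  55.3   17.5  77.9   100
5   VL   55.6  77.2   59.7  89.6   100
6   VL   60.5  88.7   65.4  99.6   100
7   VL   89.4  87.5   65.9  89.5   100
8   VL   65.4  74.2   75.4  89.5   90
9   VL   81.8  78.5   95.4  92.5   90
", header = TRUE)

# Apply both functions to every column from Var1 to Var5, per group.
# New columns are named <column>_<function>, e.g. Var1_statistic.
normality_all <- mydata %>%
  group_by(NVL) %>%
  summarise(across(
    Var1:Var5,
    list(statistic = ~ unname(shapiro.test(.x)$statistic),
         p.value   = ~ shapiro.test(.x)$p.value)
  ))

normality_all
```

With across(), the statistic and p-value of each variable end up next to each other (Var1_statistic, Var1_p.value, Var2_statistic, ...), so no reordering is needed.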

Answer 3 (score: 0)

Building on the excellent answer by @GegznaV, here is an update in which the newer tidyr::pivot_longer construct replaces the tidyr::gather syntax.

In my experience, in a nest-unnest workflow broom::glance delivers the test statistics more reliably than broom::tidy, but you can try both.
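
The code for this answer is not preserved on this page. The following is a minimal sketch of what a pivot_longer plus nest-unnest version might look like, not the answerer's original code. It assumes tidyr >= 1.0.0 and the purrr package (not mentioned in the answer), and the result name sw_nested is made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mydata <- read.table(text = "
   NVL  Var1  Var2  Var3  Var4  Var5
1   NV   22.5  26.8   89.2  35.7   100
2   NV   34.7  67.4   29.8  12.4   100
3   NV   68.3  34.5   44.5  23.8   50
4   NV   11.2  55.3   17.5  77.9   100
5   VL   55.6  77.2   59.7  89.6   100
6   VL   60.5  88.7   65.4  99.6   100
7   VL   89.4  87.5   65.9  89.5   100
8   VL   65.4  74.2   75.4  89.5   90
9   VL   81.8  78.5   95.4  92.5   90
", header = TRUE)

sw_nested <- mydata %>%
  # Long format: one row per (NVL, variable_name, value)
  pivot_longer(cols = Var1:Var5,
               names_to = "variable_name",
               values_to = "value") %>%
  # One row per NVL / variable_name pair, values stored in a list-column
  nest(data = value) %>%
  # Run the test on each nested data frame and summarise it as a one-row tibble
  mutate(test = map(data, ~ glance(shapiro.test(.x$value)))) %>%
  select(-data) %>%
  unnest(test) %>%
  select(-method)

sw_nested
```

Swapping glance() for tidy() in the map() call is the other variant mentioned above.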