I am trying to run a normality test (Shapiro-Wilk) on a dataset, and I want the statistic and p-value for all columns at once. I have read all the other pages on this (R: Shapiro test by group won't produce p-values and corrupt data frame warning, Using shapiro.test on multiple columns in a data frame), but still can't figure it out. Any help would be greatly appreciated!!
So, for example, here is the dataset: there is one character vector (NVL) and the rest are numeric, and I want to group by NVL (NV / VL).
NVL Var1 Var2 Var3 Var4 Var5
1. NV 22.5 26.8 89.2 35.7 100
2. NV 34.7 67.4 29.8 12.4 100
3. NV 68.3 34.5 44.5 23.8 100
4. NV 11.2 55.3 17.5 77.9 100
5. VL 55.6 77.2 59.7 89.6 100
6. VL 60.5 88.7 65.4 99.6 100
7. VL 89.4 87.5 65.9 89.5 100
8. VL 65.4 74.2 75.4 89.5 100
9. VL 81.8 78.5 95.4 92.5 100
Here is the code:
library(dplyr)
normalityVar1<-mydata %>%
group_by(NVL) %>%
summarise(statistic = shapiro.test(Var1)$statistic,
p.value = shapiro.test(Var1)$p.value)
And here is the output:
NVL statistic p.value
<chr> <dbl> <dbl>
1 VL 0.9125239 0.1985486
2 NV 0.8983501 0.2101248
Now, how do I edit this code so that I can get the output for all the variables (Var2, 3, 4, 5) at the same time? I have even tried aggregate, among other things, but I got stuck.
aggregate(formula = Var1 ~ NVL,
data = mydata,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
As you can see, I can only do this for one variable! I know I'm close, but I can't figure out the rest! Thanks in advance for your help!!
Answer 0 (score: 2)
I suggest building a long-format ("tidy") dataset and using tidyverse functions.
Load the packages:
library(dplyr) # for data manipulation functions
library(tidyr) # for data manipulation functions
library(data.table) # for function `fread`
library(broom) # for function `tidy`
Read the data into R:
data <- fread(
"NVL Var1 Var2 Var3 Var4 Var5
NV 22.5 26.8 89.2 35.7 100
NV 34.7 67.4 29.8 12.4 100
NV 68.3 34.5 44.5 23.8 50
NV 11.2 55.3 17.5 77.9 100
VL 55.6 77.2 59.7 89.6 100
VL 60.5 88.7 65.4 99.6 100
VL 89.4 87.5 65.9 89.5 100
VL 65.4 74.2 75.4 89.5 90
VL 81.8 78.5 95.4 92.5 90")
Do the analysis:
# 1. Gather the variables that values should be tested.
# 2. Group by variable with variable names (`variable_name`) and
# by all group variables (in our case `NVL`).
# 3. Do the test for `value` and tidy the result.
# 4. Ungroup (it's a good practice to do this).
# 5. Remove unnecessary information (column `method`).
sw_test_results <- data %>%
gather(key = "variable_name", value = "value", Var1:Var5) %>%
group_by(variable_name, NVL) %>%
do(tidy(shapiro.test(.$value))) %>%
ungroup() %>%
select(-method)
sw_test_results
Result:
# A tibble: 10 x 4
variable_name NVL statistic p.value
<chr> <chr> <dbl> <dbl>
1 Var1 NV 0.931 0.602
2 Var1 VL 0.915 0.498
3 Var2 NV 0.941 0.660
4 Var2 VL 0.874 0.282
5 Var3 NV 0.910 0.480
6 Var3 VL 0.864 0.245
7 Var4 NV 0.900 0.433
8 Var4 VL 0.726 0.0176
9 Var5 NV 0.630 0.00124
10 Var5 VL 0.684 0.00647
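If you prefer one row per NVL group instead, the long results can be reshaped to a wide layout. A minimal sketch (an assumption, reusing the sw_test_results object from above):
sw_test_results %>%
  # one row per (variable, group, measure), then one column per variable/measure
  gather(key = "measure", value = "value", statistic, p.value) %>%
  unite("column", variable_name, measure) %>%
  spread(column, value)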
Answer 1 (score: 1)
mydata <- read.table(text="
NVL Var1 Var2 Var3 Var4 Var5
1 NV 22.5 26.8 89.2 35.7 100
2 NV 34.7 67.4 29.8 12.4 100
3 NV 68.3 34.5 44.5 23.8 50
4 NV 11.2 55.3 17.5 77.9 100
5 VL 55.6 77.2 59.7 89.6 100
6 VL 60.5 88.7 65.4 99.6 100
7 VL 89.4 87.5 65.9 89.5 100
8 VL 65.4 74.2 75.4 89.5 90
9 VL 81.8 78.5 95.4 92.5 90
", header=T)
library(dplyr)
myfun <- function(x, group) {
data.frame(x, group) %>%
group_by(group) %>%
summarise(
statistic = ifelse(sd(x)!=0,shapiro.test(x)$statistic,NA),
p.value = ifelse(sd(x)!=0,shapiro.test(x)$p.value,NA)
)
}
(lst <- lapply(mydata[,-1], myfun, group=mydata[,1]))
The output is:
$Var1
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9313476 0.6023421
2 VL 0.9149572 0.4979450
$Var2
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9409576 0.6601747
2 VL 0.8736587 0.2815562
$Var3
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9096322 0.4804557
2 VL 0.8644349 0.2446131
$Var4
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9003135 0.43261822
2 VL 0.7260939 0.01760713
$Var5
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.6297763 0.001240726
2 VL 0.6840289 0.006470001
The output list lst can be converted into a data.frame object:
# drop the duplicated group columns produced by cbind (every 3rd column, starting at column 4)
do.call(cbind, lst)[,-seq(4,3*(ncol(mydata)-1),3)]
Here is the output:
Var1.group Var1.statistic Var1.p.value Var2.statistic Var2.p.value Var3.statistic Var3.p.value Var4.statistic Var4.p.value Var5.statistic Var5.p.value
1 NV 0.9313476 0.6023421 0.9409576 0.6601747 0.9096322 0.4804557 0.9003135 0.43261822 0.6297763 0.001240726
2 VL 0.9149572 0.4979450 0.8736587 0.2815562 0.8644349 0.2446131 0.7260939 0.01760713 0.6840289 0.006470001
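If you would rather not drop the duplicated group columns by position, an alternative sketch (the name named is just illustrative) is to rename each list element's columns and merge the results on the group column:
# rename the columns of each per-variable result, then merge them all on "group"
named <- Map(function(d, nm) setNames(d, c("group", paste(nm, c("statistic", "p.value"), sep = "."))),
             lst, names(lst))
Reduce(function(a, b) merge(a, b, by = "group"), named)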
Answer 2 (score: 1)
Just use summarise_all:
mydata <- read.table(text="
NVL Var1 Var2 Var3 Var4 Var5
1 NV 22.5 26.8 89.2 35.7 100
2 NV 34.7 67.4 29.8 12.4 100
3 NV 68.3 34.5 44.5 23.8 50
4 NV 11.2 55.3 17.5 77.9 100
5 VL 55.6 77.2 59.7 89.6 100
6 VL 60.5 88.7 65.4 99.6 100
7 VL 89.4 87.5 65.9 89.5 100
8 VL 65.4 74.2 75.4 89.5 90
9 VL 81.8 78.5 95.4 92.5 90
", header=T)
library(dplyr)
normalityVar1<-mydata %>%
group_by(NVL) %>%
summarise_all(.funs = funs(statistic = shapiro.test(.)$statistic,
p.value = shapiro.test(.)$p.value))
Which gives the desired output:
normalityVar1
# A tibble: 2 x 11
NVL Var1_statistic Var2_statistic Var3_statistic Var4_statistic Var5_statistic Var1_p.value Var2_p.value Var3_p.value
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NV 0.9313476 0.9409576 0.9096322 0.9003135 0.6297763 0.6023421 0.6601747 0.4804557
2 VL 0.9149572 0.8736587 0.8644349 0.7260939 0.6840289 0.4979450 0.2815562 0.2446131
# ... with 2 more variables: Var4_p.value <dbl>, Var5_p.value <dbl>
Note that you get all the statistics first, followed by all the p-values. Reordering the columns should be straightforward if needed.
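Also note that funs() has since been deprecated; with dplyr 1.0 or later the same idea can be written with across(). A sketch, assuming a current dplyr version:
mydata %>%
  group_by(NVL) %>%
  summarise(across(Var1:Var5,
                   list(statistic = ~ shapiro.test(.x)$statistic,
                        p.value   = ~ shapiro.test(.x)$p.value)))
With across() the statistic and p-value columns come out grouped per variable (Var1_statistic, Var1_p.value, ...), so no reordering is needed.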
Answer 3 (score: 0)
Building on @GegznaV's excellent answer, here is an update in which the newer tidyr::pivot_longer and nest-unnest structure replaces the tidyr::gather syntax. In my experience, broom::glance provides the test statistics more reliably than broom::tidy, but you can try both.
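A sketch of what that structure could look like (an assumption, reusing the mydata object from the answers above, plus purrr for map()):
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

sw_test_results <- mydata %>%
  # long format: one row per (variable, observation)
  pivot_longer(Var1:Var5, names_to = "variable_name", values_to = "value") %>%
  group_by(variable_name, NVL) %>%
  # nest the values for each variable/group combination, test each, then unnest
  nest() %>%
  mutate(result = map(data, ~ glance(shapiro.test(.x$value)))) %>%
  unnest(result) %>%
  ungroup() %>%
  select(-data, -method)

sw_test_results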