Question

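I have been trying to get a table of Shapiro-Wilk normality test p-values for my data frame. Here is the data frame (named "mdf1") as a comma-delimited CSV.

The Shapiro-Wilk test in R requires a sample size of at least 3. To subset my data frame (which contains two relevant factors, "variable" and "Site") to the groups with more than 3 observations, I used the following code: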

    Z <- as.data.frame(data.table(mdf1)[, list(freq=.N, value=value), by=list(Site,variable)][freq > 3])

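This yields a data frame "Z" containing all values that belong to "Site" * "variable" combinations with more than 3 observations. I then tried passing Z to ddply to get a table of Shapiro-Wilk p-values: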

    norm2 <- ddply(Z, .(Site, variable), summarize, n=length(value), sw=shapiro.test(value)[2])

The result of this command is:

Error in shapiro.test(val) : all 'x' values are identical

How can this be? Any ideas?

Answer 1

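Your value variable is a character string here. But ?shapiro.test says that x is "a numeric vector of data values. Missing values are allowed, but the number of non-missing values must be between 3 and 5000." The first two lines of the code are the same as in my answer to your earlier question.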
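A quick way to check the column type before running the test is sketched below. The vector `value_chr` is made-up illustration data, not taken from the question's CSV. It also assumes the column might have been read as a factor rather than as character, in which case `as.numeric()` alone returns the internal level codes instead of the numbers.

```r
# Hypothetical data: numbers that were read in as text
value_chr <- c("10.5", "3.2", "7.8", "12.1", "5.0")
class(value_chr)

# Character -> numeric converts directly
value_num <- as.numeric(value_chr)

# If the column is a factor, convert via character first;
# as.numeric() on a factor returns the level codes instead
value_fac <- factor(value_chr)
codes     <- as.numeric(value_fac)                # level codes, not the data
restored  <- as.numeric(as.character(value_fac))  # the actual values

# The test now receives a numeric vector
p <- shapiro.test(value_num)$p.value
```

Running `str()` on the data frame right after reading the CSV shows which case applies.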

So you can use the following code (tested):

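    # Label each Site x variable combination
    mydata$inter <- with(mydata, interaction(Site, variable))
    # Keep only combinations with more than 3 observations
    mydata1 <- mydata[mydata$inter %in% names(which(table(mydata$inter) > 3)), ]
    library(plyr)
    # [2] picks the p.value element of the shapiro.test result
    ddply(mydata1, .(inter), summarize, n = length(value), sw = shapiro.test(as.numeric(value))[2])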

                    inter  n        sw
    1  41332.Effluent (N) 18 0.6294289
    2  41369.Effluent (N) 18 0.6294289
    3  41385.Effluent (N) 10  0.969692
    4  41394.Effluent (N) 12 0.5272433
    5  41402.Effluent (N) 12 0.4404443
    6  41436.Effluent (N) 14 0.6283259
    7  41439.Effluent (N)  6  0.484449
    8  41450.Effluent (N)  5 0.5012284
    9  41452.Effluent (N) 14 0.5331113
    10 41457.Effluent (N) 12 0.5272433
    11 41458.Effluent (N) 12 0.5272433
    12 43635.Effluent (N)  7 0.7437188
    13 41332.Effluent (S) 13 0.5331956
    14 41369.Effluent (S)  7 0.4869206
    15 41379.Effluent (S)  6  0.484449
    16 41385.Effluent (S)  7 0.4869206
    17 41394.Effluent (S) 12 0.5272433
    18 41436.Effluent (S) 14 0.6283259
    19 41332.Influent (N) 18 0.6294289
    20 41369.Influent (N) 18 0.6294289
    21 41385.Influent (N) 10  0.969692
    22 41394.Influent (N) 12 0.5272433
    23 41402.Influent (N) 12 0.4404443
    24 41436.Influent (N) 14 0.6283259
    25 41439.Influent (N)  6  0.484449
    26 41450.Influent (N)  5 0.5012284
    27 41452.Influent (N) 14 0.5331113
    28 41457.Influent (N) 12 0.5272433
    29 41458.Influent (N) 12 0.5272433
    30 43635.Influent (N)  7 0.7437188
    31 41332.Influent (S) 13 0.5331956
    32 41369.Influent (S)  7 0.4869206
    33 41379.Influent (S)  6  0.484449
    34 41385.Influent (S)  7 0.4869206
    35 41394.Influent (S) 12 0.5272433
    36 41402.Influent (S) 12 0.4404443
    37 41436.Influent (S) 14 0.6283259
    38 41452.Influent (S)  7 0.6578695
    39 41457.Influent (S)  7 0.6578695
    40 41458.Influent (S)  8 0.7159932
    41         41332.PLot  6  0.484449
    42         41369.PLot  6  0.484449
    43         41379.PLot  6  0.484449
    44         41385.PLot  7 0.4869206
    45         41394.PLot 12 0.5272433
    46         41402.PLot 12 0.4404443
    47         41452.PLot  7 0.6578695
    48         41457.PLot  7 0.6578695
    49         41458.PLot  8 0.7159932
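For comparison, here is a minimal base-R sketch of the same per-group table that returns NA instead of stopping when a group cannot be tested. The data frame below is invented for illustration and only borrows the question's column names (Site, variable, value); `safe_sw` is a hypothetical helper, not part of any package.

```r
# Hypothetical long-format data using the question's column names
set.seed(1)
mdf1 <- data.frame(
  Site     = rep(c("A", "B", "C"), times = c(6, 6, 2)),
  variable = "Effluent (N)",
  value    = c(rnorm(6), rep(4.2, 6), rnorm(2))
)

# Return the p-value, or NA when shapiro.test() raises an error
# (identical values, or fewer than 3 observations)
safe_sw <- function(x) {
  tryCatch(shapiro.test(x)$p.value, error = function(e) NA_real_)
}

# One vector of values per Site x variable combination
groups <- split(mdf1$value, list(mdf1$Site, mdf1$variable), drop = TRUE)

# One row per combination: group label, sample size, p-value
result <- data.frame(
  group = names(groups),
  n     = lengths(groups),
  sw    = sapply(groups, safe_sw),
  row.names = NULL
)
result
```

Groups with NA in the sw column are the ones that would have triggered the error, so they can be inspected separately rather than aborting the whole table.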

Answer 2

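The error message is fairly self-explanatory. If all values in a column/variable are identical, you are bound to get this error. I suggest first checking for and removing variables with zero variance. This worked for me: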

    # Load packages
    library(dplyr) # data manipulation
    library(caret) # nearZeroVar function

    # Load your data and assign it the name df
    # *code to load data goes here*

    # Extract names of the numeric columns
    numeric_vars <- df %>% select_if(is.numeric) %>% names()

    # Find zero-variance columns first
    nzv <- nearZeroVar(df, saveMetrics = TRUE)
    zero_var_columns <- rownames(nzv)[nzv$zeroVar]

    # Show
    zero_var_columns

    # Drop zero-variance columns
    df <- df %>%
      select(-all_of(zero_var_columns))

    # Now test for normality; extract the p-value for each numeric column
    p_values <- df %>%
      select_if(is.numeric) %>%
      sapply(function(x) shapiro.test(x)$p.value)

    # TRUE means normality is not rejected at the 5% level
    data.frame(p.value = p_values) %>%
      mutate(Is_normally_distributed = p.value >= .05)
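Since the question runs the test per group rather than per column, a group-level check may be closer to what is needed. The sketch below first reproduces the error on a constant vector, then lists the groups whose values never vary. The data frame `dat` and its column names are made up for illustration.

```r
# A constant vector reproduces the error from the question;
# tryCatch() captures the message instead of stopping
msg <- tryCatch(
  shapiro.test(rep(5, 10)),
  error = function(e) conditionMessage(e)
)
msg

# Hypothetical grouped data: group "b" has no variation
dat <- data.frame(
  grp   = rep(c("a", "b"), each = 5),
  value = c(1.2, 3.4, 2.2, 5.1, 4.0, rep(7, 5))
)

# Count distinct values per group and flag groups with only one
n_distinct_vals <- tapply(dat$value, dat$grp, function(x) length(unique(x)))
constant_groups <- names(n_distinct_vals)[n_distinct_vals == 1]
constant_groups
```

Dropping or reporting the groups in `constant_groups` before calling shapiro.test avoids the error without hiding which groups were affected.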

Shapiro.test in R gives "all 'x' values are identical"?
