plyr是否会跳过因子[即分组变量]的缺失级别?这是我诊断问题的第一个问题。
我有一个数据集,患者在strata=rural
或strata=city
。我想将age
中的treatment=A
与treatment=B
进行比较。
例如,我正在尝试:
ddply(data.c, .(strata), function(x) t.test(age~treatment, data=x, na.rm=TRUE)
但它告诉我
t.test.formula中的错误(年龄〜治疗,数据= x,na.rm = TRUE): 分组因子必须正好有2个级别
如果我运行factor(data$strata)
和factor(data$treatment)
我分别只看到两个级别(它们分别是两个标签;这不是问题,对吧?)。
plyr是否认为NA是分组因子中的一个级别?错误消息可能出现什么问题?
我一直在谷歌搜索并查看stackoverflow来回答这个问题。我对R很新,但我找不到答案。非常感谢任何帮助。
所以我有一些示例代码。 Chase的例子工作得很好但是当我使用我的数据时,我仍然得到同样的错误。
dlply(data.c, .(strata), function(x) t.test(age~treatment, data=x, na.rm=TRUE
产生了同样的错误。
我不明白为什么会这样。
以下是我的data.c
strata age treatment
1 1 57.8630136986301 p_3
2 0 52.958904109589 p_3
3 NA 97.2438356164384 p12
4 0 88.2027397260274 p12
5 1 77.5890410958904 p12
6 1 43.9123287671233 p12
7 1 63.5260273972603 p_3
8 1 42.1890410958904 p12
9 0 52.1753424657534 p12
10 1 65.6493150684932 p12
11 0 44.9835616438356 p12
12 1 64.8849315068493 p12
13 1 57.5835616438356 p12
14 0 47.3013698630137 p12
15 0 74.0356164383562 NA
16 0 65.4986301369863 p12
17 1 83.986301369863 p12
18 0 47.4904109589041 p12
19 1 47.8630136986301 p12
20 1 58.8520547945205 p12
21 1 61.3342465753425 p12
22 1 66.841095890411 p12
23 1 55.6383561643836 p12
24 1 52.7178082191781 p12
25 1 71.4630136986301 p12
26 1 NA p12
27 1 59.2082191780822 p12
28 1 69.8575342465753 p12
29 1 46.7397260273973 p12
30 1 53.5013698630137 p_3
31 1 41.3205479452055 p12
32 0 51.3917808219178 p_3
33 1 47.8684931506849 p12
34 1 87.654794520548 p12
35 0 75.558904109589 p12
36 1 71.2520547945205 p12
37 1 44.9808219178082 p_3
38 1 52 p_3
39 1 54.7643835616438 p_3
40 1 85.6630136986301 p_3
41 1 74.1835616438356 p_3
42 1 56.8684931506849 p_3
43 1 87.572602739726 p_3
44 1 85.0109589041096 p_3
45 1 70.0767123287671 p_3
46 0 47.2328767123288 p12
47 1 63.7972602739726 p12
48 1 85.8054794520548 p12
49 1 67.027397260274 p12
50 1 60.7342465753425 p12
51 0 61.9479452054794 p_3
52 0 86.8712328767123 p12
53 1 87.8219178082192 p12
54 1 49.9424657534247 p12
55 0 83.386301369863 p12
56 1 88.3013698630137 p12
57 1 55.7890410958904 p12
58 1 63.7616438356164 p12
59 1 55.5041095890411 p12
60 1 43.5232876712329 p12
61 1 58.8246575342466 p12
62 0 46.7397260273973 p12
63 0 74.2027397260274 p_3
64 0 51.9205479452055 p_3
65 0 78.1890410958904 p_3
66 0 78.9917808219178 p_3
答案 0 :(得分:2)
我认为您需要向我们提供一些示例数据,因为我无法复制您的错误。我收到有关ddply()
抱怨以下内容的错误:
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results must be all atomic, or all data frames
这是因为t.test()
的输出不是data.frame,而是类htest
的列表。因此,简单地告诉plyr返回列表对象而不是修复问题。如果要从列表中提取特定值,则可能会返回data.frame。
以下是我设置数据的方式:
x <- data.frame(strata = sample(letters[1:2], 100, TRUE),
age = sample(18:65, 100, TRUE),
treatment = sample(LETTERS[1:2], 100, TRUE))
x[sample(nrow(x), 10, FALSE), "strata"] <- NA
x[sample(nrow(x), 10, FALSE), "treatment"] <- NA
然后只需调用函数dlply()
即可:
out <- dlply(x, .(strata), function(x) t.test(age~treatment, data = x, na.rm=TRUE))
要回答你的问题,是的 - plyr似乎将NA视为一个有效的组,正如out
列表对象的长度和名称所证明的那样:
> names(out)
[1] "a" "b" "NA"
正如我所说,更好地描述你的问题可能会得到一个更好的答案 - 但希望这会让你走上正确的道路。
答案 1 :(得分:1)
plyr
会将缺失视为一个单独的组。因此,在您的示例中,有三个组:0
,1
和NA
。您可以通过添加.inform
参数来查看哪些组引发了错误:
dlply(data.c,
.(strata),
function(x) t.test(age~treatment, data=x, na.rm=TRUE),
.inform=TRUE)
报道:
Error in t.test.formula(age ~ treatment, data = x, na.rm = TRUE) :
grouping factor must have exactly 2 levels
Error: with piece 3:
strata age treatment
3 NA 97.24384 p12
事实上只有一行strata
NA
,并且在单个数据点上执行t.test
(或者更确切地说,所有数据点都在同一个数据点上group)会给出你看到的错误:
> t.test(age~treatment, data=data.c[is.na(data.c$strata),], na.rm=TRUE)
Error in t.test.formula(age ~ treatment, data = data.c[is.na(data.c$strata), :
grouping factor must have exactly 2 levels
要解决此问题,您可以在data.c
电话中对dlply
进行分组:
dlply(data.c[!is.na(data.c$strata),],
.(strata),
function(x) t.test(age~treatment, data=x, na.rm=TRUE))
(假设您知道解决问题的方法)并给出:
$`0`
Welch Two Sample t-test
data: age by treatment
t = 0.2653, df = 15.729, p-value = 0.7942
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.41565 17.24792
sample estimates:
mean in group p_3 mean in group p12
64.22896 62.31283
$`1`
Welch Two Sample t-test
data: age by treatment
t = 0.6606, df = 18.612, p-value = 0.517
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.998471 13.440397
sample estimates:
mean in group p_3 mean in group p12
65.50091 62.27995
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
strata
1 0
2 1
或者您可以通过错误捕获来保护呼叫。 plyr
为此目的提供了便利包装try_default
。
dlply(data.c,
.(strata),
function(x) try_default(t.test(age~treatment, data=x, na.rm=TRUE),
NULL))
给出了
Error in t.test.formula(age ~ treatment, data = x, na.rm = TRUE) :
grouping factor must have exactly 2 levels
$`0`
Welch Two Sample t-test
data: age by treatment
t = 0.2653, df = 15.729, p-value = 0.7942
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.41565 17.24792
sample estimates:
mean in group p_3 mean in group p12
64.22896 62.31283
$`1`
Welch Two Sample t-test
data: age by treatment
t = 0.6606, df = 18.612, p-value = 0.517
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.998471 13.440397
sample estimates:
mean in group p_3 mean in group p12
65.50091 62.27995
$`NA`
NULL
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
strata
1 0
2 1
3 NA
请注意,此输出有3个元素,其中一个元素用于NA
,由于NULL
而设置为try_default
。