My data frame looks like this:
ID | value A | value B
1 | A1 | F
1 | A2 | N
1 | A3 | B
1 | A4 | S
2 | A1 | B
2 | A2 | G
2 | A3 | N
3 | A1 | F
3 | A2 | H
3 | A3 | J
3 | A4 | N
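So each ID is supposed to have 4 rows. I am trying to use the dcast() function, but it only seems to work when all IDs have the same number of rows. In this example, ID 2 would be the problem case. Is there a simple way to find all IDs that have more or fewer than 4 rows? Or is there a way to make dcast ignore the problem cases?
Originally I was trying to reshape the data frame to get a result like this: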
ID | A1 | A2 | A3 | A4
1 | F | N | B | S
2 | B | G | N | NA
3 | F | H | J | N
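Apparently the dcast() function from the reshape2 package does not work with irregular IDs. It gives me the following message: 'Aggregation function missing: defaulting to length'. On a small subset of my data set, without those irregular IDs, it works. Any ideas? Or perhaps a suggestion for how to reshape my data frame without using dcast? Thanks!
I am working on a Mac with the following (package) versions: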
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.2.1 plyr_1.7.1
loaded via a namespace (and not attached):
[1] stringr_0.6
The values in the first column are all integers; the other columns are character values.
sapply(x, class)
ID fach01 f01_lp
"integer" "character" "character"
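As for a reproducible example, I hope this helps (I used my original data frame). If I use only the first 500 rows of the data frame, dcast() works perfectly well; the problem appears when I try to use the whole data frame of about 140,000 rows.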
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L), A = c("2.LF",
"1.LF", "3.PF", "4.PF", "3.PF", "1.LF", "2.LF", "3.PF",
"4.PF", "1.LF", "2.LF", "3.PF", "1.LF", "4.PF", "2.LF", "1.LF",
"2.LF", "4.PF", "3.PF", "1.LF", "3.PF", "2.LF", "4.PF", "3.PF",
"4.PF", "1.LF", "2.LF", "4.PF", "2.LF", "3.PF", "1.LF", "1.LF",
"2.LF", "3.PF", "4.PF"), B = c("Mu/Ku",
"Fs", "2.AF", "NW", "DE", "2.AF", "MA", "Fs", "2.AF", "NW",
"NW", "Fs", "2.AF", "bel", "NW", "Fs", "bel", "bel", "NW", "DE",
"2.AF", "2.AF", "MA", "Fs", "2.AF", "MA", "NW", "DE", "2.AF",
"MA", "NW", "Mu/Ku", "Fs", "2.AF", "NW")), .Names = c("ID", "A", "B"
), row.names = c("3", "5", "7", "10", "26", "29", "212", "213",
"32", "35", "38", "39", "43", "44", "45", "48", "53", "56", "57",
"59", "61", "65", "67", "68", "72", "75", "76", "77", "81", "86",
"87", "88", "92", "93", "95", "98"), class = "data.frame")
In my original data frame the values A1-A4 (called 1.PF - 4.PF here) are not in the right order. This is what I want dcast to do (same as above):
ID | 1.PF | 2.PF | 3.PF | 4.PF
1 | F | NW | DE | S
2 | bel | G | N | <NA>
3 | F | NW | bel | N
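Edit:
I did not solve the dcast() problem, but I found a way around it with the reshape() function (which is part of base R's stats package, not the reshape package):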
df <- reshape(df, idvar='ID', varying = NULL, timevar = 'value A', direction='wide')
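For illustration, here is a minimal sketch of that reshape() workaround on toy data shaped like the table at the top of the question. The data frame `toy`, its column names, and the step that strips the `B.` prefix are assumptions added for the example and are not part of the original call.

```r
# Hypothetical toy data mirroring the first table in the question.
toy <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  A  = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A1", "A2", "A3", "A4"),
  B  = c("F", "N", "B", "S", "B", "G", "N", "F", "H", "J", "N"),
  stringsAsFactors = FALSE
)

# One row per ID, one column per level of A; missing ID/A pairs become NA.
wide <- reshape(toy, idvar = "ID", timevar = "A", direction = "wide")

# reshape() names the new columns B.A1, B.A2, ...; drop the "B." prefix.
names(wide) <- sub("^B\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

ID 2 has no A4 row, so its A4 cell comes out as NA rather than causing an error.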
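Answer 0 (score: 2)
You should have mentioned that dcast is from the reshape2 package (not part of base R). I'm not sure what you are trying to do with it, but this should do what you ask.
Making up some data: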
library(plyr)  # provides ddply() and .()
id <- rep(1:3, c(4, 3, 4))
d <- data.frame(id)
d <- ddply(d, .(id),
           function(x) {
             transform(x, A = paste("A", seq(nrow(x)), sep = ""),
                       B = sample(LETTERS, nrow(x), replace = TRUE))
           })
Identifying the 'bad' groups:
idtab <- table(d$id)
d2 <- d[!d$id %in% names(idtab)[idtab < 4], ]
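Although I can do that, when I try dcast on the full data set it does the "right" thing (i.e. what I would hope for and what you want) and fills in the missing values with NA; I don't get an error (I am using reshape2 v 0.8.4 under a development version of R).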
library(reshape2)
Using the cleaned data:
dcast(d2,id~A)
# Using B as value column: use value.var to override.
# id A1 A2 A3 A4
# 1 1 B X P E
# 2 3 F Q H B
Using the original data:
dcast(d,id~A)
# Using B as value column: use value.var to override.
# id A1 A2 A3 A4
# 1 1 B X P E
# 2 2 I N H <NA>
# 3 3 F Q H B
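Since missing combinations are simply filled with NA here, one possible explanation for the 'Aggregation function missing: defaulting to length' message is that some ID/variable pair occurs more than once in the full data, because dcast prints that message when a cell would receive more than one value. This is an assumption about the asker's data, not something the posted sample confirms. A base R sketch of how one might check, using a hypothetical data frame `toy`:

```r
# Hypothetical data in which the pair (id = 1, A = "A2") appears twice.
toy <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  A  = c("A1", "A2", "A2", "A1", "A2", "A3"),
  B  = c("F", "N", "X", "B", "G", "N"),
  stringsAsFactors = FALSE
)

# Flag every row whose (id, A) combination is not unique,
# scanning from both ends so that all copies are reported.
key <- toy[c("id", "A")]
dup_rows <- toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]
dup_rows
```

If `dup_rows` is empty for the real data, duplicates are not the cause and the explanation lies elsewhere.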
Answer 1 (score: 2)
table and which are surely the answer to the first question:
names(table(dfrm$ID))[which(table(dfrm$ID) <4)]
#[1] "2"
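The question also asks about IDs with more than 4 rows; the same idea covers that case if the comparison is changed to `!=`. A small sketch with a hypothetical data frame `toy` (ID 2 has too few rows, ID 4 too many):

```r
# Hypothetical IDs: 1 and 3 have four rows, 2 has three, 4 has five.
toy <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4))

counts <- table(toy$ID)

# names() of a table are character strings, so the IDs come back as text.
bad_ids <- names(counts)[counts != 4]
bad_ids

# Keep only the rows whose ID has exactly four rows.
clean <- toy[!toy$ID %in% bad_ids, , drop = FALSE]
```

`%in%` compares the numeric IDs with the character names after coercion, so no explicit conversion is needed for the subsetting step.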
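As for the second question, perhaps you should post the code that produces the error. It is not clear what you are trying (and failing) to do.
EDIT:
If I convert the factor variables to character variables, I can get dcast to return the correct object, although my error is different from yours. On a Mac, I get the error with both reshape2 1.1 and reshape2 1.2.1 on R 2.14.1.
EDIT2: It turns out the error has been fixed in the latest version of plyr. Running reshape2 1.2.1 with plyr 1.7 gives no error. You should update both packages as well and then restart R.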
require(reshape2)
dfrm <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3), value.A = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L), .Label = c(" A1 ",
" A2 ", " A3 ", " A4 "), class = "factor"), value.B = structure(c(2L,
6L, 1L, 7L, 1L, 3L, 6L, 2L, 4L, 5L, 6L), .Label = c(" B", " F",
" G", " H", " J", " N", " S"), class = "factor")), .Names = c("ID",
"value.A", "value.B"), class = "data.frame", row.names = c(NA,
-11L))
dcast(dfrm, ID ~ value.A)
# Using value.B as value column: use value_var to override.
# Error in names(data) <- array_names(res$labels[[2]]) :
# 'names' attribute [4] must be the same length as the vector [1]
# I first tried removing the leading and trailing spaces with:
dfrm2 <- data.frame(lapply(dfrm, gsub, pattern="^\\s+|\\s+$", replacement=""))
# Still got the error. Now try to leave as "character" type.
dfrm2 <- data.frame(lapply(dfrm, gsub, pattern="^\\s+|\\s+$", replacement=""), stringsAsFactors=FALSE)
str(dfrm2)
#-----------------
# 'data.frame': 11 obs. of 3 variables:
#  $ ID     : chr "1" "1" "1" "1" ...
#  $ value.A: chr "A1" "A2" "A3" "A4" ...
#  $ value.B: chr "F" "N" "B" "S" ...
dcast(dfrm2, ID ~ value.A)
#------------------
# Using value.B as value column: use value_var to override.
#   ID A1 A2 A3   A4
# 1  1  F  N  B    S
# 2  2  B  G  N <NA>
# 3  3  F  H  J    N
Answer 2 (score: 1)
Try tapply. (If the third column is already character rather than factor, the as.character can be omitted):
tapply(as.character(DF[,3]), DF[-3], c)
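To show what that call returns, here is a small worked sketch. The data frame `DF` below is a hypothetical stand-in built from the table in the question, with the value column in third position as the call assumes.

```r
# Hypothetical data: ID, the column that becomes the new headers, and the value.
DF <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  A  = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A1", "A2", "A3", "A4"),
  B  = c("F", "N", "B", "S", "B", "G", "N", "F", "H", "J", "N"),
  stringsAsFactors = FALSE
)

# DF[-3] supplies two grouping factors (ID and A), so the result is a
# character matrix with IDs as rows and the levels of A as columns.
# Cells with no matching row are left as NA.
wide <- tapply(as.character(DF[, 3]), DF[-3], c)
wide
```

This relies on every ID/A pair occurring at most once; if a pair is repeated, `c` returns more than one value for that cell and tapply falls back to a list-valued array instead of a plain character matrix.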