R dcast error / finding irregular IDs in a data frame

Asked: 2012-01-15 16:28:53

Tags: r find dataframe

My data frame looks like this:

ID | value A | value B
1  |   A1    |   F
1  |   A2    |   N
1  |   A3    |   B
1  |   A4    |   S
2  |   A1    |   B
2  |   A2    |   G
2  |   A3    |   N
3  |   A1    |   F
3  |   A2    |   H
3  |   A3    |   J
3  |   A4    |   N

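So each ID should have 4 rows. I am trying to use the dcast() function, but it seems to work only when all IDs have the same number of rows. In this example, ID 2 would be the error case. Is there a simple way to find all IDs that have more or fewer than 4 rows? Or is there a way to make dcast ignore the error cases?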

Ultimately I am trying to reshape the data frame to get a result like this:

ID | A1 | A2 | A3 | A4
 1 | F  | N  | B  | S 
 2 | B  | G  | N  | NA
 3 | F  | H  | J  | N

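Apparently the dcast() function from the reshape2 package does not work with irregular IDs. It gives me the following message: 'Aggregation function missing: defaulting to length'. With a small subset of my data set that lacks those irregular IDs, however, it works. Any ideas? Or perhaps a suggestion for reshaping my data frame without dcast? Thanks!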

I am working on a Mac with the following R and package versions:

sessionInfo() 
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.1 plyr_1.7.1    

loaded via a namespace (and not attached):
[1] stringr_0.6

The values in the first column are all integers; the other columns are character.

sapply(x, class)
         ID      fach01      f01_lp 
  "integer" "character" "character" 

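As for a reproducible example: I hope this helps (I used my original data frame). dcast() works perfectly well if I use only the first 500 rows; the problem appears when I try the whole data frame of about 140,000 rows.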

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L),  A = c("2.LF", 
"1.LF", "3.PF", "4.PF", "3.PF", "1.LF", "2.LF", "3.PF", 
"4.PF", "1.LF", "2.LF", "3.PF", "1.LF", "4.PF", "2.LF", "1.LF", 
"2.LF", "4.PF", "3.PF", "1.LF", "3.PF", "2.LF", "4.PF", "3.PF", 
"4.PF", "1.LF", "2.LF", "4.PF", "2.LF", "3.PF", "1.LF", "1.LF", 
"2.LF", "3.PF", "4.PF"), B = c("Mu/Ku", 
"Fs", "2.AF", "NW", "DE", "2.AF", "MA", "Fs", "2.AF", "NW", 
"NW", "Fs", "2.AF", "bel", "NW", "Fs", "bel", "bel", "NW", "DE", 
"2.AF", "2.AF", "MA", "Fs", "2.AF", "MA", "NW", "DE", "2.AF", 
"MA", "NW", "Mu/Ku", "Fs", "2.AF", "NW")), .Names = c("ID", "A", "B"
), row.names = c(NA, -35L), class = "data.frame")

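In my original data frame the values A1-A4 (here called 1.LF, 2.LF, 3.PF and 4.PF) are not in order within each ID. This is what I want dcast to produce (same layout as above, shown for the first three IDs):

ID | 1.LF | 2.LF  | 3.PF | 4.PF
 1 | Fs   | Mu/Ku | 2.AF | NW
 2 | 2.AF | MA    | DE   | <NA>
 3 | NW   | NW    | Fs   | 2.AF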

Edit:

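I did not solve the dcast() problem, but I found a way around it, using base R's reshape() function (it lives in the stats package, not in the reshape package):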

df <- reshape(df, idvar = 'ID', timevar = 'A', direction = 'wide')
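
To make the workaround concrete, here is a minimal sketch on toy data that mirrors the first table in the question. The object names `toy` and `wide` and the column names `A` and `B` are assumptions for illustration, not the asker's real column names. It uses only base R's reshape() and shows that an ID with a missing row simply gets NA in the corresponding column.

```r
# Toy data mirroring the first table in the question (names are illustrative)
toy <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  A  = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A1", "A2", "A3", "A4"),
  B  = c("F", "N", "B", "S", "B", "G", "N", "F", "H", "J", "N"),
  stringsAsFactors = FALSE
)

# One row per ID, one column per value of A; absent ID/A pairs become NA
wide <- reshape(toy, idvar = "ID", timevar = "A", direction = "wide")

# reshape() names the new columns "B.A1", "B.A2", ...; drop the "B." prefix
names(wide) <- sub("^B\\.", "", names(wide))
print(wide)
```

Note that reshape() keeps only the first row it meets for each ID/A pair, so duplicated pairs would be dropped silently rather than reported.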

3 answers:

Answer 0 (score: 2)

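You should have mentioned that dcast comes from the reshape2 package (it is not part of base R). I'm not sure what you want to do with the result, but this should do what you asked.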

Make up some data:

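library(plyr)  # provides ddply() and .()

# IDs 1 and 3 get four rows each, ID 2 only three
id <- rep(1:3,c(4,3,4))
d <- data.frame(id)
# Within each id, label the rows A1, A2, ... and attach a random letter as B
d <- ddply(d,.(id),
           function(x) {
             transform(x,A=paste("A",seq(nrow(x)),sep=""),
                       B=sample(LETTERS,nrow(x),replace=TRUE))
           })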

Identify the 'bad' groups:

idtab <- table(d$id)
# Keep only the rows whose id has at least 4 rows
d2 <- d[!d$id %in% names(idtab)[idtab<4],]

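Although I can do that, when I try dcast on the full data set it does the "right" thing (i.e. what I would hope for, and what you want) and fills in the missing values with NA; I do not get an error (I'm using reshape2 v 0.8.4 under a development version of R).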

library(reshape2)

Using the cleaned-up data:

dcast(d2,id~A)
# Using B as value column: use value.var to override.
#   id A1 A2 A3 A4
# 1  1  B  X  P  E
# 2  3  F  Q  H  B

Using the original data:

dcast(d,id~A)
# Using B as value column: use value.var to override.
#   id A1 A2 A3   A4
# 1  1  B  X  P    E
# 2  2  I  N  H <NA>
# 3  3  F  Q  H    B
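
As far as I know, reshape2 prints the message quoted in the question, 'Aggregation function missing: defaulting to length', when at least one ID/variable combination occurs more than once, rather than when rows are missing. A hedged sketch for locating such duplicated combinations with base R follows; the data and the names `toy`, `key`, `dup` and `counts` are made up for illustration.

```r
# Toy data in which ID 1 has the value "A2" twice (illustrative only)
toy <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  A  = c("A1", "A2", "A2", "A1", "A2"),
  B  = c("F", "N", "B", "G", "H"),
  stringsAsFactors = FALSE
)

# Flag every row whose (ID, A) pair appears more than once
key <- toy[c("ID", "A")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
print(toy[dup, ])

# Alternative view: count rows per ID/A cell and list the cells holding > 1 row
counts <- table(toy$ID, toy$A)
print(which(counts > 1, arr.ind = TRUE))
```

Running the same check on the full 140,000-row data frame should show which IDs need cleaning before dcast can return one value per cell.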

Answer 1 (score: 2)

table and which are surely the answer to the first question:

 names(table(dfrm$ID))[which(table(dfrm$ID) <4)]
#[1] "2"
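
The question asks about IDs with more or fewer than 4 rows, so a small variation on the line above is to compare with `!=` instead of `<`. This is a sketch on a made-up ID vector; `ids`, `n_per_id` and `irregular` are illustrative names.

```r
# Made-up IDs: 2 has three rows, 4 has five rows, the others have four
ids <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4)

# Number of rows per ID
n_per_id <- table(ids)

# IDs whose row count is anything other than 4
irregular <- names(n_per_id)[n_per_id != 4]
print(irregular)
# [1] "2" "4"
```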

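As for the second question, perhaps you should post the code that produces the error. It is still unclear what you are trying (and failing) to do.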

Edit:

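If I convert the factor variables to character variables, I can get dcast to return the correct object, although my error is different from yours. On a Mac, I get the error with both reshape2 1.1 and reshape2 1.2.1 on R 2.14.1.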

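Edit 2: It turns out the error has been fixed in the most recent version of plyr. Running reshape2 1.2.1 with plyr 1.7 gives no error. You should also update both packages and then restart R.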

require(reshape2)
dfrm <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3), value.A = structure(c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L), .Label = c("   A1    ", 
"   A2    ", "   A3    ", "   A4    "), class = "factor"), value.B = structure(c(2L, 
6L, 1L, 7L, 1L, 3L, 6L, 2L, 4L, 5L, 6L), .Label = c("   B", "   F", 
"   G", "   H", "   J", "   N", "   S"), class = "factor")), .Names = c("ID", 
"value.A", "value.B"), class = "data.frame", row.names = c(NA, 
-11L))
dcast(dfrm, ID ~ value.A)
# Using value.B as value column: use value_var to override.
# Error in names(data) <- array_names(res$labels[[2]]) : 
#  'names' attribute [4] must be the same length as the vector [1]
# I first tried removing the leading and trailing spaces with:
dfrm2 <- data.frame(lapply(dfrm, gsub, patt="^\\s+|\\s+$", rep=""))
# Still got the error. Now try to leave as "character" type.

dfrm2 <- data.frame(lapply(dfrm, gsub, patt="^\\s+|\\s+$", rep=""),stringsAsFactors=FALSE)
str(dfrm2)
#-----------------
# 'data.frame':   11 obs. of  3 variables:
#  $ ID     : chr  "1" "1" "1" "1" ...
#  $ value.A: chr  "A1" "A2" "A3" "A4" ...
#  $ value.B: chr  "F" "N" "B" "S" ...

dcast(dfrm2, ID ~ value.A)
#------------------
# Using value.B as value column: use value_var to override.
#   ID A1 A2 A3   A4
# 1  1  F  N  B    S
# 2  2  B  G  N <NA>
# 3  3  F  H  J    N

Answer 2 (score: 1)

Try tapply. (If the third column is already character rather than factor, the as.character can be omitted):

tapply(as.character(DF[,3]), DF[-3], c)
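
For a self-contained illustration of the tapply line, here is a sketch on toy data shaped like the first table in the question. The definition of `DF` is an assumption, since the answer does not show it. The result is a character matrix rather than a data frame, with the IDs as row names and NA where an ID lacks a value.

```r
# Toy data mirroring the first table in the question (DF is assumed here)
DF <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  A  = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A1", "A2", "A3", "A4"),
  B  = c("F", "N", "B", "S", "B", "G", "N", "F", "H", "J", "N"),
  stringsAsFactors = FALSE
)

# Cross-classify the third column by the first two; each cell holds one value
wide <- tapply(as.character(DF[, 3]), DF[-3], c)
print(wide)
```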