Subsetting the rows of a data frame by unique column values

Asked: 2013-05-08 20:05:29

Tags: r dataframe

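Background: I have a data frame in which one column contains repeated values. I am trying to split this data frame by picking out all rows that share a column value, processing them, and then producing a new data frame that contains all of the processed rows.

I was surprised by what the following code does: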

    dataSet <- structure(list(DAY = structure(1:10, .Label = c("Tuesday", 
    "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
    "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
    "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
    "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
    "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
    "Tuesday", "Tuesday", "Tuesday"), class = "factor"), 
        variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L), .Label = c("act1", "act2", "act3", "act4", 
        "act5", "act12", "act19", "act116", "act22", 
        "act6", "act13", "act111", "act117", "act23", 
        "act7", "act14", "act112", "act118", "act24", 
        "act8", "act15", "act113", "act119", "act25", 
        "act9", "act16", "act114", "act20", "act26", 
        "act10", "act17", "act115", "act21", "act27", 
        "act11", "act18"), class = "factor"), value = c(67, 
        65, 40, 79, 106, 90, 57, 59, 2, 12)), .Names = c("DAY", 
    "variable", "value"), row.names = c(NA, 10L), class = "data.frame")


    uniq <- unique(dataSet$variable)
    for (i in 1:length(uniq)){
         rowsPerVal <- dataSet[dataSet$variable == uniq[i], ]
         print(length(rowsPerVal))
    }

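I just don't understand how the final print statement can report a length of 3 when the data frame has 10 records with the same value in the variable column.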

2 Answers:

Answer 0 (score: 3)

plyr also works well for this kind of split-apply-combine problem (split the data into chunks, operate on each chunk, then recombine the results).

    library("plyr")
    ddply(dataSet, .(variable), nrow)
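For readers who do not have plyr installed, the three split-apply-combine steps can be sketched in base R. This is only an illustration under stated assumptions: the data frame df, its values, and the summary columns n and mean_value are made up here and are not part of the original question.

```r
# Hypothetical example data (not the asker's dataSet).
df <- data.frame(
  variable = c("act1", "act1", "act2", "act2", "act2"),
  value    = c(10, 20, 1, 2, 3),
  stringsAsFactors = FALSE
)

# 1. Split: one data frame per distinct value of `variable`.
chunks <- split(df, df$variable)

# 2. Apply: an anonymous function summarises each chunk as a one-row data frame.
processed <- lapply(chunks, function(d) {
  data.frame(
    variable   = d$variable[1],
    n          = nrow(d),
    mean_value = mean(d$value),
    stringsAsFactors = FALSE
  )
})

# 3. Combine: stack the per-chunk results back into a single data frame.
result <- do.call(rbind, processed)
rownames(result) <- NULL
print(result)
```

The anonymous function is the place to put whatever per-group processing is needed; it can return several rows per chunk as long as every chunk yields the same columns.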

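As others have said, the length() of a data.frame is the number of columns; nrow() is the number of rows.

    > ddply(dataSet, .(variable), nrow)
      variable V1
    1     act1 10

You can replace nrow with an (anonymous) function that does whatever processing you want.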
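To make the length() versus nrow() distinction concrete, here is a minimal base R sketch. The data frame df and its values are invented for illustration (it is not the asker's dataSet); the loop mirrors the one in the question with length() swapped for nrow().

```r
# Hypothetical stand-in for the question's data: 3 columns, 5 rows.
df <- data.frame(
  DAY      = rep("Tuesday", 5),
  variable = c("act1", "act1", "act1", "act2", "act2"),
  value    = c(67, 65, 40, 79, 106),
  stringsAsFactors = FALSE
)

length(df)  # number of columns, because a data frame is a list of columns
ncol(df)    # the same count, stated explicitly
nrow(df)    # number of rows

# The question's loop, counting rows per distinct value with nrow().
uniq <- unique(df$variable)
counts <- integer(length(uniq))
for (i in seq_along(uniq)) {
  rowsPerVal <- df[df$variable == uniq[i], ]
  counts[i] <- nrow(rowsPerVal)
}
names(counts) <- uniq
print(counts)
```

Subsetting rows never changes the number of columns, which is why the original loop printed 3 no matter how many rows matched.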

Answer 1 (score: 1)

duplicated() returns TRUE for every occurrence of a value after the first one, so you can use it to index your rows:

    dataSet[duplicated(dataSet$variable),]
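Because duplicated() skips the first occurrence, the subset above drops one row per repeated value. If the goal is every row whose value occurs more than once, one possible approach is to combine a forward and a backward pass. The vector x below is a made-up example, and the names later_only and all_repeats are illustrative only.

```r
# Hypothetical example vector.
x <- c("a", "b", "a", "c", "b", "a")

# TRUE for every occurrence after the first one.
later_only <- duplicated(x)

# TRUE for every element whose value occurs more than once,
# including the first occurrence.
all_repeats <- duplicated(x) | duplicated(x, fromLast = TRUE)

print(later_only)   # FALSE FALSE  TRUE FALSE  TRUE  TRUE
print(all_repeats)  #  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
```

Either logical vector can be used as a row index, for example df[all_repeats, ] for a data frame df whose column produced x.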

You can also assign to them:

    dataSet[duplicated(dataSet$variable),]$value <- NA
    > dataSet
           DAY variable value
    1  Tuesday     act1    67
    2  Tuesday     act1    NA
    3  Tuesday     act1    NA
    4  Tuesday     act1    NA
    5  Tuesday     act1    NA
    6  Tuesday     act1    NA
    7  Tuesday     act1    NA
    8  Tuesday     act1    NA
    9  Tuesday     act1    NA
    10 Tuesday     act1    NA
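Note that the assignment above overwrites dataSet in place. A hedged alternative, assuming you want to keep the original values around, is to work on a copy and index the column directly. The data frame df and the name masked are hypothetical and used only for this sketch.

```r
# Hypothetical example data (not the asker's dataSet).
df <- data.frame(
  variable = c("act1", "act1", "act2", "act1"),
  value    = c(67, 65, 40, 79),
  stringsAsFactors = FALSE
)

# Work on a copy so the original data frame is left untouched.
masked <- df

# Index the column vector directly with the logical mask from duplicated().
masked$value[duplicated(masked$variable)] <- NA

print(masked)
print(df)  # still holds the original values
```

Indexing the column first (masked$value[...]) avoids building an intermediate row subset, which some readers find easier to follow than assigning through dataSet[rows, ]$value.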

To "spit out a new data frame with all the processed rows", you can process the subsetted data.frame however you need:

    newDF <- transform(dataSet[duplicated(dataSet$variable), ], DAY = sub("esd", "foo", DAY))
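A small self-contained run of the same idea may help show what newDF ends up containing. The data frame df below is invented for illustration, and doubling value is an arbitrary stand-in for real processing.

```r
# Hypothetical example data (not the asker's dataSet).
df <- data.frame(
  DAY      = rep("Tuesday", 4),
  variable = c("act1", "act1", "act2", "act1"),
  value    = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Rows whose `variable` value has already appeared earlier (rows 2 and 4 here).
dups <- df[duplicated(df$variable), ]

# transform() returns a modified copy; df and dups are not changed.
newDF <- transform(dups,
                   DAY   = sub("esd", "foo", DAY),
                   value = value * 2)

print(newDF)
```

Several columns can be rewritten in one transform() call, and the result keeps the row names of the rows it was built from, which makes it easy to trace processed rows back to the source data frame.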