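Checking the row format of a CSV

Date: 2013-07-16 18:12:39

Tags: r data-structures formatting subset

I am trying to import some data (below) and check that I have the appropriate number of rows for later analysis.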

repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7", 
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838, 
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176, 
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938, 
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553, 
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L, 
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L, 
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L), 
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L, 
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName", 
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA, 
-22L))

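The data has 5 columns and is an example of the data I typically import (from a .csv file). As you can see, the "QueueName" column has three unique values. For each unique value of "QueueName", I want to check that it has 9 rows, that is, one row for each expected value in the "X8Tile" column (Average, 1, 2, 3, 4, 5, 6, 7, 8). For example, the "QueueName" value Overall has all the required rows, but usci_helpdesk does not.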

So my first priority is to at least identify whether any unique value of "QueueName" is missing some of the required rows.

My second priority is to remove all rows belonging to any unique "QueueName" that does not meet the requirement.

2 Answers:

Answer 0: (score: 0)

Both priorities are easy to handle with the Split-Apply-Combine paradigm as implemented in the plyr package.

Priority 1: identify the values of QueueName that do not have enough rows

require(plyr)

# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)

If you have many unique values of QueueName, you will want to pick out those whose row count is not 9:

rowSummary[rowSummary$numRows !=9, ] 
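
If you would rather not load plyr for this step, the same count can be sketched in base R with table(). The data frame toy below is a hypothetical stand-in for repexample (invented rows, same QueueName column), used only so the snippet runs on its own:

```r
# Hypothetical toy data standing in for repexample (not the original values)
toy <- data.frame(
  QueueName = c(rep("Overall", 9), rep("usci_helpdesk", 4)),
  stringsAsFactors = FALSE
)

# Count the rows for each unique QueueName
counts <- table(toy$QueueName)

# Names of the queues whose row count is not the expected 9
badQueues <- names(counts)[counts != 9]
print(badQueues)
```
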

Priority 2: remove the rows for any QueueName that does not have enough rows

repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repexampleEdit)

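(I don't quite understand what is meant by 'check that it has 9 rows, or the corresponding values in the "X8Tile" column'.) You can adapt the line that creates repexampleEdit as needed.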
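
One possible reading of the requirement is that each QueueName must contain exactly the nine X8Tile labels (Average, 1 through 8), not merely nine rows of anything. A minimal base R sketch under that assumption follows. The names toy, required, complete, incompleteQueues, and toyClean are hypothetical, and the toy data only mimics the layout of repexample, including the leading space in " Average":

```r
# Hypothetical toy data mimicking repexample's layout (not the original values)
toy <- data.frame(
  QueueName = c(rep("Overall", 9), rep("usci_helpdesk", 4)),
  X8Tile    = c(" Average", 1:8, " Average", 1:3),
  stringsAsFactors = FALSE
)

# The labels every queue is assumed to need
required <- c("Average", as.character(1:8))

# Strip stray whitespace such as the leading space in " Average"
tile <- trimws(as.character(toy$X8Tile))

# Per queue: TRUE only if it has exactly the required labels, each once
complete <- tapply(tile, toy$QueueName, function(x) {
  length(x) == length(required) && all(required %in% x)
})

# Priority 1: queues that fail the check
incompleteQueues <- names(complete)[!complete]

# Priority 2: keep only the rows of queues that pass
toyClean <- toy[toy$QueueName %in% names(complete)[complete], ]
```

Unlike a plain row count, this check does not depend on row order and would also flag a queue that has nine rows but a duplicated or misspelled label.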

Answer 1: (score: 0)

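Here is an approach that makes some assumptions about how your data is ordered. If those assumptions don't hold, it can be modified (or your data can be reordered):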

## Paste together the values from your "X8Tile" column
##   If all is in order, you should have "Average12345678"
##   If anything is missing, you won't....
myMatch <- names(
  which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x) 
    gsub("^\\s+|\\s+$", "", paste(x, collapse = "")))) 
        == "Average12345678"))

## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
#                  QueueName   X8Tile   Actual Calls Pop
# 1                  Overall  Average 508.1822 54948  41
# 2                  Overall        1 334.6995  6896   6
# 3                  Overall        2 404.9049  8831   5
# 4                  Overall        3 469.4069  7825   5
# 5                  Overall        4 489.2800  5768   5
# 6                  Overall        5 516.5744  7943   5
# 7                  Overall        6 551.7966  5796   5
# 8                  Overall        7 601.5104  8698   5
# 9                  Overall        8 720.9811  3191   5
# 14 CCM4.usci_retention_eng  Average 535.2467   248  11
# 15 CCM4.usci_retention_eng        1 278.2500    11   2
# 16 CCM4.usci_retention_eng        2 409.9286     9   2
# 17 CCM4.usci_retention_eng        3 511.6635    94   2
# 18 CCM4.usci_retention_eng        4 553.0000     1   1
# 19 CCM4.usci_retention_eng        5 641.0000    65   1
# 20 CCM4.usci_retention_eng        6 676.1111     9   1
# 21 CCM4.usci_retention_eng        7 778.5517    29   1
# 22 CCM4.usci_retention_eng        8 886.3667    30   1

A similar approach can be taken with aggregate + merge and similar tools.
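
As a rough illustration of what the aggregate + merge variant might look like, here is a sketch that builds the same pasted "signature" per queue and joins it back onto the rows. It assumes, as above, that the rows within each queue are already in the order Average, 1, ..., 8. The names toy, sig, signature, merged, and kept are hypothetical, and the toy data is invented to mirror the layout of repexample:

```r
# Hypothetical toy data mimicking repexample's layout (not the original values)
toy <- data.frame(
  QueueName = c(rep("Overall", 9), rep("usci_helpdesk", 4)),
  X8Tile    = c(" Average", 1:8, " Average", 1:3),
  stringsAsFactors = FALSE
)

# aggregate(): collapse each queue's X8Tile labels into one string
sig <- aggregate(toy["X8Tile"], by = toy["QueueName"],
                 FUN = function(x) paste(trimws(x), collapse = ""))
names(sig)[2] <- "signature"

# merge(): attach each queue's signature to all of its rows
merged <- merge(toy, sig, by = "QueueName")

# Keep only the rows of queues whose signature matches the expected one
kept <- merged[merged$signature == "Average12345678", names(toy)]
print(kept)
```

Note that merge() may reorder the rows, so sort the result again if the original order matters.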