Split variable names and melt the data along with that split in R

Time: 2015-06-28 18:58:19

Tags: r string data-cleansing

I have some perfmon (Windows performance log) data that I want to parse.

A typical set of column names looks like this:

> colnames(p)
[1] "Time"                                                         
[2] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length"      
[3] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length" 
[4] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length"
[5] "\\\\testdb1\\Processor(_Total)\\% Processor Time"             
[6] "\\\\testdb1\\System\\Processes"                               
[7] "\\\\testdb1\\System\\Processor Queue Length"   

This is how I read the data into R:

p <- read.csv("r-perfmon.csv",stringsAsFactors = FALSE, check.names = FALSE)

Here is some sample data:

> head(p)
                     Time \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length
1 04/15/2013 00:00:19.279                                             0.040037563
2 04/15/2013 00:00:34.279                                             0.009740260
3 04/15/2013 00:00:49.275                                             0.011009828
4 04/15/2013 00:01:04.284                                             0.006016244
5 04/15/2013 00:01:19.279                                             0.015125328
6 04/15/2013 00:01:34.275                                             0.002814141
  \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length
1                                                  0.001421333
2                                                  0.000000000
3                                                  0.000206726
4                                                  0.000000000
5                                                  0.001894000
6                                                  0.000000000
  \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length
1                                                   0.038616230
2                                                   0.009740260
3                                                   0.010803102
4                                                   0.006016244
5                                                   0.013231327
6                                                   0.002814141
  \\\\testdb1\\Processor(_Total)\\% Processor Time \\\\testdb1\\System\\Processes
1                                        29.569339                             86
2                                        10.856994                             86
3                                         7.733924                             81
4                                         1.910202                             81
5                                         6.164864                             81
6                                         1.351883                             81
  \\\\testdb1\\System\\Processor Queue Length
1                                           0
2                                           0
3                                           0
4                                           0
5                                           0
6                                           0

I want to be able to parse the column names and then melt the data.

So, taking one column of data as an example:

> example <- p[2]
> head(example)
  \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length
1                                             0.040037563
2                                             0.009740260
3                                             0.011009828
4                                             0.006016244
5                                             0.015125328
6                                             0.002814141

I would like it to look like this:

Time, MachineName, Object, Counter, InstanceName, Value
04/15/2013 00:00:19.279, testdb1, PhysicalDisk, Avg. Disk Queue Length, 0 C:, 0.040037563
04/15/2013 00:00:34.279, testdb1, PhysicalDisk, Avg. Disk Queue Length, 0 C:, 0.009740260
04/15/2013 00:00:49.275, testdb1, PhysicalDisk, Avg. Disk Queue Length, 0 C:, 0.011009828

Edit: as requested, here is the dput of the head of my data:

structure(list(`(PDH-CSV 4.0) (GMT Daylight Time)(-60)` = c("04/15/2013 00:00:19.279", 
"04/15/2013 00:00:34.279", "04/15/2013 00:00:49.275", "04/15/2013 00:01:04.284", 
"04/15/2013 00:01:19.279", "04/15/2013 00:01:34.275"), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length` = c(0.040037563, 
0.00974026, 0.011009828, 0.006016244, 0.015125328, 0.002814141
), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length` = c(0.001421333, 
0, 0.000206726, 0, 0.001894, 0), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length` = c(0.03861623, 
0.00974026, 0.010803102, 0.006016244, 0.013231327, 0.002814141
), `\\\\testdb1\\Processor(_Total)\\% Processor Time` = c(29.56933862, 
10.85699395, 7.733924001, 1.910202013, 6.164864178, 1.351882837
), `\\\\testdb1\\System\\Processes` = c(86L, 86L, 81L, 81L, 81L, 
81L), `\\\\testdb1\\System\\Processor Queue Length` = c(0L, 0L, 0L, 
0L, 0L, 0L)), .Names = c("(PDH-CSV 4.0) (GMT Daylight Time)(-60)", 
"\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length", "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length", 
"\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length", 
"\\\\testdb1\\Processor(_Total)\\% Processor Time", "\\\\testdb1\\System\\Processes", 
"\\\\testdb1\\System\\Processor Queue Length"), row.names = c(NA, 
6L), class = "data.frame")

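1 Answer:

Answer 0 (score: 1)

It is a little hard to know what your final data should look like: because each column name is split on backslashes or parentheses, the result has a different number of columns depending on the input column.

So I split each column into a separate list element. Assuming the data.frame from the dput is called d: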

# Look at second column - then all you need to do is tweak the names
s <- strsplit(colnames(d)[2], "\\\\|\\)|\\(")[[1]]
data.frame(time = d[[1]], t(s[nzchar(s)]), value=d[[2]])

                     time      X1           X2   X3                     X4       value
1 04/15/2013 00:00:19.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.040037563
2 04/15/2013 00:00:34.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.009740260
3 04/15/2013 00:00:49.275 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.011009828
4 04/15/2013 00:01:04.284 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.006016244
5 04/15/2013 00:01:19.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.015125328
6 04/15/2013 00:01:34.275 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.002814141

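strsplit splits each string on \, ( or ). Note that in an R regular expression these characters need to be escaped with a leading \\. The split leaves some empty strings, which are removed using the nzchar function (it returns FALSE for zero-length strings).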
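To make the splitting step concrete, here is a minimal sketch run on two of the column names from the question. The variable names nm, nm2, s, parts and parts2 are only illustrative. It also shows why the number of pieces varies: names without an "(instance)" part yield three pieces instead of four.

```r
# Two perfmon-style column names, written as R string literals
# (each backslash is doubled in source code).
nm  <- "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length"
nm2 <- "\\\\testdb1\\System\\Processes"

# Split on a backslash, ")" or "("; [[1]] extracts the single result vector.
s <- strsplit(nm, "\\\\|\\)|\\(")[[1]]
print(s)  # includes empty strings from the leading and adjacent delimiters

# nzchar() is TRUE for non-empty strings, so this keeps only the real pieces.
parts <- s[nzchar(s)]
print(parts)

# A name without an "(instance)" part gives fewer pieces.
s2 <- strsplit(nm2, "\\\\|\\)|\\(")[[1]]
parts2 <- s2[nzchar(s2)]
print(parts2)
```

With these inputs, parts holds the machine, object, instance and counter, while parts2 has no instance element.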

# Apply it over all variables
lapply(seq_along(colnames(d))[-1], function(i) {
                 s <- strsplit(colnames(d)[[i]], "\\\\|\\)|\\(")[[1]]
                 data.frame(time = d[[1]], t(s[nzchar(s)]), value=d[[i]])
})

Again, you will need to rename the columns.
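One possible way to do the renaming and stack everything into a single long data frame is sketched below. This is only a sketch under assumptions: parse_counter is a hypothetical helper, the small d here is a two-row stand-in for the dput data, and the code assumes that four pieces mean machine, object, instance, counter, while three pieces mean there is no instance (filled with NA). Counter names that themselves contain parentheses or backslashes would need different handling.

```r
# Small stand-in for the data frame `d` (two rows, two counters).
d <- data.frame(
  c("04/15/2013 00:00:19.279", "04/15/2013 00:00:34.279"),
  c(0.040037563, 0.00974026),
  c(86L, 86L),
  stringsAsFactors = FALSE
)
colnames(d) <- c("Time",
                 "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length",
                 "\\\\testdb1\\System\\Processes")

# Hypothetical helper: split a column name into four named fields.
parse_counter <- function(nm) {
  s <- strsplit(nm, "\\\\|\\)|\\(")[[1]]
  s <- s[nzchar(s)]
  # Assumption: 3 pieces means there is no "(instance)" part, so insert NA.
  if (length(s) == 3L) s <- c(s[1:2], NA_character_, s[3])
  stopifnot(length(s) == 4L)
  names(s) <- c("MachineName", "Object", "InstanceName", "Counter")
  s
}

# Build one long data frame per counter column, then stack them.
long <- do.call(rbind, lapply(seq_along(d)[-1], function(i) {
  p <- parse_counter(colnames(d)[i])
  data.frame(Time         = d[[1]],
             MachineName  = p[["MachineName"]],
             Object       = p[["Object"]],
             Counter      = p[["Counter"]],
             InstanceName = p[["InstanceName"]],
             Value        = as.numeric(d[[i]]),
             stringsAsFactors = FALSE)
}))
print(long)
```

The result has the columns Time, MachineName, Object, Counter, InstanceName and Value, in the order requested in the question, with NA as the InstanceName for counters such as System\Processes.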