I want to read a file in which each line represents a data set consisting of a date, some text, and numbers. For example:
Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015 Uptime: 109690 Threads: 2 Questions: 583895 Slow queries: 10 Opens: 2268 Flush tables: 1 Open tables: 8 Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015 Uptime: 110290 Threads: 1 Questions: 586891 Slow queries: 10 Opens: 2279 Flush tables: 1 Open tables: 6 Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015 Uptime: 110890 Threads: 2 Questions: 590871 Slow queries: 10 Opens: 2292 Flush tables: 1 Open tables: 5 Queries per second avg: 5.328
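There is no regular separator character (as there would be in CSV), but the format can be described precisely, because it is built from tabs, fixed characters, and text: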
%DATESTRING%\tUptime: %uptime% Threads: %threads% Questions: %questions% Slow queries: %slow% Opens: %opens% Flush tables: %flush% Open tables: %otables% Queries per second avg: %qps%
Is there a function that takes such a format description and a file, and fills a data.frame with the given data?
Answer 0 (score: 0)
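The tidyr package has some utility functions that may be useful for this, although I would not be surprised if more specialized tools had been built for this job.
We first load the data, in this case from a string: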
raw <- 'Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015 Uptime: 109690 Threads: 2 Questions: 583895 Slow queries: 10 Opens: 2268 Flush tables: 1 Open tables: 8 Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015 Uptime: 110290 Threads: 1 Questions: 586891 Slow queries: 10 Opens: 2279 Flush tables: 1 Open tables: 6 Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015 Uptime: 110890 Threads: 2 Questions: 590871 Slow queries: 10 Opens: 2292 Flush tables: 1 Open tables: 5 Queries per second avg: 5.328'
df <- read.csv(textConnection(raw), header = FALSE, stringsAsFactors = FALSE)
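I used read.csv so that I get the lines as a data frame right away (a single column named V1, since the lines contain no commas), but you could also use readLines and put the result into a data frame yourself.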
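A minimal sketch of the readLines route, assuming the lines should end up in a single character column called V1 like the one read.csv produces. The two shortened sample lines and the file name in the comment are placeholders, not part of the original data.

```r
# Two shortened sample lines stand in for the real file contents.
raw <- "Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2"

con <- textConnection(raw)
lines <- readLines(con)   # for a real file: readLines("status.log") (hypothetical name)
close(con)

# One character column named V1, so the later processing step can stay the same.
df <- data.frame(V1 = lines, stringsAsFactors = FALSE)
```

Unlike read.csv, this never splits a line on commas, which may matter if the text part of a line can contain one.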
Then we process it with a regular expression that has one capture group per column:
library(tidyr)
processed <- df %>% extract(V1,
                            c("Date", "Uptime", "Threads", "Questions"),
                            "(.*) *Uptime: (\\d+) *Threads: (\\d+) *Questions: (\\d+)")
processed
#                           Date Uptime Threads Questions
# 1 Fri Dec 11 12:40:01 CET 2015 108491       2    576603
# 2 Fri Dec 11 12:50:01 CET 2015 109090       2    580407
# 3 Fri Dec 11 13:00:01 CET 2015 109690       2    583895
# 4 Fri Dec 11 13:10:01 CET 2015 110290       1    586891
# 5 Fri Dec 11 13:20:01 CET 2015 110890       2    590871
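It should be clear from here how to extract the remaining columns: add one label and one capture group per column to the regular expression.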
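For completeness, here is a sketch of the same approach extended to all nine fields. It assumes the labels appear in exactly this order on every line and are separated by at least one whitespace character. The column names without spaces (SlowQueries, FlushTables, OpenTables, QPS) are my own choice, and convert = TRUE is an addition that lets tidyr guess the column types.

```r
library(tidyr)

# Two of the sample lines, held in a one-column data frame.
df <- data.frame(V1 = c(
  "Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314",
  "Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320"
), stringsAsFactors = FALSE)

# One capture group per column; the labels are matched literally.
# "(.*\\S)" keeps the date free of trailing whitespace.
pattern <- paste0(
  "^(.*\\S)\\s+Uptime: (\\d+)\\s+Threads: (\\d+)\\s+Questions: (\\d+)",
  "\\s+Slow queries: (\\d+)\\s+Opens: (\\d+)\\s+Flush tables: (\\d+)",
  "\\s+Open tables: (\\d+)\\s+Queries per second avg: ([0-9.]+)$"
)
cols <- c("Date", "Uptime", "Threads", "Questions", "SlowQueries",
          "Opens", "FlushTables", "OpenTables", "QPS")

# convert = TRUE turns the digit-only columns into numbers.
full <- tidyr::extract(df, V1, into = cols, regex = pattern, convert = TRUE)
str(full)
```

Lines that do not match the pattern come back as rows of NA, which makes malformed input easy to spot.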
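Answer 1 (score: 0)
Here are two more options. Both split each line on runs of two or more whitespace characters, so they assume that the fields in the file are separated that way.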
txt <- "Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015 Uptime: 109690 Threads: 2 Questions: 583895 Slow queries: 10 Opens: 2268 Flush tables: 1 Open tables: 8 Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015 Uptime: 110290 Threads: 1 Questions: 586891 Slow queries: 10 Opens: 2279 Flush tables: 1 Open tables: 6 Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015 Uptime: 110890 Threads: 2 Questions: 590871 Slow queries: 10 Opens: 2292 Flush tables: 1 Open tables: 5 Queries per second avg: 5.328"
## first just tack on the date label
txt <- gsub('^', 'Date: ', readLines(textConnection(txt)))
Option 1: base R only
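## split each line into "label: value" fields
sp <- strsplit(txt, '\\s{2,}')
## keep only the value part of each field
out <- lapply(sp, function(x) gsub('([\\w ]+:)\\s+(.*)$', '\\2', x, perl = TRUE))
## bind the rows and take the column names from the labels of the first line
dd <- setNames(do.call('rbind.data.frame', out),
               gsub('([\\w ]+):\\s+(.*)$', '\\1', sp[[1]], perl = TRUE))
## every column except the date is numeric
dd[, -1] <- lapply(dd[, -1], function(x) as.numeric(as.character(x)))
dd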
Option 2: this one uses the yaml package, but it is more straightforward and does the type conversion for you
yml <- gsub('\\s{2,}', '\n', txt)
do.call('rbind.data.frame', lapply(yml, yaml::yaml.load))
#                           Date Uptime Threads Questions Slow queries Opens Flush tables
# 1 Fri Dec 11 12:40:01 CET 2015 108491       2    576603           10  2238            1
# 2 Fri Dec 11 12:50:01 CET 2015 109090       2    580407           10  2253            1
# 3 Fri Dec 11 13:00:01 CET 2015 109690       2    583895           10  2268            1
# 4 Fri Dec 11 13:10:01 CET 2015 110290       1    586891           10  2279            1
# 5 Fri Dec 11 13:20:01 CET 2015 110890       2    590871           10  2292            1
#   Open tables Queries per second avg
# 1           7                  5.314
# 2           6                  5.320
# 3           8                  5.323
# 4           6                  5.321
# 5           5                  5.328
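To see why Option 2 works, it helps to look at the intermediate text: once every run of two or more whitespace characters is replaced by a newline, each record is a small YAML document with one "key: value" pair per line. The sketch below uses base R only and a shortened sample line in which I assume two spaces between fields; the keys/values split at the end is purely illustrative and is not how the yaml package parses the text.

```r
# One shortened record with fields separated by two spaces (assumed layout).
line <- "Date: Fri Dec 11 12:40:01 CET 2015  Uptime: 108491  Threads: 2  Queries per second avg: 5.314"

# Runs of two or more whitespace characters become newlines.
yml <- gsub("\\s{2,}", "\n", line)
cat(yml, "\n")

# Illustration only: split the text back into keys and values by hand.
fields <- strsplit(yml, "\n", fixed = TRUE)[[1]]
keys   <- sub(":\\s.*$", "", fields)        # text before the first ": "
values <- sub("^[^:]+:\\s+", "", fields)    # text after the first ": "
```

The single spaces inside the date and inside labels such as "Queries per second avg" survive because only runs of two or more whitespace characters are treated as separators.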