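I am trying to use R to parse files of this type, extract some of the information, and put the data into a data frame.
This is the content of the file: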
last_run current_run seconds
------------------------------- ------------------------------- -----------
Jul 4 2016 7:17AM Jul 4 2016 7:21AM 226
Engine Utilization (Tick %) User Busy System Busy I/O Busy Idle
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Engine 0 5.0 % 0.4 % 22.4 % 72.1 %
Engine 1 3.9 % 0.5 % 22.8 % 72.8 %
Engine 2 5.6 % 0.3 % 22.5 % 71.6 %
Engine 3 5.1 % 0.4 % 22.7 % 71.8 %
------------------------- ------------ ------------ ---------- ----------
Pool Summary Total 336.1 % 25.6 % 1834.6 % 5803.8 %
Average 4.2 % 0.3 % 22.9 % 72.5 %
------------------------- ------------ ------------ ---------- ----------
Server Summary Total 336.1 % 25.6 % 1834.6 % 5803.8 %
Average 4.2 % 0.3 % 22.9 % 72.5 %
Transaction Profile
-------------------
Transaction Summary per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Committed Xacts 137.3 n/a 41198 n/a
Average Runnable Tasks 1 min 5 min 15 min % of total
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Global Queue 0.0 0.0 0.0 0.0 %
Engine 0 0.0 0.1 0.1 0.6 %
Engine 1 0.0 0.0 0.0 0.0 %
Engine 2 0.2 0.1 0.1 2.6 %
------------------------- ------------ ------------ ----------
Pool Summary Total 7.2 5.9 6.1
Average 0.1 0.1 0.1
------------------------- ------------ ------------ ----------
Server Summary Total 7.2 5.9 6.1
Average 0.1 0.1 0.1
Device Activity Detail
----------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_125
datadev_125 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_126
datadev_126 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_127
datadev_127 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
APF 0.0 0.0 5 0.4 %
Non-APF 0.0 0.0 1 0.1 %
Writes 3.8 0.0 1128 99.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os 3.8 0.0 1134 0.1 %
Mirror Semaphore Granted 3.8 0.0 1134 100.0 %
Mirror Semaphore Waited 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/sybaser/database/sybaseR/dev/sybaseR.datadev_000
GPS_datadev_000 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
APF 7.9 0.0 2372 55.9 %
Non-APF 5.5 0.0 1635 38.6 %
Writes 0.8 0.0 233 5.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os 14.1 0.0 4240 0.3 %
Mirror Semaphore Granted 14.1 0.0 4239 100.0 %
Mirror Semaphore Waited 0.0 0.0 2 0.0 %
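I need to capture "Jul 4 2016 7:21AM" (the current_run value) as the date; from the "Engine Utilization (Tick %)" section, the Server Summary -> Average value "4.2 %";
and from the "Transaction Profile" section, the Transaction Summary "count" entry.
So my data frame should look like this: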
Date Cpu Count
Jul 4 2016 7:21AM 4.2 41198
Can someone help me work out how to parse this file to get this output?
I have tried something like this:
read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])
which gives this line:
Average 4.2 % 0.3 % 22.9 % 72.5 %
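But I want to extract the Average line that comes immediately after Engine Utilization (Tick %), because there may be many lines starting with Average. The Average line that appears right after Engine Utilization (Tick %) is the one I want.
How can I build that into this line, to extract this information from the file: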
read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])
Can I use grep inside this read.table line to search for certain characters?
Answer 0 (score: 2)
%%%% Shot 1 - getting something to work
%%% Shot 2: first attempt at extracting the (possibly variable) number of device columns
extract <- function(filenam="file.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}
print(extract("file.txt"))
##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))
print(merged)
file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop
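All three columns returned by extract() are character strings. If you want to sort or plot the merged result, a conversion step along these lines may help. This is only a sketch: the one-row data frame below is a hypothetical stand-in for the function's output, and the format string assumes an English locale (for "Jul" and "AM") and that UTC is an acceptable time zone.

```r
## Hypothetical one-row result, shaped like the output of extract() above
res <- data.frame(Date = "Jul 4 2016 7:21AM", Cpu = "4.2", Count = "41198",
                  stringsAsFactors = FALSE)

## Convert the character columns to more useful types.
## %I (12-hour clock) is paired with %p (AM/PM); tz = "UTC" is an assumption.
res$Date  <- as.POSIXct(strptime(res$Date, "%b %d %Y %I:%M%p", tz = "UTC"))
res$Cpu   <- as.numeric(res$Cpu)
res$Count <- as.integer(res$Count)
str(res)
```

The same three assignments can be applied to the merged data frame after the rbind step.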
%%%%%%% Shot 3: extract two tables, one containing a single row and the other a variable number of rows (depending on the devices listed in each sysmon file).
extractv2 <- function(filenam="file2.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
## Total I/Os
## 1. Each line "Device:" is assumed to be the header of a block of lines
## containing info about a single device (there are 4 such blocks
## in your example);
## 2. each block is assumed to contain one or more lines matching
## "Total I/Os";
## 3. the relevant count data is assumed to be contained in the last
## of such lines (5th whitespace-separated token, the "count" column), for each block.
## Approach: loop on the line numbers of those lines matching "Device:"
## to get: A. counts; B. device names
ii_block_dev <- grep("Device:", txt, fixed=TRUE)
ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
nblocks <- length(ii_block_dev)
## A. get counts for each device
## for each block, select *last* line matching "Total I/Os"
ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
## select the lines matching "Total I/Os" within each block
IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
ii_lines_IOs < ii_block_dev_aux[block+1]
]
tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
})
lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
Filter(function(v) v!="", strsplit(strng," ")[[1]])
})
IOs_counts <- sapply(lines_IOs, function(v) v[5])
## B. get device names:
## assumed to be on lines following each "Device:" match
ii_devices <- 1 + ii_block_dev
device_names <- sapply(ii_devices, function(ii){
Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
})
## Create a data.frame with "device_names" as column names and "IOs_counts" as
## the values of a single row.
## Sorting the device names by order() will help produce the same column names
## if different sysmon files list the devices in different order
ord <- order(device_names)
devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
check.names=FALSE) ## Prevent R from messing with our device names
data.frame(stringsAsFactors=FALSE, check.names=FALSE,
Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))
## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)
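If the device names do differ between sysmon files, one possible workaround is to pad each one-row data frame with NA columns before binding. The sketch below uses base R only; rbind_fill() is a hypothetical helper name, and the two small data frames are made-up stand-ins for the output of extractv2().

```r
## rbind_fill() is a hypothetical helper, not part of base R
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    ## add any device column this file lacks, filled with NA
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA_character_
    d[all_cols]   # same column order in every data frame
  })
  do.call("rbind", padded)
}

## Made-up results from two files that list different devices
a <- data.frame(Date = "d1", dev_1 = "0", dev_2 = "1134",
                stringsAsFactors = FALSE, check.names = FALSE)
b <- data.frame(Date = "d2", dev_2 = "7", dev_3 = "4240",
                stringsAsFactors = FALSE, check.names = FALSE)
merged <- rbind_fill(list(a, b))
print(merged)
```

A device that is missing from a file then shows up as NA in that file's row, instead of making rbind fail.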
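Answer 1 (score: 2)
Text files are sometimes easier to manipulate with a dedicated program. gawk, for example, is designed specifically for finding patterns in text files and outputting data from them. We can use a short gawk script to get the required data ready to load into R. Note that each line of the script consists of a pattern to look for, followed by an action enclosed in {}. NR is a counter holding the number of lines read so far.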
BEGIN {OFS = ""; ORS = ""}
/current_run/ {dat_line = NR+2; cpu_done = 0}
/Server Summary/ {cpu_line = NR+1}
/Transaction Summary/ {cnt_line = NR+2}
NR == dat_line {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0 {print $2," "; cpu_done = 1}
NR == cnt_line {print $5,"\n"}
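Save this script under the name "ext.awk", then extract all the data files into an R data frame (assuming they are all in one folder and have the extension .txt) with:
df <- read.table(text=system("gawk -f ext.awk *.txt", intern = TRUE), col.names = c("Date","Cpu","Count"))
Note that gawk comes pre-installed on most Linux distributions. On Windows you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm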
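To see why the script wraps the date in single quotes, it can help to feed read.table a couple of lines in the format the script prints. The sketch below does that without running gawk at all; the second line is made-up data for illustration only.

```r
## Two lines in the format the gawk script prints (one line per input file).
## The first uses the values from the question; the second is hypothetical.
awk_out <- c("'Jul 4 2016 7:21AM' 4.2 41198",
             "'Jul 4 2016 7:26AM' 5.0 42310")

## read.table treats ' as a quote character by default, so the date,
## which contains spaces, stays in a single column.
df <- read.table(text = awk_out, col.names = c("Date", "Cpu", "Count"),
                 stringsAsFactors = FALSE)
str(df)
```

Without the quotes, the four date tokens would be read as four separate columns.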
Answer 2 (score: 0)
To read the files, I assume here that they are CSV. For other file types, see http://www.r-tutor.com/r-introduction/data-frame/data-import
>utilization <- read.csv(file="",head=TRUE)
>serverSummary <-read.csv(file="",head=TRUE)
>transcProfile <- read.csv(file="",head=TRUE)
==> merge accepts only two data frames at a time
>data <- merge(utilization,serverSummary)
>dataframe <-merge(data,transcProfile)
Now you will have all the columns in one data frame
>dataframe
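You can see all the columns in dataframe.
Select the columns you need ==> the subset() function is the easiest way to select variables and observations
>subset(dataframe,select=c("last_run","Average","Transaction Profile"))
Now you can write it to CSV or any other file type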
>write.csv(dataframe, file = "MyData.csv")
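Merging all the files together
multmerge = function(mypath){
filenames=list.files(path=mypath, full.names=TRUE)
datalist = lapply(filenames, function(x){read.csv(file=x,header=T)})
Reduce(function(x,y) {merge(x,y)}, datalist)
}
Once you have run the code to define the function, you are ready to use it. The function takes a path. This path should be the name of a folder that contains all of the files you want to read and merge, and only those files. With that in mind, I have two tips:
Before using this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all the files you want to merge in that folder. In addition, make sure the columns that will be used for matching are formatted the same way (and have the same name) in every file. Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you would like to read and merge them. To use my function, use the following syntax:
mymergeddata = multmerge("C://R//mergeme")
After running this command, you have a fully merged data frame, with all of your variables matched to each other.
Now you can subset the data frame to the columns you need.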
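To illustrate what the Reduce/merge step does, here is a small sketch on in-memory data frames instead of CSV files. The three data frames and their shared key column "id" are made up for the example.

```r
## Three hypothetical data frames sharing the key column "id"
a  <- data.frame(id = 1:3, x = c(10, 20, 30))
b  <- data.frame(id = 1:3, y = c("a", "b", "c"))
c3 <- data.frame(id = 2:3, z = c(TRUE, FALSE))

## merge() joins two data frames on their common column names;
## Reduce() applies it pairwise across the whole list.
merged <- Reduce(function(x, y) merge(x, y), list(a, b, c3))
print(merged)
```

Because merge() keeps only rows whose key appears in both inputs by default, the row with id 1 is dropped here; pass all = TRUE to merge() if you want to keep unmatched rows.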
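Answer 3 (score: 0)
Use readLines or stringi::stri_read_lines to read the contents of the file as a character vector. The latter is typically faster, but less mature, and occasionally breaks on unusual content.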
lines <- readLines("the file name")
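For fast regular expression matching, stringi is typically the best choice. rebus.datetimes lets you generate a regular expression from a strptime date format string.
The line where current_run appears is found with: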
library(stringi)
library(rebus.datetimes)
i_current_run <- which(stri_detect_fixed(lines, "current_run"))
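To extract the dates, this code only looks at the second line after the one where current_run is found, but the code is vectorizable, so you can easily look at all lines if you have files where that assumption does not hold.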
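## Caveat: R's strptime documents %p as used with %I, not %H, so with this
## format a PM time may not be shifted to the afternoon.
date_format <- "%b%t%d%t%Y%t%H:%M%p"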
rx_date <- rebus.datetimes::datetime(date_format, io = "input")
extracted_dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)
current_run_date <- strptime(
extracted_dates[[1]][2], date_format, tz = "UTC"
)
## [1] "2016-07-04 07:21:00 UTC"
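If you would rather not depend on rebus.datetimes, the same extraction can be sketched in base R with a hand-written pattern. The regular expression below is my own assumption about the timestamp layout (month abbreviation, day, year, h:mm, AM/PM), the format string assumes an English locale, and tz = "UTC" is kept from the code above.

```r
## The timestamp line from the question's file
line <- "Jul 4 2016 7:17AM Jul 4 2016 7:21AM 226"

## Hand-written pattern: month abbreviation, day, year, h:mm, AM/PM
rx <- "[A-Z][a-z]{2} +[0-9]{1,2} +[0-9]{4} +[0-9]{1,2}:[0-9]{2}[AP]M"
stamps <- regmatches(line, gregexpr(rx, line))[[1]]

## The second timestamp on the line is current_run; %I pairs with %p
current_run <- as.POSIXct(strptime(stamps[2], "%b %d %Y %I:%M%p", tz = "UTC"))
print(current_run)
```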
The "Engine Utilization" section is found with
i_engine_util <- which(
stri_detect_fixed(lines, "Engine Utilization (Tick %)")
)
We want the first instance of "Server Summary" that comes after this line.
n_lines <- length(lines)
i_server_summary <- i_engine_util +
min(which(
stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")
))
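The "first match after a given line" idea can also be written by comparing line numbers directly, which avoids the offset arithmetic. Here is a small base R sketch on a made-up character vector; lines_demo is a shortened, hypothetical stand-in for the real file contents.

```r
## Shortened, hypothetical stand-in for the lines of a sysmon file
lines_demo <- c("Engine Utilization (Tick %)", "Engine 0",
                "Server Summary", "Average 4.2 %",
                "Average Runnable Tasks",
                "Server Summary", "Average 0.1")

i_start <- grep("Engine Utilization (Tick %)", lines_demo, fixed = TRUE)[1]
i_all   <- grep("Server Summary", lines_demo, fixed = TRUE)

## Keep only matches below the section header, then take the first one
i_first_after <- min(i_all[i_all > i_start])
lines_demo[i_first_after + 1]
```

Note that min() on an empty vector returns Inf with a warning, so a file without a "Server Summary" line after the header needs its own check.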
Use a regular expression to extract the number from the next line.
user_busy <- as.numeric(
stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9]+)?")
)
## [1] 4.2
The "Committed Xacts" line is found with
i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts"))
The count value is a run of digits surrounded by spaces.
xacts_count <- as.integer(
stri_extract_first_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
)
## [1] 41198
data.frame(
Date = current_run_date,
CPU = user_busy,
Count = xacts_count
)