How can I parse a sysmon file with R to extract specific information?

Time: 2016-08-03 20:35:23

Tags: regex r parsing pcre

I am trying to use R to parse files of this type, extract some information, and put the data in a data frame.

This is the content of the file:

    last_run                        current_run                     seconds     
 ------------------------------- ------------------------------- ----------- 
             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 


Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Engine 0                         5.0 %         0.4 %      22.4 %      72.1 % 
   Engine 1                         3.9 %         0.5 %      22.8 %      72.8 % 
   Engine 2                         5.6 %         0.3 %      22.5 %      71.6 % 
   Engine 3                         5.1 %         0.4 %      22.7 %      71.8 % 

     -------------------------  ------------  ------------  ----------  ---------- 
  Pool Summary        Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

  -------------------------  ------------  ------------  ----------  ---------- 
  Server Summary      Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

Transaction Profile
-------------------

  Transaction Summary             per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Committed Xacts                 137.3           n/a       41198     n/a     

     Average Runnable Tasks            1 min         5 min      15 min  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Global Queue                       0.0           0.0         0.0       0.0 %
   Engine 0                           0.0           0.1         0.1       0.6 %
   Engine 1                           0.0           0.0         0.0       0.0 %
   Engine 2                           0.2           0.1         0.1       2.6 %

  -------------------------  ------------  ------------  ----------             
  Pool Summary        Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1             

  -------------------------  ------------  ------------  ----------             
  Server Summary      Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1 

Device Activity Detail
  ----------------------

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_125                                         
    datadev_125                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_126                                         
    datadev_126                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_127                                         
    datadev_127                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             0.0           0.0           5       0.4 %
      Non-APF                         0.0           0.0           1       0.1 %
    Writes                            3.8           0.0        1128      99.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          3.8           0.0        1134       0.1 %

  Mirror Semaphore Granted            3.8           0.0        1134     100.0 %
  Mirror Semaphore Waited             0.0           0.0           0       0.0 %

  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /sybaser/database/sybaseR/dev/sybaseR.datadev_000                                    
    GPS_datadev_000               per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             7.9           0.0        2372      55.9 %
      Non-APF                         5.5           0.0        1635      38.6 %
    Writes                            0.8           0.0         233       5.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                         14.1           0.0        4240       0.3 %

  Mirror Semaphore Granted           14.1           0.0        4239     100.0 %
  Mirror Semaphore Waited             0.0           0.0           2       0.0 %

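I need to capture "Jul  4 2016  7:21AM" as the date and, from the "Engine Utilization (Tick %)" section, the Server Summary -> Average value "4.2 %".

From the "Transaction Profile" section, I need the Transaction Summary -> "count" entry.

So my data frame should look like this: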

Date                     Cpu   Count
Jul  4 2016  7:21AM      4.2   41198 

Can someone help me parse this file to get this output?

I have tried something like this:

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

which returns this line:

Average         4.2 %         0.3 %      22.9 %      72.5 % 

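But there may be many lines that start with Average, and I only want the one that appears right after Engine Utilization (Tick %).

How can I build that into this line to extract the information from the file: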

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

Can I use grep inside this read.table line to search for certain characters?

4 answers:

Answer 0: (score: 2)

%%%% Shot 1 - got something working

%%% Shot 2: first attempt at extracting a (possibly variable) number of device columns

extract <- function(filenam="file.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}

print(extract("file.txt"))

##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))

print(merged)

file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop

%%%%%%% Shot 3: extract two tables, one with a single row and the other with a variable number of rows (depending on the devices listed in each sysmon file).

extractv2 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]


    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines for each block (the "count" column, which is
    ##    token 5 after splitting the line on spaces).
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
    nblocks <- length(ii_block_dev)
    ## A. get counts for each device
    ## for each block, select *last* line matching "Total I/Os"
    ii_block_dev_aux <- c(ii_block_dev, Inf) ## sentinel so the last block also has an upper bound
    ii_lines_IOs_dev <- sapply(seq_len(nblocks), function(block){
        ## select the lines matching "Total I/Os" within each block
        IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                       ii_lines_IOs < ii_block_dev_aux[block+1]
                                   ]
        tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
    })
    lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
        Filter(function(v) v!="", strsplit(strng," ")[[1]])
    })
    IOs_counts <- sapply(lines_IOs, function(v) v[5])
    ## B. get device names:
    ## assumed to be on lines following each "Device:" match
    ii_devices <- 1 + ii_block_dev
    device_names <- sapply(ii_devices, function(ii){
        Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    })
    ## Create a data.frame with "device_names" as column names and "IOs_counts" as
    ## the values of a single row.
    ## Sorting the device names by order() will help produce the same column names
    ## if different sysmon files list the devices in different order
    ord <- order(device_names)
    devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
                             check.names=FALSE) ## Prevent R from messing with our device names

    data.frame(stringsAsFactors=FALSE, check.names=FALSE,
               Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))


## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)
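
The WATCH OUT note above matters as soon as two sysmon files list different devices, because rbind() on data frames requires identical column names. One possible workaround, sketched here with a hypothetical helper rbind_fill() and made-up one-row results (neither comes from the answer), is to pad each data frame with NA columns before binding:

```r
# Hypothetical helper (not part of the answer): row-bind data frames whose
# column sets differ, filling the columns a given file lacks with NA.
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    absent <- setdiff(all_cols, names(df))
    for (col in absent) df[[col]] <- NA   # add each absent column as NA
    df[all_cols]                          # put columns in a common order
  })
  do.call(rbind, padded)
}

# Two made-up one-row results with different device columns
a <- data.frame(Date = "Jul  4 2016  7:21AM", Cpu = "4.2", Count = "41198",
                datadev_125 = "0", stringsAsFactors = FALSE, check.names = FALSE)
b <- data.frame(Date = "Jul  4 2016  7:25AM", Cpu = "4.0", Count = "40000",
                datadev_127 = "1134", stringsAsFactors = FALSE, check.names = FALSE)

merged <- rbind_fill(list(a, b))
print(merged)
```

With this approach a device that is missing from a file shows up as NA in that file's row instead of causing rbind() to fail.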

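Answer 1: (score: 2)

Text files are sometimes easier to handle with a dedicated program. gawk, for example, is designed specifically for finding patterns in text files and printing data from them. A short gawk script can pull out the required data for loading into R. Each line of the script consists of a pattern to look for, followed by an action enclosed in {}. NR is a counter holding the number of lines read so far.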

BEGIN                          {OFS = ""; ORS = ""}
/current_run/                  {dat_line = NR+2; cpu_done = 0}
/Server Summary/               {cpu_line = NR+1}
/Transaction Summary/          {cnt_line = NR+2}
NR == dat_line                 {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0  {print $2," "; cpu_done = 1}
NR == cnt_line                 {print $5,"\n"}

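Save this script as "ext.awk", then load the extracted data from all the data files into an R data frame (assuming they are all in one folder and have the extension .txt):

df <- read.table(text=system("gawk -f ext.awk *.txt", intern = TRUE), col.names = c("Date","Cpu","Count"))

Note that gawk comes preinstalled on most Linux distributions. On Windows, you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm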
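
To see why the script wraps the date in single quotes, here is a small sketch of how read.table handles one line in the shape the gawk script prints. The sample string is typed in by hand as an assumption about that output, not captured from a real run:

```r
# One line in the shape the gawk script prints: 'date' cpu count
# (typed in by hand for illustration, not captured from gawk)
awk_output <- "'Jul 4 2016 7:21AM' 4.2 41198"

# read.table treats ' and " as quote characters by default, so the quoted
# date stays together as one field even though it contains spaces
df <- read.table(text = awk_output,
                 col.names = c("Date", "Cpu", "Count"),
                 stringsAsFactors = FALSE)
str(df)
```

Without the quotes, the date would be split on whitespace into four separate columns.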

Answer 2: (score: 0)

To read the files, I assume here that they are CSV files. For other file types, see http://www.r-tutor.com/r-introduction/data-frame/data-import

>utilization <- read.csv(file="",head=TRUE)
>serverSummary <-read.csv(file="",head=TRUE)
>transcProfile <- read.csv(file="",head=TRUE)

==> merge() only accepts two data frames at a time

>data <- merge(utilization,serverSummary)
>dataframe <-merge(data,transcProfile)

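Now all the columns are in one data frame:

>dataframe

Select the columns you need ==> the subset() function is the easiest way to select variables and observations

>subset(dataframe,select=c("last_run","Average","Transaction Profile"))

Now you can write it to CSV or any other file type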

>write.csv(dataframe, file = "MyData.csv")

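To merge all the files together

multmerge = function(mypath){
filenames=list.files(path=mypath, full.names=TRUE)
datalist = lapply(filenames, function(x){read.csv(file=x,header=TRUE)})
Reduce(function(x,y) {merge(x,y)}, datalist)
}

After running the code that defines the function, you can use it. The function takes a single path: the name of a folder that contains all the files you want to read and merge, and only those files. With that in mind, I have two tips:

Before using this function, I suggest creating a new folder in a short directory (for example, the path could be "C://R//mergeme") and saving all the files you want to merge into it. Also, make sure the columns used for matching are formatted the same way (and have the same names) in every file. Say you have saved 20 files into the mergeme folder at "C://R//mergeme" and want to read and merge them. To use my function, the syntax is:

mymergeddata = multmerge("C://R//mergeme")

After running this command, you have one fully merged data frame with all the variables matched to each other.

Now you can subset the data frame to the columns you need.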
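
As a minimal illustration of the Reduce/merge pattern that multmerge relies on, the sketch below folds merge() over three small in-memory tables. The tables and the key column name "host" are made up for the example:

```r
# Three made-up tables that share a "host" key column
cpu   <- data.frame(host = c("a", "b"), cpu = c(4.2, 3.9))
xacts <- data.frame(host = c("a", "b"), count = c(41198L, 40000L))
io    <- data.frame(host = c("a", "b"), ios = c(4240L, 1134L))

# merge() joins two data frames at a time; Reduce() folds it over the list
merged <- Reduce(function(x, y) merge(x, y, by = "host"),
                 list(cpu, xacts, io))
print(merged)
```

Naming the key with by= avoids accidental joins on every column the tables happen to share, which is what a bare merge(x, y) does.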

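Answer 3: (score: 0)

Use readLines or stringi::stri_read_lines to read the file contents as a character vector. The latter is usually faster, but it is less mature and occasionally breaks on unusual content.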

lines <- readLines("the file name")

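For fast regular-expression matching, stringi is usually the best choice. rebus.datetimes lets you generate a regular expression from a strptime date-format string.

Finding the current run date

The line on which current_run appears is found with: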

library(stringi)
library(rebus.datetimes)

i_current_run <- which(stri_detect_fixed(lines, "current_run"))

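To extract the date, this code only looks at the second line after the one where current_run was found. The code is vectorizable, though, so you can easily check every line if you have files where that assumption does not hold.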

date_format <- "%b%t%d%t%Y%t%H:%M%p"
rx_date <- rebus.datetimes::datetime(date_format, io = "input")
extracted_dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)
current_run_date <- strptime(
  extracted_dates[[1]][2], date_format, tz = "UTC"
)
## [1] "2016-07-04 07:21:00 UTC"
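
If installing rebus.datetimes is not an option, roughly the same step can be sketched in base R. This is an alternative under stated assumptions, not the answer's method: the regular expression is hand-written for this report's date layout, and the LC_TIME locale is set to "C" so that the English month abbreviation and AM/PM marker are recognized.

```r
# Sample line copied from the report: last_run, current_run, seconds
date_line <- "             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 "

# Assumption: month names and AM/PM are in English, so use the C locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Month abbreviation, day, year, then a 12-hour time with AM/PM
rx <- "[A-Z][a-z]{2} +[0-9]{1,2} +[0-9]{4} +[0-9]{1,2}:[0-9]{2}[AP]M"
dates <- regmatches(date_line, gregexpr(rx, date_line))[[1]]

# %I (12-hour clock) pairs with %p; whitespace in the format matches
# any run of spaces in the input. The second match is current_run.
current_run <- as.POSIXct(strptime(dates[2], "%b %d %Y %I:%M%p", tz = "UTC"))
print(format(current_run, "%Y-%m-%d %H:%M"))
```

The sketch uses %I rather than %H because R documents %p as working together with %I, which matters for afternoon timestamps.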

Finding % user busy

The "Engine Utilization" section is found with:

i_engine_util <- which(
  stri_detect_fixed(lines, "Engine Utilization (Tick %)")
)

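We want the first instance of "Server Summary" that appears after this line.

n_lines <- length(lines)  # total number of lines in the file
i_server_summary <- i_engine_util + 
  min(which(
    stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")
  ))

Use a regular expression to extract the number from the next line.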

user_busy <- as.numeric(
  stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9])")
)
## [1] 4.2
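
The "first match after a section header" idea is what keeps the second Server Summary block (under Average Runnable Tasks) from being picked up. A base R sketch of it follows; the helper name first_after() is hypothetical and the input is an abbreviated stand-in for the report:

```r
# Hypothetical helper: index of the first line containing `target`
# that comes after the first line containing `anchor` (NA if none).
first_after <- function(lines, anchor, target) {
  start <- grep(anchor, lines, fixed = TRUE)[1]
  if (is.na(start)) return(NA_integer_)
  hits <- grep(target, lines, fixed = TRUE)
  hits <- hits[hits > start]
  if (length(hits) == 0) return(NA_integer_)
  hits[1]
}

# Abbreviated stand-in for the report, with two "Server Summary" blocks
lines <- c(
  "Engine Utilization (Tick %)   User Busy   System Busy",
  "  Server Summary      Total       336.1 %        25.6 %",
  "                    Average         4.2 %         0.3 %",
  "  Average Runnable Tasks            1 min         5 min",
  "  Server Summary      Total           7.2           5.9",
  "                    Average           0.1           0.1"
)

i <- first_after(lines, "Engine Utilization (Tick %)", "Server Summary")
# The Average line follows the Server Summary line; field 2 is User Busy
fields <- strsplit(trimws(lines[i + 1]), " +")[[1]]
cpu <- as.numeric(fields[2])
print(cpu)
```

Returning NA instead of failing makes it easier to spot files where the section is missing when the helper is applied to many reports.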

Finding the number of committed xacts

The "Committed Xacts" line is at

i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts"))

The count value is a run of digits surrounded by spaces.

xacts_count <- as.integer(
  stri_extract_all_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
)
## [1] 41198
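
An alternative to the lookaround regex, sketched here in base R, is to split the line on runs of whitespace and pick the field by position. It assumes the column order shown in the sample report (label, per sec, per xact, count, % of total):

```r
# The "Committed Xacts" line as it appears in the sample report
xacts_line <- "    Committed Xacts                 137.3           n/a       41198     n/a     "

# Trim the ends, then split on runs of whitespace
fields <- strsplit(trimws(xacts_line), "\\s+")[[1]]

# The label takes two fields ("Committed", "Xacts"), so count is field 5
count <- as.integer(fields[5])
print(count)
```

Positional splitting is simpler to read, but it breaks if a label gains or loses a word, so the regex version is the safer choice when the labels vary.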

Combining the results

data.frame(
  Date = current_run_date,
  CPU = user_busy,
  Count = xacts_count
)