Question

Below is a sample of our software's log file. I would like to analyze this data with R to gain some insights.

30-Mar-14 17:59:58.1244 (6628 6452) Module1.exe:Program1.cpp,v:854: ERROR: group 7 failed with error = 0x8004000f

30-Mar-14 17:59:58.1254 (6628 6452) Module1.exe:Program1.cpp,v:880: ERROR: group 7 failed on its 3 retry

30-Mar-14 18:00:04.8491 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 1

30-Mar-14 18:00:08.6213 ( -1 1376 13900) Module2.exe:Execute:603: Information - command 1 completed.

30-Mar-14 18:00:08.6273 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 2

Each log file contains 20k lines, and we have a large number of log files.

I need each line split into fields like this:

| 30-Mar-14 | 17:59:58.1244 | (6628 6452) | Module1.exe:Program1.cpp,v | :854: | ERROR: group 7 failed with error = 0x8004000f |

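I tried importing this data using "Import Dataset" -> "From File" in RStudio. I tried different options, but it could not recognize the fields. Is there an option to split based on a pattern or a regular expression?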

Software environment:

R v3.0.3
RStudio
Windows 7

Note: I have edited the log file to remove the actual module names.

Answer 1

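The GUI itself has no such option (unlike, for example, Excel or SPSS, which may have more powerful GUI import options). You need a script.

You can construct a regular expression with capture groups that matches every line, then call gsub to extract the value of each group. For example: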

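text <- readLines("log.log")
# Group 1: date such as 30-Mar-14; group 2: time with fractional seconds.
rx <- "^([0-9]+-[^-]+-[0-9]+) +([0-9]+:[0-9]+:[0-9]+[.][0-9]+) +.*$"
# Stop early if any line does not match the pattern.
stopifnot(grepl(rx, text))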

Then:

date <- gsub(rx, "\\1", text)
time <- gsub(rx, "\\2", text)
date.time.df <- data.frame(date, time)

Or:

date.time <- gsub(rx, "\\1\n\\2", text)
date.time.l <- strsplit(date.time, "\n")
do.call(rbind, date.time.l)

Extend rx to match the other fields.
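As a rough sketch of what an extended pattern could look like, the code below adds one capture group per requested field and collects all groups at once with regexec and regmatches instead of one gsub call per group. The pattern and the column names (date, time, ids, source, line, message) are assumptions derived only from the sample lines in the question, so real logs may need adjustments.

```r
# Two sample lines copied from the question; in practice use readLines().
text <- c(
  "30-Mar-14 17:59:58.1244 (6628 6452) Module1.exe:Program1.cpp,v:854: ERROR: group 7 failed with error = 0x8004000f",
  "30-Mar-14 18:00:04.8491 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 1"
)

# One capture group per field. This layout is an assumption based on
# the sample lines, not a specification of the log format.
rx <- paste0(
  "^([0-9]+-[^-]+-[0-9]+) +",           # 1: date, e.g. 30-Mar-14
  "([0-9]+:[0-9]+:[0-9]+[.][0-9]+) +",  # 2: time with fractional seconds
  "(\\([^)]*\\)) +",                    # 3: ids in parentheses
  "([^ ]+)",                            # 4: module and source, no spaces
  "(:[0-9]+:) *",                       # 5: line number between colons
  "(.*)$"                               # 6: message text
)

# regexec returns the full match plus every group for each line.
m <- regmatches(text, regexec(rx, text))

# Every line must yield the full match and six groups.
stopifnot(all(sapply(m, length) == 7))

# Drop the full match and bind the groups into one row per line.
fields <- do.call(rbind, lapply(m, function(x) x[-1]))
log.df <- data.frame(fields, stringsAsFactors = FALSE)
names(log.df) <- c("date", "time", "ids", "source", "line", "message")
```

Lines that do not match come back as empty vectors, so which(sapply(m, length) == 0) can be used to locate them before deciding whether to relax the pattern.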

Answer 2

Here is a script that can do this:

x <- scan(text = "30-Mar-14 17:59:58.1244 (6628 6452) Module1.exe:Program1.cpp,v:854: ERROR: group 7 failed with error = 0x8004000f

30-Mar-14 17:59:58.1254 (6628 6452) Module1.exe:Program1.cpp,v:880: ERROR: group 7 failed on its 3 retry

30-Mar-14 18:00:04.8491 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 1

30-Mar-14 18:00:08.6213 ( -1 1376 13900) Module2.exe:Execute:603: Information - command 1 completed.

30-Mar-14 18:00:08.6273 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 2",
    what = '', sep = '\n')

# pull off date/time
dateTime <- sapply(strsplit(x, ' '), '[', 1:2)
# piece together with "|"
dateTime <- apply(dateTime, 2, paste, collapse = "|")
newX <- sub("^[^ ]+ [^(]+", "", x) 
# extract the data in parentheses
par1 <- sub("(\\([^)]+\\)).*", "\\1", newX)
newX <- sub("[^)]+\\)", "", newX)  # remove data just matched

# parse the rest of the data
x <- strsplit(newX, ":")
y <- sapply(x, function(.line){
    paste(c(paste(c(.line[1], .line[2]), collapse = ":")
      , paste0(":", .line[3], ":")
      , paste(.line[-(1:3)], collapse = ":")
      ), collapse = "|")
})

# put it all back together
paste0("|"
    , dateTime
    , "|"
    , par1
    , "|"
    , y
    , "|"
    )

Here is the output of the script:

[1] "|30-Mar-14|17:59:58.1244|(6628 6452)| Module1.exe:Program1.cpp,v|:854:| ERROR: group 7 failed with error = 0x8004000f|"
[2] "|30-Mar-14|17:59:58.1254|(6628 6452)| Module1.exe:Program1.cpp,v|:880:| ERROR: group 7 failed on its 3 retry|"         
[3] "|30-Mar-14|18:00:04.8491|( -1 1376 13900)| Module2.exe:Execute|:803:| Information - Executing command 1|"              
[4] "|30-Mar-14|18:00:08.6213|( -1 1376 13900)| Module2.exe:Execute|:603:| Information - command 1 completed.|"             
[5] "|30-Mar-14|18:00:08.6273|( -1 1376 13900)| Module2.exe:Execute|:803:| Information - Executing command 2|"
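If separate columns are more convenient for analysis than the delimited strings, something along these lines might work. It is only a sketch: the column names and the UTC time zone are assumptions, the %b month format depends on an English locale, and it assumes that no message contains a "|" character.

```r
# Two of the result strings shown above, copied verbatim.
res <- c(
  "|30-Mar-14|17:59:58.1244|(6628 6452)| Module1.exe:Program1.cpp,v|:854:| ERROR: group 7 failed with error = 0x8004000f|",
  "|30-Mar-14|18:00:04.8491|( -1 1376 13900)| Module2.exe:Execute|:803:| Information - Executing command 1|"
)

# Split on the literal "|". The leading delimiter yields an empty first
# element; the trailing delimiter does not add one.
parts <- strsplit(res, "|", fixed = TRUE)

# Drop the empty first element and strip leading spaces from each field.
fields <- do.call(rbind, lapply(parts, function(p) sub("^ +", "", p[-1])))
stopifnot(ncol(fields) == 6)

log.df <- data.frame(fields, stringsAsFactors = FALSE)
# Column names are hypothetical; choose whatever suits the analysis.
names(log.df) <- c("date", "time", "ids", "source", "line", "message")

# Combine date and time into one timestamp. %OS reads fractional seconds.
# The time zone is an assumption; the log does not state one.
log.df$timestamp <- as.POSIXct(
  strptime(paste(log.df$date, log.df$time),
           format = "%d-%b-%y %H:%M:%OS", tz = "UTC")
)
```

With a timestamp column in place, the gap between entries can be computed with diff(log.df$timestamp), and rows can be filtered by message text with grepl.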

Importing an unstructured software log file in R?
