Question

I am trying to parse syslog lines like this one:

pam_vas: Authentication <succeeded> for <active directory> user: <bobtheperson> account: <bobtheperson@com.com> reason: <N/A> Access cont(upn): <bob>

My goal is to break this data into key/value pairs. It needs to be a Perl regex (these happen to be Solaris logs going into Splunk, in case anyone is curious what it is for).

So far I have this:

[\>\:]*\s+(.*?)\<(.+?)\>

It extracts my data fine, but whenever a word ends with a colon, the colon is included in the first group.

Expected results:

Authentication = succeeded
for = active directory
user = bobtheperson
account = bobtheperson@com.com
reason = N/A
Access cont(upn) = bob

Actual results (note the colons):

Authentication = succeeded
for = active directory
user: = bobtheperson
account: = bobtheperson@com.com
reason: = N/A
Access cont(upn): = bob

Link to the code on http://regexr.com/: http://regexr.com/3fasr. A lot of trial and error got me this far; I just need to figure out how to drop that last piece of punctuation.

Answer 1

This regex seems to work for your case:

[\>\:]*\s+(.*?)\:?\s\<(.+?)\>

As you can see here: http://regexr.com/3fatg
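The only change is the `\:?\s` placed before `\<`: the optional colon and the space after it are now matched outside the first group, so the lazy `(.*?)` stops one character earlier. If you want to check this away from regexr, here is a minimal sketch in Python. It assumes Python's `re` module treats this particular pattern the same way Perl does (it uses no Perl-only features); the names `LINE`, `ORIGINAL`, `FIXED` and `extract_pairs` are made up for the illustration.

```python
import re

# Sample log line from the question.
LINE = (
    "pam_vas: Authentication <succeeded> for <active directory> "
    "user: <bobtheperson> account: <bobtheperson@com.com> "
    "reason: <N/A> Access cont(upn): <bob>"
)

# Pattern from the question: a trailing colon ends up inside group 1.
ORIGINAL = re.compile(r"[\>\:]*\s+(.*?)\<(.+?)\>")

# Pattern from this answer: an optional colon plus one whitespace
# character are consumed outside group 1.
FIXED = re.compile(r"[\>\:]*\s+(.*?)\:?\s\<(.+?)\>")


def extract_pairs(pattern, line):
    """Return (key, value) tuples for every match of pattern in line.

    The key is stripped because the original pattern also captures the
    space that precedes '<'.
    """
    return [(m.group(1).strip(), m.group(2)) for m in pattern.finditer(line)]


if __name__ == "__main__":
    for key, value in extract_pairs(FIXED, LINE):
        print(f"{key} = {value}")
```

Calling `extract_pairs(ORIGINAL, LINE)` instead should give back keys such as `user:` and `account:`, which is the behaviour described in the question.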

