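Parsing log files in R

Time: 2014-08-08 09:53:01

Tags: regex r logging

I'm new to R. I'm trying to parse some web logs so that I can do some analysis. So far I can extract the username, date and application (all of which are fixed width), but I'd also like to extract the information the person was searching for, which is somewhat more unstructured.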

raw_data <- c('2014-08-06 09:00:27554swomey                       SingleCustomerView                                                                   name=JOHN, nameEntity=JOHN, ppsn=1234567C, address1=123 Fake Street, dob=11/11/1911,',
'2014-08-06 09:00:30302swomey                       SingleCustomerView                 327FF1F4AFF3EE7C2C6334072CDE1401                  execution=e1s1, ',
'2014-08-06 10:01:38648agnolan                      SingleCustomerView                                                                   address1=123 FAKE STREET, dob=11/11/1911, name=JOHN SMITH, nameEntity=BLAH, ppsn=1234567E, ',
'2014-08-06 10:01:39552agnolan                      SingleCustomerView                 C3D63A0B53A43BBBDB7F76E55E906D74                  execution=e1s1, ')

splitdata <- data.frame(date=substr(raw_data,0,22), username=substr(raw_data,23,52), 
                    application_name=substr(raw_data,53,87), session_id=substr(raw_data,88,137))

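What is the best way to extract the other information (e.g. name, nameEntity, ppsn, address and so on), bearing in mind that not every variable is present on every line?

I've tried something along these lines, but I'm getting quite confused. I assume I need to use an apply function?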

x <- "name=JOHN, nameEntity=JOHN, ppsn=1234567C, "

pattern <- "ppsn=(\\w+)"

match   <- regexec(pattern, x)
words   <- regmatches(x, match)

match
words

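Many thanks.

Edit: Apologies, the log file actually looks like this, with commas inside the address line, so splitting on commas isn't straightforward.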

raw_data <- c('2014-08-06 09:00:27554swomey                       SingleCustomerView                                                                   name=JOHN, nameEntity=JOHN, ppsn=1234567C, address1=123 Fake Street, Dublin, Ireland, dob=11/11/1911,',
'2014-08-06 09:00:30302swomey                       SingleCustomerView                 327FF1F4AFF3EE7C2C6334072CDE1401                  execution=e1s1, ',
'2014-08-06 10:01:38648agnolan                      SingleCustomerView                                                                   address1=123 FAKE STREET, CORK, IRELAND, dob=11/11/1911, name=JOHN SMITH, nameEntity=BLAH, ppsn=1234567E, ',
'2014-08-06 10:01:39552agnolan                      SingleCustomerView                 C3D63A0B53A43BBBDB7F76E55E906D74                  execution=e1s1, ')

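1 answer:

Answer 0 (score: 1)

That's an interesting log format. If you want a data frame, this should work (there are several ways to do this; this one splits the strings rather than processing regex capture pairs):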

res <- lapply(raw_data, function(x) {

  # get the unstructured/variable bits

  # ("\ " is not a valid escape in an R string, so the space is matched literally)
  info <- gsub(", *$", "", substr(x, 137, nchar(x)))

  # and process them

  lapply(unlist(strsplit(info, ", +")), function(y) {

    # return name/value pairs in a list

    fields <- unlist(strsplit(y, "="))
    ret <- list()
    ret[[ fields[1] ]] <- fields[2]
    ret

  })

})

# make a data frame from them, filling in missing bits with NA

library(plyr)  # rbind.fill() comes from the plyr package

rbind.fill(lapply(res, as.data.frame))

##         name nameEntity     ppsn        address1        dob execution
## 1       JOHN       JOHN 1234567C 123 Fake Street 11/11/1911      <NA>
## 2       <NA>       <NA>     <NA>            <NA>       <NA>      e1s1
## 3 JOHN SMITH       BLAH 1234567E 123 FAKE STREET 11/11/1911      <NA>
## 4       <NA>       <NA>     <NA>            <NA>       <NA>      e1s1
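
The output above corresponds to the first version of the data. With the edited data, where addresses contain commas, splitting on every comma would cut `address1` into pieces. One possible workaround, sketched here in base R under the assumption that values never contain a `word=` sequence of their own, is to split only on commas that are followed by a key. `parse_pairs` is a hypothetical helper name, and `info` stands in for the variable part of each line (what `substr(x, 137, nchar(x))` yields in the code above).

```r
# Variable part of each log line, taken from the edited sample in the question
info <- c(
  "name=JOHN, nameEntity=JOHN, ppsn=1234567C, address1=123 Fake Street, Dublin, Ireland, dob=11/11/1911,",
  "execution=e1s1, ",
  "address1=123 FAKE STREET, CORK, IRELAND, dob=11/11/1911, name=JOHN SMITH, nameEntity=BLAH, ppsn=1234567E, ",
  "execution=e1s1, "
)

# Hypothetical helper: turn one "key=value, key=value, " string into a named character vector
parse_pairs <- function(s) {
  s <- sub(",\\s*$", "", trimws(s))  # drop surrounding blanks and the trailing comma
  # split only on commas that are followed by "key=", so commas inside a value survive
  parts <- strsplit(s, ",\\s*(?=\\w+=)", perl = TRUE)[[1]]
  keys <- sub("=.*$", "", parts)     # text before the first "="
  vals <- sub("^[^=]*=", "", parts)  # text after the first "="
  setNames(vals, keys)
}

parsed <- lapply(info, parse_pairs)

# Collect every key seen, then line the rows up, leaving NA where a key is absent
all_keys <- unique(unlist(lapply(parsed, names)))
mat <- do.call(rbind, lapply(parsed, function(p) unname(p[all_keys])))
extra <- as.data.frame(mat, stringsAsFactors = FALSE)
names(extra) <- all_keys

extra
```

This keeps the whole address in one column and needs no extra packages, at the cost of every column being character.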

You can then cbind this back onto what you already have from the fixed-width field processing.
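
As a rough illustration of that last step, the sketch below builds two sample lines with the column widths given in the question (22, 30, 35 and 50 characters), extracts the fixed-width fields, and binds them to a data frame of parsed fields. The `raw`, `fixed`, `extra` and `combined` names are made up for this example, and `extra` is written out by hand as a stand-in for the parsed result. `cbind` matches rows purely by position, so both data frames must have one row per log line in the same order.

```r
# Two sample lines padded to the fixed widths used in the question
raw <- sprintf("%-22s%-30s%-35s%-50s%s",
               c("2014-08-06 09:00:27554", "2014-08-06 09:00:30302"),
               c("swomey", "swomey"),
               "SingleCustomerView",
               c("", "327FF1F4AFF3EE7C2C6334072CDE1401"),
               c("ppsn=1234567C, ", "execution=e1s1, "))

# Fixed-width part: same positions as in the question, with padding trimmed
fixed <- data.frame(
  date             = substr(raw, 1, 22),
  username         = trimws(substr(raw, 23, 52)),
  application_name = trimws(substr(raw, 53, 87)),
  session_id       = trimws(substr(raw, 88, 137)),
  stringsAsFactors = FALSE
)

# Stand-in for the data frame of parsed key/value fields (one row per line)
extra <- data.frame(
  ppsn      = c("1234567C", NA),
  execution = c(NA, "e1s1"),
  stringsAsFactors = FALSE
)

# Column-bind the two halves; rows are matched by position only
combined <- cbind(fixed, extra)
combined
```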