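I'm new to R. I'm trying to parse some web logs so that I can run some analysis on them. So far I can extract the username, date and application (everything that is fixed width), but I'd also like to extract the information the person was looking up, which is somewhat more unstructured.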
raw_data <- c('2014-08-06 09:00:27554swomey SingleCustomerView name=JOHN, nameEntity=JOHN, ppsn=1234567C, address1=123 Fake Street, dob=11/11/1911,',
'2014-08-06 09:00:30302swomey SingleCustomerView 327FF1F4AFF3EE7C2C6334072CDE1401 execution=e1s1, ',
'2014-08-06 10:01:38648agnolan SingleCustomerView address1=123 FAKE STREET, dob=11/11/1911, name=JOHN SMITH, nameEntity=BLAH, ppsn=1234567E, ',
'2014-08-06 10:01:39552agnolan SingleCustomerView C3D63A0B53A43BBBDB7F76E55E906D74 execution=e1s1, ')
splitdata <- data.frame(date=substr(raw_data,0,22), username=substr(raw_data,23,52),
application_name=substr(raw_data,53,87), session_id=substr(raw_data,88,137))
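What is the best way to extract the other information (e.g. name, nameEntity, ppsn, address, etc.), bearing in mind that not every variable will be present in every line?
I have tried something along these lines, but I'm getting quite confused. I assume I need to use an apply function?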
x <- "name=JOHN, nameEntity=JOHN, ppsn=1234567C, "
pattern <- "ppsn=(\\w+)"
match <- regexec(pattern, x)
words <- regmatches(x, match)
match
words
Many thanks.
Edit: apologies, the log file actually looks like this, with commas inside the address field, so splitting on commas is not so easy.
raw_data <- c('2014-08-06 09:00:27554swomey SingleCustomerView name=JOHN, nameEntity=JOHN, ppsn=1234567C, address1=123 Fake Street, Dublin, Ireland, dob=11/11/1911,',
'2014-08-06 09:00:30302swomey SingleCustomerView 327FF1F4AFF3EE7C2C6334072CDE1401 execution=e1s1, ',
'2014-08-06 10:01:38648agnolan SingleCustomerView address1=123 FAKE STREET, CORK, IRELAND, dob=11/11/1911, name=JOHN SMITH, nameEntity=BLAH, ppsn=1234567E, ',
'2014-08-06 10:01:39552agnolan SingleCustomerView C3D63A0B53A43BBBDB7F76E55E906D74 execution=e1s1, ')
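Answer 0 (score: 1)
That's an interesting log format. If you want a data frame, this should work (there are several ways to do this; this one splits the strings rather than working with regex capture pairs):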
library(plyr)  # provides rbind.fill()

res <- lapply(raw_data, function(x) {
  # get the unstructured/variable bits, dropping the trailing comma
  info <- gsub(", *$", "", substr(x, 137, nchar(x)))
  # and process them
  lapply(unlist(strsplit(info, ", +")), function(y) {
    # return name/value pairs in a list
    fields <- unlist(strsplit(y, "="))
    ret <- list()
    ret[[ fields[1] ]] <- fields[2]
    ret
  })
})
# make a data frame from them, filling in missing bits with NA
rbind.fill(lapply(res, as.data.frame))
## name nameEntity ppsn address1 dob execution
## 1 JOHN JOHN 1234567C 123 Fake Street 11/11/1911 <NA>
## 2 <NA> <NA> <NA> <NA> <NA> e1s1
## 3 JOHN SMITH BLAH 1234567E 123 FAKE STREET 11/11/1911 <NA>
## 4 <NA> <NA> <NA> <NA> <NA> e1s1
You can then cbind this back onto what you already have working from the fixed-width field processing.
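For the edited data, where the address itself contains commas, splitting on every comma would break the address into pieces. One possible workaround, sketched below in base R only, is to split only on commas that are followed by another `key=`. The helper name `parse_pairs` and the variable `extra` are made up for this illustration, and the sketch assumes that keys are plain word characters and that values never contain a `word=` sequence of their own.

```r
# Sample lines from the question's edit (addresses contain commas)
raw_data <- c(
  "2014-08-06 09:00:27554swomey SingleCustomerView name=JOHN, nameEntity=JOHN, ppsn=1234567C, address1=123 Fake Street, Dublin, Ireland, dob=11/11/1911,",
  "2014-08-06 09:00:30302swomey SingleCustomerView 327FF1F4AFF3EE7C2C6334072CDE1401 execution=e1s1, ",
  "2014-08-06 10:01:38648agnolan SingleCustomerView address1=123 FAKE STREET, CORK, IRELAND, dob=11/11/1911, name=JOHN SMITH, nameEntity=BLAH, ppsn=1234567E, ",
  "2014-08-06 10:01:39552agnolan SingleCustomerView C3D63A0B53A43BBBDB7F76E55E906D74 execution=e1s1, "
)

# Hypothetical helper: turn one log line into a named character vector
parse_pairs <- function(line) {
  # no key=value pair at all: return an empty named vector
  if (!grepl("\\w+=", line)) {
    return(setNames(character(0), character(0)))
  }
  # drop everything before the first "key="
  tail <- sub("^.*?(?=\\w+=)", "", line, perl = TRUE)
  # drop the trailing comma and any whitespace after it
  tail <- sub(",\\s*$", "", tail)
  # split only on commas that are followed by another "key="
  pieces <- strsplit(tail, ",\\s*(?=\\w+=)", perl = TRUE)[[1]]
  keys   <- sub("=.*$", "", pieces)      # text before the first "="
  values <- sub("^[^=]*=", "", pieces)   # text after the first "="
  setNames(values, keys)
}

parsed <- lapply(raw_data, parse_pairs)

# every key seen in any line, in order of first appearance
all_keys <- unique(unlist(lapply(parsed, names)))

# one row per line; keys missing from a line become NA
mat <- do.call(rbind, lapply(parsed, function(p) unname(p[all_keys])))
colnames(mat) <- all_keys
extra <- as.data.frame(mat, stringsAsFactors = FALSE)

print(extra)
```

If `splitdata` from the question has one row per log line in the same order, `cbind(splitdata, extra)` should line the two up.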