How to parse log entries with a regular expression in R

Time: 2016-06-27 13:09:56

Tags: regex r

I have a file of log entries that I want to parse. Every line looks like this:

F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)

Each part has a specific name, which I list below.

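  1. F - line identifier

  2. 20160525 - date (yyyymmdd)

  3. 17:52:38.791 - timestamp (HH:MM:SS.sss)

  4. F798259D - transfer identifier

  5. 156.145.15.85:46634 - IP address and associated port

  6. xqixh8sl - username

  7. AES - encryption level (can be a - (dash))

  8. "/pcgc ... fastq.gz" - transferred file (enclosed in "")

  9. "" - additional string (should be empty, "")

  10. 2951144113 - transferred bytes

  11. (0,0) - error

  12. "2289.47 seconds (10.3 megabits/sec)" - data about the transfer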

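I have imported the data file and am using the read.pattern() function to parse it and split it into its fields. I only want the information for fields 2, 3, 4, 5, 6, 7, 8, 10 and 12. However, I can't get the pattern right. Before this, I managed to get two of the fields I need with this pattern: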

    pattern <- "^F ([0-9]+) [^ ]* .* \\(0,0\\) (.*)$"
    

That gave me a data frame that looks like this:

        date        speed of data transfer
    1 20160525 "1.62 seconds (1.30 kilobits/sec)"
    2 20160525 "0.29 seconds (1.93 kilobits/sec)"
    3 20160525 "0.01 seconds (34.0 kilobits/sec)"
    4 20160525 "0.01 seconds (102 kilobits/sec)"
    5 20160525 "38.05 seconds (214 megabits/sec)"
    

Those are only two of the fields I need, but whenever I try to add more I mess up the syntax. For example:

    pattern <- "^F\\s([0-9]+)\\s[0-9:.]+\\s([:alnum:])\\s[A-Z]\\s([0-9.:]+)\\s([:alnum:])\\s([•])\\s[:punct:][A-z][:punct:]\\s[:punct:]\\s.* \\(0,0\\) (.*)$"
    

This doesn't work. Can someone help me write this? It's driving me crazy. Thanks!
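One likely cause of the failure, offered as a sketch rather than a definitive fix: in R's default regex engine a POSIX class such as `[:alnum:]` only has its class meaning inside a bracket expression, so it is written `[[:alnum:]]`, and each field also needs a `+` quantifier to match more than one character. The pattern below makes those changes and is checked with base R's `regexec()` on the sample line. The per-field sub-patterns (for example `[^ ]+` for the encryption level and `\\([^)]*\\)` for the error field) are assumptions based on that single line.

```r
# Sample line copied from the question (single quotes keep the inner " intact)
line <- 'F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)'

# Build the pattern piece by piece; parentheses mark the fields to keep
pattern <- paste0(
  "^F ([0-9]+) ",       # 2: date
  "([0-9:.]+) ",        # 3: timestamp
  "([[:alnum:]]+) ",    # 4: transfer identifier
  "[A-Z]+ ",            # request method (e.g. GET), not captured
  "([0-9.:]+) ",        # 5: IP address and port
  "([[:alnum:]]+) ",    # 6: username
  "([^ ]+) ",           # 7: encryption level, may be "-"
  "\"([^\"]*)\" ",      # 8: transferred file, without the quotes
  "\"[^\"]*\" ",        # 9: additional string, not captured
  "([0-9]+) ",          # 10: transferred bytes
  "\\([^)]*\\) ",       # 11: error, not captured
  "(.*)$"               # 12: data about the transfer
)

# regexec() returns the full match first, then one entry per capture group
m <- regmatches(line, regexec(pattern, line))[[1]]
fields <- m[-1]
print(fields)
```

Because the pattern is an ordinary regex string, it could presumably be passed to read.pattern() in place of the original one.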

2 answers:

Answer 0: (score: 0)

Here is my solution:

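library(stringr)
# Read the log file into a character vector, one element per line
con <- readLines("dataSet.txt")
# Each [:graph:]+ matches one run of visible (non-space) characters;
# parentheses mark the fields to keep
pattern <- "^F (\\d+) ([:graph:]+) ([:graph:]+) [A-Z]+ ([:graph:]+) ([:graph:]+) ([:graph:]+) ([:graph:]+) [:graph:]+ (\\d+) [:graph:]+ (.+)$"
# str_match() returns the full match in column 1, then one column per group
matches <- str_match(con, pattern)
# Drop the full-match column and any lines that did not match
df <- data.frame(na.omit(matches[, -1]))
colnames(df) <- c("date", "timestamp", "transfer ID", "IP address", "username", "encryption level", "transferred file", "transferred bytes", "speed of data transfer")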

This is the result:

1 20160525 08:22:06.838 F798256B 10.199.194.38:57708 wei2dt - "" 264 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 08:28:26.920 F798256C 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp-sv.tmp" 69 "0.29 seconds (1.93 kilobits/sec)"
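All columns produced this way are character strings. As a follow-up sketch (not part of the original answer), the date and timestamp can be combined into a date-time and the byte count converted to a number. The short column names and the UTC time zone below are assumptions made for illustration. `as.numeric()` is used rather than `as.integer()` because byte counts such as 3322771022 exceed R's integer range.

```r
# Small stand-in for the parsed data frame (values taken from the result above;
# the column names are simplified, hypothetical versions of the answer's names)
df <- data.frame(
  date      = c("20160525", "20160525"),
  timestamp = c("08:22:06.838", "08:28:26.920"),
  bytes     = c("264", "69"),
  stringsAsFactors = FALSE
)

# %OS parses seconds including the fractional part; the time zone is assumed
df$when <- as.POSIXct(paste(df$date, df$timestamp),
                      format = "%Y%m%d %H:%M:%OS", tz = "UTC")

# Doubles can hold byte counts above 2^31 - 1, which integers cannot
df$bytes <- as.numeric(df$bytes)

str(df)
```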

Answer 1: (score: -1)

If all of your lines follow a similar structure, you may be able to get away with simply splitting each line on spaces.

x <- "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)"

library(dplyr)
library(magrittr)
strsplit(x, " ") %>%
  unlist() %>%
  t() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  setNames(c("id", "date", "timestamp", "transfer_id",
             "curl_method", "ip_address", "username", "encryption",
             "transferred_file", "additional_string",
             "transferred_bytes", "error",
             "rate1", "rate2", "rate3", "rate4")) %>%
  mutate(rate = paste(rate1, rate2, rate3, rate4)) %>%
  select(-rate1:-rate4)
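Splitting on spaces relies on every line producing exactly 16 tokens, which fails as soon as a file name contains a space. The sketch below applies the same idea to several lines in base R and sets aside lines with an unexpected token count. The second input line is made up for illustration and does not come from the original log.

```r
# The sample line from the question, plus a hypothetical line whose
# file name contains a space (invented here to show the failure case)
lines <- c(
  'F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)',
  'F 20160602 14:30:00.000 F7982D51 GET 156.145.15.85:37526 xqixh8sl AES "/tmp/my file.gz" "" 100 (0,0) "0.01 seconds (34.0 kilobits/sec)'
)

field_names <- c("id", "date", "timestamp", "transfer_id",
                 "curl_method", "ip_address", "username", "encryption",
                 "transferred_file", "additional_string",
                 "transferred_bytes", "error",
                 "rate1", "rate2", "rate3", "rate4")

# Split every line on single spaces and keep only lines with 16 tokens
parts <- strsplit(lines, " ", fixed = TRUE)
ok <- lengths(parts) == length(field_names)

# Stack the well-formed lines into a data frame, one row per line
df <- as.data.frame(do.call(rbind, parts[ok]), stringsAsFactors = FALSE)
names(df) <- field_names

# Rejoin the four rate tokens and drop the pieces
df$rate <- paste(df$rate1, df$rate2, df$rate3, df$rate4)
df <- df[, !names(df) %in% c("rate1", "rate2", "rate3", "rate4")]

# Lines that could not be split cleanly, kept for inspection
skipped <- lines[!ok]
```

Lines that end up in `skipped` need a quote-aware approach, such as a regular expression, instead.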