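Not sure how to split a string correctly and print it out using variables in R

Time: 2016-07-30 22:54:31

Tags: r split output character

I am trying to read a log file of NASA data line by line and split each line into 5 columns. Right now the split does not seem to work correctly, and a further problem is that the fields do not share a common delimiter.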

fileName <- 'C:/Users/xxxxx/Desktop/access_log_Jul95.txt'
fileConn <- file('C:/Users/xxxxx/Desktop/output.txt')
conn <- file(fileName, open = "r")
linn <- readLines(conn)
fo00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0

This is the output I want:

199.72.81.55, [01/Jul/1995:00:00:01 -0400], GET, /history/apollo/ HTTP/1.0, 200, 6245

2 answers:

Answer 0: (score: 2)

Not as elegant as @Psidom's solution, but this gets the job done:

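library(stringr)
library(dplyr)

# Split every line on single spaces (this assumes each line yields exactly
# 10 tokens), then bind the pieces into a data frame with columns V1..V10.
df <- str_split(linn, " ") %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  mutate(V6 = str_replace(V6, '"', ""),   # drop the opening quote before the method
         V8 = str_replace(V8, '"', ""),   # drop the closing quote after the protocol
         a = paste(V4, V5),               # timestamp plus UTC offset
         b = paste0(V7, V8)) %>%          # request path plus protocol
  select(c(1, 11, 6, 12, 9, 10))
# Clean up the column names
names(df) <- paste0("V", seq_len(ncol(df)))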

Output:

                    V1                           V2  V3                                                      V4  V5   V6
1         199.72.81.55 [01/Jul/1995:00:00:01 -0400] GET                                /history/apollo/HTTP/1.0 200 6245
2 unicomp6.unicomp.net [01/Jul/1995:00:00:06 -0400] GET                             /shuttle/countdown/HTTP/1.0 200 3985
3       199.120.110.21 [01/Jul/1995:00:00:09 -0400] GET    /shuttle/missions/sts-73/mission-sts-73.htmlHTTP/1.0 200 4085
4   burger.letters.com [01/Jul/1995:00:00:11 -0400] GET                 /shuttle/countdown/liftoff.htmlHTTP/1.0 304    0
5       199.120.110.21 [01/Jul/1995:00:00:11 -0400] GET /shuttle/missions/sts-73/sts-73-patch-small.gifHTTP/1.0 200 4179
6   burger.letters.com [01/Jul/1995:00:00:12 -0400] GET                      /images/NASA-logosmall.gifHTTP/1.0 304    0
7   burger.letters.com [01/Jul/1995:00:00:12 -0400] GET          /shuttle/countdown/video/livevideo.gifHTTP/1.0 200    0
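
For a single line, the same space-splitting idea can be sketched in base R. This is only an illustration: the sample line is an assumption about the raw log format, reconstructed from the output shown above, and the names `tok`, `fields` and `out` are made up for this example. Unlike the pipeline above, it keeps the space between the path and the protocol, so the result matches the format asked for in the question.

```r
# Assumed raw log line (the original file is not shown in full).
line <- '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Splitting on single spaces gives 10 tokens for a line in this format.
tok <- strsplit(line, " ", fixed = TRUE)[[1]]
stopifnot(length(tok) == 10)

fields <- c(
  host      = tok[1],
  timestamp = paste(tok[4], tok[5]),                      # "[date" + "offset]"
  method    = sub('"', "", tok[6], fixed = TRUE),         # strip the opening quote
  request   = paste(tok[7], sub('"', "", tok[8], fixed = TRUE)),
  status    = tok[9],
  bytes     = tok[10]
)

# Join the six fields with ", " as in the desired output.
out <- paste(fields, collapse = ", ")
print(out)
```

A request path that itself contains a space would produce more than 10 tokens, so the `stopifnot()` check is there to fail early on such lines rather than silently shift the columns.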

Answer 1: (score: 1)

Try splitting with this regular expression, ( - - |(?<=]) |(?<=\\") |(?<=\\d) (?=\\d)):

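lines <- readLines(conn)
# Split on " - - ", on a space after "]" or after a closing quote,
# and on a space that sits between two digits.
do.call(rbind,
  lapply(lines, function(line) strsplit(line, '( - - |(?<=]) |(?<=\\") |(?<=\\d) (?=\\d))', perl = TRUE)[[1]]))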

#      [,1]                   [,2]                           [,3]                                                               [,4]  [,5]  
# [1,] "199.72.81.55"         "[01/Jul/1995:00:00:01 -0400]" "\"GET /history/apollo/ HTTP/1.0\""                                "200" "6245"
# [2,] "unicomp6.unicomp.net" "[01/Jul/1995:00:00:06 -0400]" "\"GET /shuttle/countdown/ HTTP/1.0\""                             "200" "3985"
# [3,] "199.120.110.21"       "[01/Jul/1995:00:00:09 -0400]" "\"GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0\""    "200" "4085"
# [4,] "burger.letters.com"   "[01/Jul/1995:00:00:11 -0400]" "\"GET /shuttle/countdown/liftoff.html HTTP/1.0\""                 "304" "0"   
# [5,] "199.120.110.21"       "[01/Jul/1995:00:00:11 -0400]" "\"GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0\"" "200" "4179"
# [6,] "burger.letters.com"   "[01/Jul/1995:00:00:12 -0400]" "\"GET /images/NASA-logosmall.gif HTTP/1.0\""                      "304" "0"   
# [7,] "burger.letters.com"   "[01/Jul/1995:00:00:12 -0400]" "\"GET /shuttle/countdown/video/livevideo.gif HTTP/1.0\""          "200" "0"
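
If the goal is the comma-separated format from the question written to a file, one possible alternative is to capture the fields with a single pattern instead of splitting. The sketch below makes several assumptions: the two sample lines are reconstructed from the output above, the capture pattern is my own and not part of either answer, and the result goes to a temporary file rather than the `output.txt` path from the question.

```r
# Assumed raw log lines (reconstructed from the output above).
lines <- c(
  '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
  'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0'
)

# One capture group per field: host, timestamp, method, request, status, bytes.
pattern <- '^(\\S+) - - (\\[[^]]+\\]) "(\\S+) ([^"]*)" (\\d+) (\\S+)$'

# regexec() + regmatches() return, per line, the full match followed by the groups;
# lines that do not match give character(0).
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
ok <- lengths(m) == 7

# Drop the full match, keep the six groups, one row per matching line.
fields <- do.call(rbind, lapply(m[ok], function(x) x[-1]))

# Join each row with ", " and write the result out.
out <- apply(fields, 1, paste, collapse = ", ")
out_file <- tempfile(fileext = ".txt")   # hypothetical target; swap in your own path
writeLines(out, out_file)
print(out)
```

Lines that do not fit the pattern are skipped via `ok`, so it is worth checking `sum(!ok)` on the real file before trusting the output.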