How do I extract elements from a web log to form a data.frame?

Asked: 2013-11-18 14:54:03

Tags: r

I have a web log with about 1 million lines, and I want to extract the date, time, and status from each line to form a new data.frame.

       V1
       2013-08-27 16:00:01 117.79.149.2 GET 200 0 0
       2013-08-27 16:00:02 117.79.149.2 GET 404 0 0
       2013-08-27 16:00:03 117.79.149.2 GET 200 0 0
       2013-08-27 16:00:04 117.79.149.2 GET 404 0 0

should become

       Date_Time              Status
       2013-08-27 16:00:01    200
       2013-08-27 16:00:02    404
       2013-08-27 16:00:03    200
       2013-08-27 16:00:04    404

I know how to extract the elements I need from a single line `x` with the following code:

       temp<-unlist(strsplit(x," "))
       Date_Time<-paste(temp[1],temp[2])
       Status<-temp[5]

But I don't know how to apply it to every line and build the new data.frame without a `for` loop. How can I do that?

3 Answers:

Answer 0 (score: 3)

A solution based on regular expressions:

with(dat, data.frame(Date_Time = gsub("(.*\\:[0-9]+) .*", "\\1", V1),
                     Status = gsub(".*T ([0-9]+) .*", "\\1", V1)))

#             Date_Time Status
# 1 2013-08-27 16:00:01    200
# 2 2013-08-27 16:00:02    404
# 3 2013-08-27 16:00:03    200
# 4 2013-08-27 16:00:04    404

where dat is your data frame:

dat <- data.frame(V1 = readLines(
  textConnection("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0
2013-08-27 16:00:02 117.79.149.2 GET 404 0 0
2013-08-27 16:00:03 117.79.149.2 GET 200 0 0
2013-08-27 16:00:04 117.79.149.2 GET 404 0 0")))
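
Note that the Status pattern above keys on a `T` followed by a space, so it matches methods such as GET, POST, and PUT but would leave a HEAD line unchanged. As a hedged alternative, here is a minimal sketch that matches fields by position instead. It assumes every line has the layout shown in the question (date, time, IP, method, status, then more fields, all separated by single spaces); the names `log_pattern` and `result` are made up for this illustration.

```r
# Sample data, same lines as in the question
dat <- data.frame(V1 = c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
                         "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0",
                         "2013-08-27 16:00:03 117.79.149.2 GET 200 0 0",
                         "2013-08-27 16:00:04 117.79.149.2 GET 404 0 0"),
                  stringsAsFactors = FALSE)

# Group 1: first two fields (date and time); group 2: fifth field (status).
# Assumption: fields are separated by exactly one space.
log_pattern <- "^([^ ]+ [^ ]+) [^ ]+ [^ ]+ ([0-9]+) .*$"

# sub() is vectorised, so no explicit loop is needed
result <- data.frame(Date_Time = sub(log_pattern, "\\1", dat$V1),
                     Status    = sub(log_pattern, "\\2", dat$V1),
                     stringsAsFactors = FALSE)
```

Lines that do not match the pattern are returned unchanged by `sub()`, so it may be worth checking `grepl(log_pattern, dat$V1)` on a real log before trusting the result.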

Answer 1 (score: 0)

You can use sapply to pull the same element out of every split line:

example <- c("asdf asdwer dsf cswe asd","asfdw ewr cswe sdf wers")  
split.example <- strsplit(example," ")
example.2 <- sapply(split.example,"[[",2)

This gives:

> example.2
[1] "asdwer" "ewr" 

Using the dat provided by @Sven, just to make the answer complete:

temp <- strsplit(as.character(dat$V1)," ")
new.df <- data.frame(Date_Time = paste(sapply(temp,"[[",1),
                                       sapply(temp,"[[",2)),
                     Status = sapply(temp,"[[",5))

> new.df
            Date_Time Status
1 2013-08-27 16:00:01    200
2 2013-08-27 16:00:02    404
3 2013-08-27 16:00:03    200
4 2013-08-27 16:00:04    404
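
The same idea can be written with `vapply`, which checks that every extracted element is a single string and fails early on a line with too few fields. This is only a sketch under the assumption that each line has at least five space-separated fields; the helper `field` is a hypothetical name introduced here.

```r
# Sample data, same lines as in the question
dat <- data.frame(V1 = c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
                         "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0",
                         "2013-08-27 16:00:03 117.79.149.2 GET 200 0 0",
                         "2013-08-27 16:00:04 117.79.149.2 GET 404 0 0"),
                  stringsAsFactors = FALSE)

# Split each line on a literal space; the result is a list of character vectors
parts <- strsplit(as.character(dat$V1), " ", fixed = TRUE)

# Hypothetical helper: i-th field of every line, guaranteed to be character
field <- function(i) vapply(parts, `[[`, FUN.VALUE = character(1), i)

new.df <- data.frame(Date_Time = paste(field(1), field(2)),
                     Status    = field(5),
                     stringsAsFactors = FALSE)
```

If a line is malformed, `[[` raises a "subscript out of bounds" error instead of silently producing a misaligned column.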

Answer 2 (score: 0)

mydf <- data.frame(V1=c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
   "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0",
   "2013-08-27 16:00:03 117.79.149.2 GET 200 0 0",
   "2013-08-27 16:00:04 117.79.149.2 GET 404 0 0"))

# With fixed-width fields (assumes the IP address and method have the same width on every line)
mydf[, c("Date_Time", "Status")] <- list(substring(mydf$V1, 1, 19),
                                         substring(mydf$V1, 38, 40))


# Or split on the " " delimiter, which is closer to your attempt
# (assumes every line has exactly 7 space-separated fields)
strings <- unlist(strsplit(as.character(mydf$V1), " "))
n <- nrow(mydf)
mydf[, c("Date_Time", "Status")] <- list(
  paste(strings[seq(1, length.out = n, by = 7)],
        strings[seq(2, length.out = n, by = 7)]),
  strings[seq(5, length.out = n, by = 7)])
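
Because the log is space-delimited, another option (not used in the answers above) is to let `read.table` do the splitting and then combine the columns you need. The sketch below assumes every line has the same number of fields; `log_lines`, `fields`, and `new_df` are illustrative names, and for a real log you would pass the file path to `read.table` instead of `text =`.

```r
# Sample lines, same as in the question
log_lines <- c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
               "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0",
               "2013-08-27 16:00:03 117.79.149.2 GET 200 0 0",
               "2013-08-27 16:00:04 117.79.149.2 GET 404 0 0")

# Parse the lines as a space-separated table; columns are named V1..V7.
# colClasses = "character" keeps every field as text (no type guessing).
fields <- read.table(text = log_lines, sep = " ", header = FALSE,
                     colClasses = "character", stringsAsFactors = FALSE)

# Date is column 1, time is column 2, status is column 5
new_df <- data.frame(Date_Time = paste(fields$V1, fields$V2),
                     Status    = fields$V5,
                     stringsAsFactors = FALSE)
```

Keeping the columns as character avoids surprises from automatic type conversion; `Status` can be converted afterwards with `as.integer()` if numeric codes are wanted.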