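R newbie here. This is probably obvious, but I just haven't been getting my searches right.
I'm parsing web server logs into a data.table, and I want to create a bunch of columns by extracting parts of the request string. My source data looks like this: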
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1" 200 26294 "https://bela.com/home/amazeballs" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 2.031 2.031 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1" 200 4485 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.173 0.173 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/css/gzip_2073017426/bundles/lib.css HTTP/1.1" 200 4851 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.168 0.168 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/js/gzip_1764696599/bundles/app.js HTTP/1.1" 200 7499 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.290 0.290 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/js/gzip_N1319387470/bundles/lib.js HTTP/1.1" 200 132880 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.366 0.366 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930] "GET /silly/js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 1386 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.233 0.233 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930] "GET /silly/styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 2121 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.108 0.108 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930] "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 3230 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.174 0.174 .
So I knocked up the following code:
alog <- fread('cat sample.log | grep -v "GET /junk" | cut -f 4,6- -d " " ')
setnames(alog, c("ip","remote_user","datetime","timezone","request","status","bytes","referer","user_agent","http_x_forwarded_for","request_time","upstream_response_time","pipe"))
request_parts <- function(x) {
m <- regexec("^([A-Z]+) /([^/]+)/([^\\?]+)(\\?[^ ]+)? HTTP/(.*)", x)
parts <- do.call(rbind, lapply(regmatches(x, m), `[`, c(2, 3, 4, 5, 6)))
colnames(parts) <- c("method","webapp","page","query_string", "http_version")
parts
}
parts <- request_parts(alog$request)
alog[,colnames(parts):=parts]
It seems to work up to a point:
> alog$request
[1] "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1" "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1"
[3] "GET /silly/jawr/css/gzip_2073017426/bundles/lib.css HTTP/1.1" "GET /silly/jawr/js/gzip_1764696599/bundles/app.js HTTP/1.1"
[5] "GET /silly/jawr/js/gzip_N1319387470/bundles/lib.js HTTP/1.1" "GET /silly/js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
[7] "GET /silly/styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
> parts
     method webapp page                                                                    query_string        http_version
[1,] "GET"  "silly" "sales/1234567890"                                                      "?amazeballsTask=Y" "1.1"
[2,] "GET"  "silly" "jawr/css/gzip_N676825985/bundles/app.css"                              ""                  "1.1"
[3,] "GET"  "silly" "jawr/css/gzip_2073017426/bundles/lib.css"                              ""                  "1.1"
[4,] "GET"  "silly" "jawr/js/gzip_1764696599/bundles/app.js"                                ""                  "1.1"
[5,] "GET"  "silly" "jawr/js/gzip_N1319387470/bundles/lib.js"                               ""                  "1.1"
[6,] "GET"  "silly" "js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558" ""                  "1.1"
[7,] "GET"  "silly" "styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558"           ""                  "1.1"
[8,] "GET"  "silly" "js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558"                ""                  "1.1"
But this doesn't do what I want, which is to add columns for all the parts onto alog:
> alog$method
[1] "GET" "GET" "GET" "GET" "GET" "GET" "GET" "GET"
> # yay!
> alog$webapp
[1] "GET" "GET" "GET" "GET" "GET" "GET" "GET" "GET"
> # dismay :(
What am I doing wrong? There are a bunch of warnings like the following, but I don't really get what they're trying to tell me.
1: In `[.data.table`(alog, , `:=`(colnames(parts), parts)) :
  5 column matrix RHS of := will be treated as one vector
2: In `[.data.table`(alog, , `:=`(colnames(parts), parts)) :
  Supplied 40 items to be assigned to 8 items of column 'method' (32 unused)
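Answer 0 (score: 4)
parts is a matrix; you have to convert it to a data.table for this to work. As the warnings say, := treats a matrix on the right-hand side as one long vector, so every new column is filled from the first 8 of its 40 items (that is, the first column of the matrix), which is why webapp comes out as "GET". Here's an example: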
m <- matrix(1:25, ncol=5)
colnames(m) <- LETTERS[1:5]
library(data.table)
dt <- data.table(x=1:5)
dt[,colnames(m):=m]
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(colnames(m), m)) :
# 5 column matrix RHS of := will be treated as one vector
# ...
dt # not what you want...
# x A B C D E
# 1: 1 1 1 1 1 1
# 2: 2 2 2 2 2 2
# 3: 3 3 3 3 3 3
# 4: 4 4 4 4 4 4
# 5: 5 5 5 5 5 5
dt[,colnames(m):=as.data.table(m)]
dt # better
# x A B C D E
# 1: 1 1 6 11 16 21
# 2: 2 2 7 12 17 22
# 3: 3 3 8 13 18 23
# 4: 4 4 9 14 19 24
# 5: 5 5 10 15 20 25
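Applied to the question's data, the same conversion might look like the sketch below. It assumes a two-row stand-in for alog built by hand from two of the request strings in the question (the real table comes from fread), and it reuses the asker's request_parts function unchanged.

```r
library(data.table)

# Stand-in for the real alog: two request strings copied from the question.
alog <- data.table(request = c(
  "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1",
  "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
))

# The asker's helper: returns a character matrix, one row per request.
request_parts <- function(x) {
  m <- regexec("^([A-Z]+) /([^/]+)/([^\\?]+)(\\?[^ ]+)? HTTP/(.*)", x)
  parts <- do.call(rbind, lapply(regmatches(x, m), `[`, c(2, 3, 4, 5, 6)))
  colnames(parts) <- c("method", "webapp", "page", "query_string", "http_version")
  parts
}

parts <- request_parts(alog$request)

# Convert the matrix to a data.table so := assigns it column by column.
alog[, (colnames(parts)) := as.data.table(parts)]

alog$webapp
# [1] "silly" "silly"
```

Wrapping colnames(parts) in parentheses is optional here, but it makes it explicit that the left-hand side is an expression evaluating to column names.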