
Regexp to find Googlebot requests with HTTP status code 200 in an Apache log

Time: 2012-02-02 11:29:13

Tags: regex apache grep tail

I am looking for a regular expression that keeps only the lines from Googlebot with status code 200, such as:

xxx.xxx.xxx.xxx - - [02/Feb/2012:12:21:26 +0100] "GET /some/url/here HTTP/1.1" 200 9823 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

and leaves out lines like this one, which is a redirect (status code 301):

xxx.xxx.xxx.xxx - - [02/Feb/2012:12:23:36 +0100] "GET /other/url HTTP/1.1" 301 579 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

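I am currently using: tail -f access_log | grep Googlebot

That shows me everything Google crawls, but I read here that you can also use a regexp when tailing a log: http://www.electrictoolbox.com/view-apache-logs-tail-grep-egrep/

Any other suggestions for tools that offer a better way to filter logs are welcome.

Thanks!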

4 answers:

Answer 0: (score: 4)

How about

grep 'HTTP[^"]*" 200 .*Googlebot/2.1' log
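
As a quick check, the pattern can be run against the two sample lines from the question. This is only a minimal sketch: the temporary file and the variable names `log` and `matches` are made up for illustration and are not part of the original answer.

```shell
# Hypothetical sample file holding the two log lines from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
xxx.xxx.xxx.xxx - - [02/Feb/2012:12:21:26 +0100] "GET /some/url/here HTTP/1.1" 200 9823 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx - - [02/Feb/2012:12:23:36 +0100] "GET /other/url HTTP/1.1" 301 579 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# HTTP[^"]*" runs to the closing quote of the request, so " 200 " has to be
# the status field right after it; .*Googlebot/2.1 then matches the user agent.
matches=$(grep 'HTTP[^"]*" 200 .*Googlebot/2.1' "$log")
printf '%s\n' "$matches"
rm -f "$log"
```

Only the /some/url/here line should be printed, because the 301 line has no `" 200 ` directly after the request.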

Answer 1: (score: 3)

Or, more simply

tail -f access_log | grep "Googlebot" | grep 200
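
One caveat with a bare `grep 200` is that it matches the digits anywhere on the line, not just in the status field. The sketch below illustrates this with a made-up log line (a 301 response whose size is assumed to be 2004 bytes) and a tighter pattern; the variable names are hypothetical.

```shell
# Hypothetical line: status 301, but the response size happens to be 2004.
line='xxx.xxx.xxx.xxx - - [02/Feb/2012:12:23:36 +0100] "GET /other/url HTTP/1.1" 301 2004 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

# A bare "200" also matches inside the size field, so the redirect slips through.
loose=$(printf '%s\n' "$line" | grep Googlebot | grep -c 200 || true)

# Requiring the closing quote of the request plus surrounding spaces
# restricts the match to the status field.
strict=$(printf '%s\n' "$line" | grep Googlebot | grep -c '" 200 ' || true)

echo "loose=$loose strict=$strict"

# When following a live log, grep's --line-buffered option keeps the first
# grep from holding its output back in the pipe:
#   tail -f access_log | grep --line-buffered Googlebot | grep '" 200 '
```

Here the loose filter counts one match and the strict one counts none.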

Answer 2: (score: 1)

I am sure there must be something better, but this works for the examples you provided

.+?\[.+?\] ".*?" 200 .+Googlebot.+

Using egrep:

tail access.log  | egrep '.+?\[.+?\] ".*?" 200 .+Googlebot.+'

Answer 3: (score: 1)

If I understand correctly, I would use awk:

awk '$9 ~ /^200$/ { print $0 }' file.txt

If you only want the last 10 lines plus whatever is appended as the log grows, you can try:

tail -f access_log | awk '$9 ~ /^200$/ { print $0 }'

Edit:

I should have been stricter; try:

awk '$9 ~ /^200$/ && $14 ~ /^Googlebot/ { print $0 }' file.txt

or

tail -f access_log | awk '$9 ~ /^200$/ && $14 ~ /^Googlebot/ { print $0 }'
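
The field numbers $9 and $14 depend on where whitespace falls: $14 is the Googlebot token only as long as the referrer and the start of the user agent contain no extra spaces. One possible alternative, offered as an assumption rather than as part of the original answer, is to split on double quotes, so that the status begins the third field and the whole user agent is the sixth. The variable name `result` is hypothetical.

```shell
# The two sample lines from the question, piped in place of a real log file.
result=$(printf '%s\n' \
  'xxx.xxx.xxx.xxx - - [02/Feb/2012:12:21:26 +0100] "GET /some/url/here HTTP/1.1" 200 9823 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' \
  'xxx.xxx.xxx.xxx - - [02/Feb/2012:12:23:36 +0100] "GET /other/url HTTP/1.1" 301 579 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' |
  # With -F'"': $2 = request, $3 = " status size ", $4 = referrer, $6 = user agent.
  awk -F'"' '$3 ~ /^ 200 / && $6 ~ /Googlebot/')
printf '%s\n' "$result"
```

This prints only the 200 line. It assumes the request and referrer contain no escaped double quotes, which would shift the field positions.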