Question

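I am currently working on a regular expression that extracts the user agent names of bots visiting a website. So far I have managed to get the expression to match, but it does not return the values I expect. Please see the following example: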

#!/usr/bin/perl

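use strict; use warnings;

#Domain names collected from the log file paths (declared so the script compiles under strict)
my @domain;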

while (<>)
{
#Remove any unseen whitespace
chomp($_);

my $i = 0;


#Open every file in turn
open(my $domlog, "<", "$_") or die "cannot open file: $!";

#these were used for testing the open/closing of files
#print "Opened $_";
#print "Closed $_";

#for now confirm the file I'm searching through
print "Opened $_\n";

#Adding the name of the domain to the @domain array for data processing later
push (@domain, $2) if $_ =~ m/(\/usr\/local\/apache\/domlogs\/.*\/)(.*)/;

#search through the currently opened domlog line by line
while (<$domlog>) {

#clear white space again
chomp $_;

#Print the record in full, then print the IP address of the visitor and what should be the useragent name
print "$_\n";
print "$1\n $2\n\n" if $_ =~ m/^(\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})\s(.*)\s.*(\w+[crawl|bot|spider|yahoo|bing|google])?/i;

}

close $domlog;

}

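I am not sure whether my regular expression is too greedy or whether I am using the wildcards incorrectly. Any advice would be greatly appreciated. Thank you.

I completely forgot to include the input because I was focused on the code. I ran the script against some of the domlogs on my server; here is some of the input together with the output I get from it.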

Input
188.165.15.208 - - [13/Jan/2015:09:20:49 -0500] "GET /?page_id=2 HTTP/1.1" 200 10574 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +{{ 3}})"

Output
188.165.15.208
- [13/Jan/2015:09:20:49 -0500] "GET /?page_id=2 HTTP/1.1" 200 10574 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0;

Input
180.76.4.26 - - [13/Jan/2015:10:16:24 -0500] "GET / HTTP/1.1" 200 8744 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

Output
180.76.4.26
- [13/Jan/2015:10:16:24 -0500] "GET / HTTP/1.1" 200 8744 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT

Answer 1

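Without a sample of the expected output, I can only guess at what you are trying to achieve. But there are a few things to point out about your script: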

push (@domain, $2) if $_ =~ m/(\/usr\/local\/apache\/domlogs\/.*\/)(.*)/;

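You are already using the m operator, which lets you change the delimiters so the slashes no longer need escaping. There are also (?:…) non-capturing groups, but you do not even need one in this case. A bare regular expression always matches against $_ when it is not used with =~, so you can drop that part. In list context, a match returns the contents of its capture groups. Now all combined: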

push @domain, m~/usr/local/apache/domlogs/.*/(.*)~;
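
As a rough illustration of that one-liner, here is a small self-contained sketch. The two paths are made up for the example; the point is only that a failed match returns an empty list in list context, so nothing is pushed for paths that do not look like a domlog.

```perl
use strict;
use warnings;

my @domain;

# Hypothetical paths, for illustration only.
my @paths = (
    '/usr/local/apache/domlogs/someuser/example.com',
    '/var/log/unrelated.log',
);

for (@paths) {
    # A bare m~...~ matches against $_. In list context it returns the
    # captured groups, or an empty list when the match fails.
    push @domain, m~/usr/local/apache/domlogs/.*/(.*)~;
}

print "$_\n" for @domain;    # only the first path contributes an entry
```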

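Now on to your other expression. When things get complicated, you should use the /x flag, which makes the pattern much more readable.

. is a special character in regular expressions that matches any character (except a newline by default), so you probably want to escape it. Also, for matching the IP address you can use (?:…):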

(\d{1,3}(?:\.\d{1,3}){3})
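
To see why the escaping matters, here is a minimal sketch comparing the two variants. The second test string is deliberately bogus and made up for the example; the variable names are mine as well.

```perl
use strict;
use warnings;

# Unescaped dots: each "." accepts any single character.
my $unescaped = qr/^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$/;

# Escaped dots, with the repeated part folded into a non-capturing group.
my $escaped = qr/^(\d{1,3}(?:\.\d{1,3}){3})$/;

for my $candidate ('188.165.15.208', '188x165y15z208') {
    printf "%-16s unescaped=%-8s escaped=%s\n", $candidate,
        ($candidate =~ $unescaped ? 'match' : 'no match'),
        ($candidate =~ $escaped   ? 'match' : 'no match');
}
```

Note that neither pattern checks that each octet is at most 255; both only check the shape of the string.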

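[…] matches a single one of the characters listed inside the brackets, so

[crawl|bot|spider|yahoo|bing|google]

can be reduced to

[abcdeghilnoprstwy|]

and would do exactly the same thing. That is obviously not what you want, but it shows where you went wrong. What you probably want is a non-capturing group with alternation. And if you make the group optional, it is allowed to match nothing at all (so remove the ? after the group).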
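
A small sketch of the difference, using two words taken from the sample user agent strings in the question. The anchors and variable names are my own additions to keep the comparison tight.

```perl
use strict;
use warnings;

# Character class: exactly one character out of the listed set.
my $class = qr/^\w+[crawl|bot|spider|yahoo|bing|google]$/i;

# Alternation in a non-capturing group: one of the whole words.
my $group = qr/^\w*(?:crawl|bot|spider|yahoo|bing|google)$/i;

for my $word (qw(AhrefsBot Windows)) {
    printf "%-10s class=%-8s group=%s\n", $word,
        ($word =~ $class ? 'match' : 'no match'),
        ($word =~ $group ? 'match' : 'no match');
}
```

"Windows" satisfies the character class simply because it ends in an "s", which is one of the listed characters; only the alternation rejects it.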

So this is what the whole beast might look like once combined:

if (/^(\d{1,3}(?:\.\d{1,3}){3})                  # $1 - ip address
     \s(.*)\s*                                   # $2 - within spaces
     (\w*(?:crawl|bot|spider|yahoo|bing|google)) # $3 - some bot string
    /xi){                                        # end of regex
  print ("$1\n$2\n$3\n");
}

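This is probably still not what you want, but I do not know what that is. You probably want to make the pattern for $2 non-greedy: (.*?). You can also escape some brackets if you want to match literal ones.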
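
To make the greedy versus non-greedy point concrete, here is a hedged sketch that runs both variants against a record shaped like the first sample from the question. The bot's info URL is a hypothetical placeholder (the original one is not shown), and the variable names are mine.

```perl
use strict;
use warnings;

# Sample record; the URL after "+" is a made-up placeholder.
my $line = '188.165.15.208 - - [13/Jan/2015:09:20:49 -0500] '
         . '"GET /?page_id=2 HTTP/1.1" 200 10574 "-" '
         . '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/info)"';

my $bots = qr/(?:crawl|bot|spider|yahoo|bing|google)/i;

# Greedy (.*) takes as much as it can and leaves only "Bot" for the last group.
my ($ip_greedy, $rest_greedy, $bot_greedy) =
    $line =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s(.*)\s*(\w*$bots)/;

# Non-greedy (.*?) stops as early as it can, so \w* keeps the whole name.
my ($ip_lazy, $rest_lazy, $bot_lazy) =
    $line =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s(.*?)\s*(\w*$bots)/;

print "greedy:     $bot_greedy\n";    # Bot
print "non-greedy: $bot_lazy\n";      # AhrefsBot
```

This only shows the effect of the quantifier; a more robust approach would probably match the quoted fields of the log format explicitly instead of relying on .* at all.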

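Finally, have a look at loghack, since someone may already have done this work for you.

Here is the relevant documentation (these are perldoc pages, so if perldoc is installed on your system you can also run perldoc perlretut):

perlretut: a tutorial on regular expressions.
perlre: the reference documentation for regular expressions.
perlreref: a quick reference that comes in handy once you have at least worked through perlretut.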

Extracting specific user agents from Apache domlogs with Perl
