Question

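I have a really strange problem: I am searching an HTML page for URLs and only want a specific part of each URL. In my test HTML page the link appears only once, but instead of one result I get about 20...

This is the regex I am using: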

perl -ne 'm/http\:\/\/myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'

A sample input would look like this:

<html><body><a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a></body></html>

This is a very simplified example. In reality the link would appear on an ordinary website, surrounded by other content...

My result should look like this:

testfile.zip

But I see this line many times... Is it the regex or something else?

Answer 1

Yes, the regex is greedy.

Use a proper HTML tool: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then URI to extract the specific part.

use 5.010;
use WWW::Mechanize qw();
use URI qw();
use URI::QueryParam qw();

my $w = WWW::Mechanize->new;
$w->get('file:///tmp/so10549258.html');
for my $link ($w->links) {
    my $u = URI->new($link->url);
    # 'http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore'
    say $u->query_param('path');
    # '/foo/bar/afolder/testfile.zip'
    $u = URI->new($u->query_param('path'));
    say (($u->path_segments)[-1]);
    # 'testfile.zip'
}
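
To illustrate what "greedy" means here, the following is a minimal sketch. The input line with two links is made up for the demonstration and does not come from the question; the variable names are likewise only illustrative.

```perl
use strict;
use warnings;

# Hypothetical one-line input with two archive links (not from the question).
my $line = '/afolder/one.zip and /afolder/two.zip';

# Greedy: (.*) grabs as much as possible, then backtracks to the LAST ".zip".
my ($greedy) = $line =~ m{/afolder/(.*)\.zip};

# Non-greedy: (.*?) stops at the FIRST ".zip" after the folder.
my ($lazy) = $line =~ m{/afolder/(.*?)\.zip};

# Negated class: assume a file name has no "/", "?", double quote, or whitespace.
my ($strict) = $line =~ m{/afolder/([^/?"\s]+)\.zip};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "strict: $strict\n";
```

With this input the greedy capture spans both links, while the other two stop at the first file name. A negated character class is often the more robust choice, because it states what a file name may contain instead of relying on backtracking.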

Answer 2

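Are there 20 lines in the file after the link?

Your problem is that the match variables are not reset. The first time the link matches, $1 and $2 get their values. On the following lines the regex does not match, but $1 and $2 still hold the old values. So print only when the regex matches, not on every line.

From perlre, see the section on capture groups:

NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.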
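
The fix described above can be sketched as follows. This is a minimal example under stated assumptions: the helper name archive_names and the three input lines are hypothetical, and the pattern is the one from the question, written with m{...} delimiters to avoid escaping the slashes.

```perl
use strict;
use warnings;

# Hypothetical helper: return "name.ext" for every line that really matches.
sub archive_names {
    my @lines = @_;
    my @names;
    for my $line (@lines) {
        # Use $1 and $2 only inside the branch where THIS match succeeded;
        # after a failed match they would still hold the previous values.
        if ($line =~ m{http://myurl\.com/somefile\.php.+/afolder/(.*)\.(rar|zip|tar|gz)}) {
            push @names, "$1.$2";
        }
    }
    return @names;
}

# Made-up input: one line with the link, followed by lines without it.
my @input = (
    '<a href="http://myurl.com/somefile.php&x=foo?path=/foo/bar/afolder/testfile.zip?more=arguments">x</a>',
    '<p>no link here</p>',
    '<p>still no link</p>',
);

print "$_\n" for archive_names(@input);
```

For the one-liner in the question, the same idea amounts to making the print conditional, for example by writing the statement as print "$1.$2\n" if m/.../ instead of running the match and the print as two separate statements.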

Answer 3

This should do the trick for your sample input and output.

$Str = '<html><body><a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a></body></html>';

# Greedy .+ backtracks to the last "/" that is followed by a name.ext token.
@Matches = ($Str =~ m#path=.+/(\w+\.\w+)#g);
print "@Matches\n";
