I'm trying to work out why this code locks up on some websites. Here is a working version:
my $url = "http://www.bbc.co.uk/news/uk-36263685";
`curl -L '$url' > ./foo.txt`;
my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
$html = join "\n", <READPAGE>;
close(READPAGE);
# works ok with the BBC page, and almost all others
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
print qq|FOO: got header...\n|;
}
..and then this broken version, which seems to lock up (exactly the same code, just a different URL):
my $url = "http://www.sport.pl/euro2016/1,136510,20049098,euro-2016-polsat-odkryl-karty-24-mecze-w-kanalach-otwartych.html";
`curl -L '$url' > ./foo.txt`;
my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
$html = join "\n", <READPAGE>;
close(READPAGE);
# Locks up with this regex. Just seems to be some pages it does it on
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
print qq|FOO: got header...\n|;
}
I can't work out what's going on. Any ideas?
Thanks!
UPDATE: For anyone interested, I ended up moving away from the Perl module I was using to extract the info, and went for the more robust HTML::Parser approach. Here is the module, in case anyone wants to use it as a base:
package MetaExtractor;
use base "HTML::Parser";
use Data::Dumper;

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;

    if ($tag eq "img") {
        #print Dumper($tag,$attr);
        if ($attr->{src} =~ /\.(jpe?g|png)/i) {
            $attr->{src} =~ s|^//|http://|i;   # fix protocol-relative urls like //foo.com
            push @{$Links::COMMON->{images}}, $attr->{src};
        }
    }

    if ($tag =~ /^meta$/i && $attr->{'name'} =~ /^description$/i) {
        # set if we find <META NAME="DESCRIPTION"
        $Links::COMMON->{META}->{description} = $attr->{'content'};
    } elsif ($tag =~ /^title$/i && !$Links::COMMON->{META}->{title}) {
        $Links::COMMON->{META}->{title_flag} = 1;
    } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:description$/i) {
        $Links::COMMON->{META}->{og_desc} = $attr->{content};
    } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:image$/i) {
        $Links::COMMON->{META}->{og_image} = $attr->{content};
    } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:description$/i) {
        $Links::COMMON->{META}->{tw_desc} = $attr->{content};
    } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:image:src$/i) {
        $Links::COMMON->{META}->{tw_image} = $attr->{content};
    }
}

sub text {
    my ($self, $text) = @_;
    # If we're in <TITLE>...</TITLE>, save the text
    if ($Links::COMMON->{META}->{title_flag}) {
        $Links::COMMON->{META}->{title} .= $text;
    }
}

sub end {
    my ($self, $tag, $origtext) = @_;
    #print qq|END TAG: '$tag'\n|;
    # reset the flag when we see </TITLE>
    if ($tag =~ /^title$/i) {
        $Links::COMMON->{META}->{title_flag} = 0;
    }
}

1;
It will extract:

- the title
- the meta description (not the meta keywords, but that's simple enough to add)
- the FB (og:) image
- the FB (og:) description
- the Twitter image
- the Twitter description
- all images found (it doesn't do anything with them yet, e.g. for pages with relative URLs, but I'll play with that as time permits)

Then just call it with:
my $html;
open (READPAGE,"<:encoding(UTF-8)","/home/aycrca/public_html/cgi-bin/admin/tmp/$now.txt");
my $p = MetaExtractor->new;
while (<READPAGE>) {
    $p->parse($_);
}
$p->eof;
close(READPAGE);
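One small hardening worth making in that driver code (a sketch, not from the original post): use a lexical filehandle and check open for failure, since a missing download otherwise fails silently. Note also that join "\n", <READPAGE> in the earlier snippets adds a second newline after every line, because each line already ends with one; slurping with local $/ avoids that. The temp file here just stands in for the downloaded page:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the curl output file used in the original code
my ($fh_w, $file) = tempfile();
print {$fh_w} "<html><head><title>t</title></head></html>\n";
close $fh_w;

# Lexical filehandle plus an explicit error check
open my $fh, '<:encoding(UTF-8)', $file
    or die "Cannot open $file: $!";
my $html = do { local $/; <$fh> };   # slurp the whole file, no extra newlines
close $fh;

print length($html), " bytes read\n";
```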
Answer 0 (score: 4)
It's not an infinite loop, it's just slow. The pattern also matches <header ...> tags, and for each of those it has to search through the rest of the file for a closing </head> tag that isn't there. Change the pattern so that it cannot match <header (for instance by requiring a word boundary: <head\b) and that cost disappears.

Treating the non-UTF-8 file as UTF-8 seems to make the problem worse.
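The effect of the word boundary can be checked in isolation. A small standalone sketch (not part of the original answer):

```perl
use strict;
use warnings;

# <head\b matches <head> and <head lang="en"> but not <header>,
# because \b requires a non-word character right after "head"
my %result;
for my $tag ('<head>', '<head lang="en">', '<header>') {
    $result{$tag} = ($tag =~ /<head\b/) ? 1 : 0;
    printf "%-18s => %s\n", $tag, $result{$tag} ? 'matches' : 'no match';
}
```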
Answer 1 (score: 4)
You have found an instance of catastrophic backtracking (q.v.).

Even for the sites where that regex pattern works, the match will be very lengthy and CPU-intensive. You should avoid .*? wherever possible and use negated character classes instead.

Everything should be fine if you use this:

$html =~ m| <head\b[^<>]*> (.*) </head> |gisx

<head.*?> should match just a single HTML tag, but there is nothing to stop the regex engine searching right through to the end of the file for the closing angle bracket. Changing it to <head[^<>]*> allows it to match only non-angle-bracket characters after head, and if there are any they will only be a few characters.

The captured expression is less simple, because you will probably want to match tags contained inside the <head> element, so a negated character class won't work there. However, catastrophic backtracking is almost always the result of multiple wildcards acting at once, so that every possible match of one wildcard has to be tried against every possible match of another, resulting in exponential complexity. With only one wildcard remaining, the regex should work fine.

Note also that I have used alternative delimiters for the regex so that the slashes don't need to be escaped, and I have added a word boundary \b after <head to stop it matching <header or similar.
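As a quick standalone check (my own sketch, not part of the original answer), the revised pattern still captures the head content while refusing to treat <header> as an opening tag:

```perl
use strict;
use warnings;

# A page fragment with a real <head> element and a <header> element
# that the old <head.*?> pattern would also have matched
my $html = <<'HTML';
<html><head><title>Example</title></head>
<body><header>site banner</header>text</body></html>
HTML

my @heads;
while ( $html =~ m| <head\b[^<>]*> (.*) </head> |gisx ) {
    push @heads, $1;
}
print scalar(@heads), " head element(s) found\n";
print "content: $heads[0]\n" if @heads;
```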