Hello, I would like to know whether my script is suitable; I want the complete URL as the output of my Perl script:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $get = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
if ($get =~ m,(.*?)http:(.*?)\"\)\;\yt.preload.start\(\"(.*?)\"\)\;</script>,sgi){
print "First:$2\n\n";
print "Second:$3\n";
}
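Answer 0 (score: 3)
I really appreciate the DOM features built into Mojo::UserAgent for this sort of thing. You can extract exactly the script you want (too bad YouTube doesn't attach an id to them):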
use v5.10;
use Mojo::UserAgent;
my $script = Mojo::UserAgent->new->
get("http://www.youtube.com/watch?v=Ko0c4QT5aVA" )->
res->
dom->
find('script')->
[1];
my( $yt_preload_start ) = $script =~ m|;\s*yt\Q.preload.start(\E\s*"(.*?)"|;
$yt_preload_start =~ s{\\(.)}{$1}g;
$yt_preload_start =~ s{u0026}{&}g;
say "URL is $yt_preload_start";
I would prefer to use a JavaScript parser to extract the argument of yt.preload.start, but I have no experience with that.
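As a halfway step toward a real parser, the captured string literal could be handed to a JSON decoder instead of hand-written substitutions. This is only a sketch: it assumes the literal contains no escapes beyond those JSON also accepts (such as `\/` and `\uXXXX`), and both the sample string and the `decode_js_string` helper are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: treat the body of a double-quoted JavaScript string
# literal as a JSON string and let the decoder resolve the escapes.
sub decode_js_string {
    my ($literal_body) = @_;
    return JSON::PP->new->allow_nonref->decode(qq{"$literal_body"});
}

# Made-up sample in the escaped form that the substitutions above deal with.
my $raw = 'http:\/\/example.com\/watch?a=1\u0026b=2';
print decode_js_string($raw), "\n";
```

JSON::PP ships with core Perl, so this adds no dependency. A JavaScript literal that uses escapes JSON does not know (for example `\x26`) would make the decoder die, so the call may need to be wrapped in `eval`.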
Answer 1 (score: 0)
Is this better?
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $get = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
if ($get =~ m,(.*?)http:(.*?)\"\)\;\yt.preload.start\(\"(.*?)\"\)\;</script>,sgi){
my $out = $3;
$out =~ s@\\/@/@g;
$out =~ s@\\u0026@\&@g;
print "$out\n";
}
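Answer 2 (score: 0)
It isn't clear to me from your question and code what you are trying to extract from the HTML. In particular, why do you capture everything before the main part of the match and then ignore the capture?
My best guess is that you want all the URLs that appear as arguments to the yt.preload.start JavaScript function. You can do that like this: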
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $html = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
my @urls = $html =~ /\Qyt.preload.start("\E(http[^"]+)/gi;
print map uri_unescape($_)."\n", @urls;
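Edit
This solution leaves the URLs containing the JavaScript Unicode escape "\u0026", which is the same as Perl's "\N{U+0026}" or an ampersand, "&". The strings also begin with "http:\/\/". Correcting these is simple. One way is to replace the map with: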
print map {
my $ss = uri_unescape $_;
$ss =~ s/\\u0026/&/g, $ss =~ s|\\/|/|g;
$ss;
} @urls;
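To try the extraction and clean-up without fetching a live page, the same steps can be run against a hard-coded string. The HTML fragment and the `extract_preload_urls` helper below are invented for illustration; percent-decoding with `uri_unescape` is left out so that the sketch depends only on core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return every yt.preload.start("...") argument found in
# $html, with the JavaScript escapes \/ and \u0026 resolved.
sub extract_preload_urls {
    my ($html) = @_;
    # In list context, a /g match returns the captures from every match.
    my @urls = $html =~ /\Qyt.preload.start("\E(http[^"]+)/gi;
    for my $url (@urls) {
        $url =~ s/\\u0026/&/g;   # JavaScript escape for an ampersand
        $url =~ s{\\/}{/}g;      # escaped forward slash
    }
    return @urls;
}

# Made-up page fragment, not real YouTube output.
my $html = <<'HTML';
<script>yt.preload.start("http:\/\/example.com\/a?x=1\u0026y=2");yt.preload.start("http:\/\/example.com\/b");</script>
HTML

print "$_\n" for extract_preload_urls($html);
```

Wrapping the logic in a subroutine keeps the pattern and the two substitutions together, so they can be checked against saved page snippets before being pointed at `$ua->get(...)->content`.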