How do I extract a YouTube URL from JavaScript?

Asked: 2012-03-25 15:21:15

Tags: javascript perl url youtube

Hello, I would like to know whether my script is correct. I want my Perl script to output the complete URL:

#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');

my $get = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
if ($get =~ m,(.*?)http:(.*?)\"\)\;\yt.preload.start\(\"(.*?)\"\)\;</script>,sgi){
    print "First:$2\n\n";
    print "Second:$3\n";
}

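3 answers:

Answer 0: (score: 3)

I really appreciate the DOM features built into Mojo::UserAgent for this sort of thing. You can extract exactly the script you want (too bad YouTube doesn't attach ids to them):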

use v5.10;

use Mojo::UserAgent;

my $script = Mojo::UserAgent->new->
    get("http://www.youtube.com/watch?v=Ko0c4QT5aVA" )->
    res->
    dom->
    find('script')->
    [1];

my( $yt_preload_start ) = $script =~ m|;\s*yt\Q.preload.start(\E\s*"(.*?)"|;
$yt_preload_start =~ s{\\(.)}{$1}g;
$yt_preload_start =~ s{u0026}{&}g;

say "URL is $yt_preload_start";

I would rather use a JavaScript parser to extract the argument to yt.preload.start, but I have no experience with any of them.
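Short of a full JavaScript parser, one option is to treat the captured argument as a JSON string, since the escapes seen here (`\/` and `\u0026`) are also valid JSON escapes. The following is only a sketch under that assumption: the sample literal is made up for illustration, and a JavaScript string that uses escapes JSON does not know (such as `\x26` or single quotes) would fail to decode this way.

```perl
use strict;
use warnings;
use JSON::PP;   # core module since Perl 5.14

# Hypothetical sample of the captured argument, with its surrounding
# double quotes kept so that it forms a complete JSON string.
my $literal = '"http:\/\/example.com\/watch?a=1\u0026b=2"';

# allow_nonref lets the decoder accept a bare string at the top level
# instead of requiring an object or array.
my $url = JSON::PP->new->allow_nonref->decode($literal);

print "$url\n";
```

To use this with the code above, the regex would have to capture the quotes as well, or the quotes would have to be added back around the captured text before decoding.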

Answer 1: (score: 0)

Is this better?

#!/usr/bin/perl
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');

my $get = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
if ($get =~ m,(.*?)http:(.*?)\"\)\;\yt.preload.start\(\"(.*?)\"\)\;</script>,sgi){
    my $out = $3;
    $out =~ s@\\/@/@g;
    $out =~ s@\\u0026@\&@g;
    print "$out\n";
}

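Answer 2: (score: 0)

It isn't clear to me from your question and code what you are trying to extract from the HTML. In particular, why do you capture everything before the main part of the match and then ignore the capture?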
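To illustrate the point about the leading capture: a Perl regex is not anchored to the start of the string unless you ask for it, so the pattern finds yt.preload.start wherever it occurs. Here is a small sketch, using a made-up snippet in place of the real page, that shows both forms capturing the same argument:

```perl
use strict;
use warnings;

# Hypothetical snippet standing in for the fetched page.
my $html = '<script>foo();yt.preload.start("http:\/\/example.com\/a");</script>';

# With a leading capture that is never used afterwards.
my ($unused, $with_prefix) = $html =~ /(.*?)yt\.preload\.start\("(.*?)"\)/s;

# Without it: the engine finds the match wherever it starts.
my ($without_prefix) = $html =~ /yt\.preload\.start\("(.*?)"\)/s;

print "Both forms capture the same text\n" if $with_prefix eq $without_prefix;
```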

My best guess is that you want all of the URLs that appear as the argument to the yt.preload.start JavaScript function. You can do that like this:

use strict;
use warnings;

use LWP::UserAgent;
use URI::Escape;

my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $html = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;

my @urls = $html =~ /\Qyt.preload.start("\E(http[^"]+)/gi;
print map uri_unescape($_)."\n", @urls;

Edit

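This solution leaves the URLs containing the JavaScript Unicode escape "\u0026", which is the same as Perl's "\N{U+0026}", or the ampersand "&". The strings also start with "http:\/\/". Correcting these is simple. One way is to replace the final map with:
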
print map {
  my $ss = uri_unescape $_;
  $ss =~ s/\\u0026/&/g, $ss =~ s|\\/|/|g;
  $ss;
} @urls;