Question

I need to parse an HTML file and remove everything except the anchor tags. For example:

<html>
    <body>
        <p>boom</p>
        <a href="/blah" rel="no-follow">Example</a>
    </body>
</html>

I only need to keep:

<a href="/blah" rel="no-follow">Example</a>

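I am using cURL to retrieve the HTML, together with a small snippet I found that strips out all of the markup and leaves only the text inside the tags. This is what I am using: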

curl http://www.google.com 2>&1 | perl -pe 's/\<.*?\>//g'

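Is there a simple command-line way to do this? My end goal is to put it in a bash script and run it. I am having a hard time understanding regular expressions and Perl.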

Answer 1

Use mojo, the Mojolicious command-line tool:

mojo get http://www.google.com 'a'

Output:

<a class="gb1" href="http://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>
<a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a>
<a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a>
<a class="gb1" href="http://www.youtube.com/?tab=w1">YouTube</a>
<a class="gb1" href="http://news.google.com/nwshp?hl=en&amp;tab=wn">News</a>
<a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>
<a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>
<a class="gb1" href="http://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a>
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
<a class="gb4" href="/preferences?hl=en">Settings</a>
<a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;continue=http://www.google.com/" id="gb_70" target="_top">Sign in</a>
<a href="/chrome/index.html?hl=en&amp;brand=CHNG&amp;utm_source=en-hpp&amp;utm_medium=hpp&amp;utm_campaign=en" onclick="google.promos&amp;&amp;google.promos.toast&amp;&amp; google.promos.toast.cl()">Install Google Chrome</a>
<a href="/advanced_search?hl=en&amp;authuser=0">Advanced search</a>
<a href="/language_tools?hl=en&amp;authuser=0">Language tools</a>
<a href="http://www.google.com/chrome/devices/index.html" onclick="google.promos&amp;&amp;google.promos.link&amp;&amp; google.promos.link.cl()">Chromebook: For students</a>
<a href="/intl/en/ads/">Advertising Programs</a>
<a href="/services/">Business Solutions</a>
<a href="https://plus.google.com/116899029375914044550" rel="publisher">+Google</a>
<a href="/intl/en/about.html">About Google</a>
<a href="/intl/en/policies/">Privacy &amp; Terms</a>

For a helpful 8-minute introductory video, check out: Mojocast Episode 5
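Since the end goal is a bash script, the command above can be wrapped in a small function. This is only a sketch: the function name `anchors` and the optional selector argument are illustrative and not part of the answer, and it assumes the `mojo` command from Mojolicious is installed and on `PATH`.

```shell
# anchors URL [SELECTOR]
# Hypothetical helper (the name is not from the answer): prints each element
# matching SELECTOR, one per line. Assumes Mojolicious' `mojo` is on PATH.
anchors() {
    if [ "$#" -lt 1 ]; then
        echo "usage: anchors URL [SELECTOR]" >&2
        return 2
    fi
    # Default to every <a> element when no selector is given.
    mojo get "$1" "${2:-a}"
}
```

After sourcing the script, `anchors http://www.google.com` should behave like the command shown above, and `anchors http://www.google.com 'a[rel]'` would limit the output to anchors that carry a `rel` attribute.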

Answer 2

Using Mojolicious, as in @Miller's answer above, but selecting only the <a ... rel= anchors more precisely:

If you have an HTML file:

perl -Mojo -E 'say $_ for x(b("my.html")->slurp)->find("a[rel]")->each'

Or an online resource:

perl -Mojo -E 'say $_ for g("http://example.com")->dom->find("a[rel]")->each'
#or
perl -Mojo -E 'g("http://example.com")->dom->find("a[rel]")->each(sub{say $_})'

Answer 3

If you want finer-grained control over the HTML, you can use the HTML::TagParser module available on CPAN.

use strict;
use warnings;
use HTML::TagParser;

my $html = HTML::TagParser->new( '<html>
    <body>
        <p>boom</p>
        <a href="/blah" rel="no-follow">Example</a>
    </body>
</html>' );

my @list = $html->getElementsByTagName( "a" );

for my $elem ( @list ) {
    my $name = $elem->tagName;
    my $attr = $elem->attributes;
    my $text = $elem->innerText;
    print "<$name";
    for my $key ( sort keys %$attr ) {
        print " $key=\"$attr->{$key}\"";
    }
    print $text eq "" ? " />" : ">$text</$name>" , "\n";
}

Output:

<a href="/blah" rel="no-follow">Example</a>

Answer 4

Ingy döt Net's pQuery deserves a mention:

perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
 ->find("a")->each(sub{say pQuery($_)->toHtml})'

Just the links:

perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
  ->find("a")->each(sub{say $_->{href}})'

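Although, unlike mojo, it has no command-line tool (not yet, at least; it is not that kind of tool per se and is still "under construction"), this is a module to keep on your watch list.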
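For comparison, the "just the links" variant can probably be reproduced from the shell with the mojo tool mentioned in Answer 1. The sketch below assumes an installed `mojo` whose `get` command accepts the `attr` action after the selector, as its usage text describes; the function name `hrefs` is made up for illustration.

```shell
# hrefs URL
# Hypothetical helper: prints only the href value of every <a> element.
# Assumes `mojo get` supports the `attr` action (check `mojo get --help`).
hrefs() {
    mojo get "$1" a attr href
}
```

In a bash script this could be called as `hrefs http://www.ubu.com/sound/barthes.html`.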

Strip everything except the full anchor tags - Perl
