Perl: strip span tags surrounding words

Time: 2017-03-26 15:06:11

Tags: perl sed replace

I'm trying to strip span tags whose letter-spacing value starts with 0 or 1.

'<span style="letter-spacing:0.50 px">Boulevard,</span> '
should become
'Boulevard, '

Thanks

Here is an example of a full line.

<span style="letter-spacing:1.33 px">PRODUCTS</span> <span style="letter-spacing:1.37 px">MODEL</span> <span style="letter-spacing:0.77 px">HPI-27C</span> <span style="letter-spacing:1.39 px">MODDED)</span> ; <span style="letter-spacing:1.12 px">(HIGHWAY</span> <span style="letter-spacing:1.33 px">PRODUCTS</span> <span style="letter-spacing:1.37 px">MODEL</span>

It needs to end up like

PRODUCTS MODEL HPI-27C MODDED); (HIGHWAY PRODUCTS MODEL

3 answers:

Answer 0 (score: 1)

Here is an example using Perl and HTML::Parser:

use strict;
use warnings;
use HTML::Parser ();
my $delete_tag = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    default_h => [sub { print shift }, 'text'],
    start_h => [\&start_handler, 'tagname,text,attr'],
    end_h => [\&end_handler, 'tagname,text'],    
);

my $str = do { local $/; <DATA> };
$p->parse($str) || die $!;
$p->eof;    # signal end of input so any buffered text is flushed
print "\n";

sub end_handler {
    my ( $tag, $text ) = @_;
    if ( $tag eq "span" ) {
        if ($delete_tag) {
            $delete_tag = 0;
            return;
        }
    }
    print $text;
}

sub start_handler {
    my ( $tag, $text, $attr ) = @_;
    if ( $tag eq "span" ) {
        # A span may have no style attribute, so default to '' to avoid a warning.
        if ( ( $attr->{style} // '' ) =~ /letter-spacing:[01]\./ ) {
            $delete_tag = 1;
            return;
        }
    }
    print $text;
}

__DATA__
<span style="letter-spacing:1.33 px">PRODUCTS</span> <span style="letter-spacing:1.37 px">MODEL</span> <span style="letter-spacing:0.77 px">HPI-27C</span> <span style="letter-spacing:1.39 px">MODDED)</span> ; <span style="letter-spacing:1.12 px">(HIGHWAY</span> <span style="letter-spacing:1.33 px">PRODUCTS</span> <span style="letter-spacing:1.37 px">MODEL</span>

Output:

PRODUCTS MODEL HPI-27C MODDED) ; (HIGHWAY PRODUCTS MODEL

Answer 1 (score: 0)

Perl one-liners:

1.) Using the Mojo::DOM58 module

perl -0777 -MMojo::DOM58 -E '$d=Mojo::DOM58->new(<>);$d->find("span")->grep(qr/letter-spacing:[01]/)->map(sub{$_->strip});print "$d"' <file.html

2.) Or, if you have Mojolicious installed, you can use the ojo module:

perl -Mojo -E '$d=x(f("file.html")->slurp);$d->find("span")->grep(qr/letter-spacing:[01]/)->map(sub{$_->strip});print "$d"'

Both examples print:

PRODUCTS MODEL HPI-27C MODDED) ; (HIGHWAY PRODUCTS MODEL

Answer 2 (score: -1)

With only the one sample line you posted, your requirements are incomplete, but:

$ sed -E 's#<span[^>]+letter-spacing:[01][^>]+>(.*)</span>#\1#' file
'Boulevard, '

The above will work with any sed that supports -E for EREs, e.g. GNU sed and OSX sed.

Given your updated sample input/output, what you need can be done using GNU awk for its multi-character RS and RT.
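Note that the sed command above captures with a greedy (.*), so on a line holding several spans it would remove only the first opening tag and the last closing tag. A rough sketch of a sed-only variant for the multi-span line follows; it swaps the capture for [^<]* and adds the g flag, and it assumes spans are not nested and contain no other tags. The file name spans.txt and the function name unwrap_spans are hypothetical.

```shell
# Hypothetical input file, for illustration only.
cat > spans.txt <<'EOF'
<span style="letter-spacing:1.39 px">MODDED)</span> ; <span style="letter-spacing:1.12 px">(HIGHWAY</span> <span style="letter-spacing:1.33 px">PRODUCTS</span>
EOF

# [^<]* stops the capture at the next tag, so each span is matched on its own;
# the trailing g repeats the substitution for every span on the line.
unwrap_spans() {
    sed -E 's#<span[^>]+letter-spacing:[01][^>]+>([^<]*)</span>#\1#g' "$@"
}

unwrap_spans spans.txt
```

Like the original command, this relies only on -E, so it should behave the same in GNU sed and OSX sed.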