I have this:
<table border="1" cellspacing="1" cellpadding="0">
<tbody>
<tr><th class="align-left" style="text-align: left;">Name</th><th>Type</th><th>Size</th><th>Values</th><th>Description</th><th>Attributes</th><th>Default</th></tr>
<tr>
<td>E-mail</td>
<td>text</td>
<td>60</td>
<td>test@test.com</td>
<td> </td>
<td>M</td>
<td>test@test.com</td>
</tr>
<tr>
<td>Phone</td>
<td>text</td>
<td>20</td>
<td>01-250 481 00</td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
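That is what the code looks like.
I want to use a regex/regexp to extract the information in the Values column based on the entry in the Name column on the left, but I don't know whether that is feasible...
For example, I want to search for "Phone" and get "01-250 481 00".
What do you think?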
Answer 0 (score: 5)
Don't use regular expressions to parse HTML. Use an HTML parser to turn the HTML into a DOM tree, then do your work on the DOM. For example:
use HTML::TreeBuilder;
# $html_string is assumed to hold the HTML source
my $root  = HTML::TreeBuilder->new_from_content($html_string);
my $table = $root->look_down(_tag => 'table');
my @rows  = $table->look_down(_tag => 'tr');
for my $row (@rows) {
    # perform your row operation here using HTML::Element methods
    # search, replace, insert, modify content...
    my @columns = $row->look_down(_tag => 'td');
    # we need 1st (Name) and 4th (Values) column;
    # the header row has only <th> cells, so it is skipped here
    if (@columns >= 4) {
        if ($columns[0]->as_trimmed_text() eq "Phone") {
            my $number = $columns[3]->as_trimmed_text();
            ...
        }
    }
}
# if you need to dump the modified tree again...
print $root->as_HTML();
# IMPORTANT: call this once you are done with the DOM tree!
$root->delete();
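To make the lookup reusable, the loop above could be wrapped in a small function. This is only a sketch: the helper name `value_for_name` is made up for illustration, the sample HTML is a trimmed-down copy of the table in the question, and it assumes that Name is always the 1st column and Values the 4th.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): return the text of
# the 4th cell (Values) of the first row whose 1st cell (Name) equals $name,
# or undef if no such row exists.
sub value_for_name {
    my ($html, $name) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);
    my $value;
    for my $row ($root->look_down(_tag => 'tr')) {
        my @cells = $row->look_down(_tag => 'td');
        next unless @cells >= 4;    # skips the <th>-only header row
        if ($cells[0]->as_trimmed_text eq $name) {
            $value = $cells[3]->as_trimmed_text;
            last;
        }
    }
    $root->delete;                  # release the tree once we are done
    return $value;
}

# Trimmed-down sample based on the table in the question
my $html = <<'HTML';
<table>
<tr><th>Name</th><th>Type</th><th>Size</th><th>Values</th></tr>
<tr><td>E-mail</td><td>text</td><td>60</td><td>test@test.com</td></tr>
<tr><td>Phone</td><td>text</td><td>20</td><td>01-250 481 00</td></tr>
</table>
HTML

my $phone = value_for_name($html, 'Phone');
print defined $phone ? "$phone\n" : "Not found\n";
```

Note that the comparison is exact and case-sensitive, so searching for "phone" would not match the "Phone" row.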
Answer 1 (score: 2)
A Mojo::DOM option:
use strict;
use warnings;
use Mojo::DOM;
use List::Util 'first';
# $html is assumed to hold the HTML source shown in the question
my $dom = Mojo::DOM->new($html);
my $query = 'phone';
my @cols = $dom->at('tr')->find('th')->map('text')->each;
my $name_col = 1 + first { $cols[$_] eq 'Name' } 0..$#cols;
my $values_col = 1 + first { $cols[$_] eq 'Values' } 0..$#cols;
my $row = $dom->find('tr')->first(sub {
    my $name = $_->at("td:nth-of-type($name_col)");
    defined $name and $name->text =~ m/\Q$query\E/i;
});
if (defined $row) {
    my $values = $row->at("td:nth-of-type($values_col)")->text;
    print "Values: $values\n";
} else {
    print "Not found\n";
}
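Answer 2 (score: 1)
Parsing HTML with regular expressions is a terrible idea. My current weapon of choice for parsing HTML is Web::Query.
My approach would be to parse the table into a suitable data structure and then let you extract the data you need from it.
Something like this, perhaps...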
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
use Path::Tiny;
use Web::Query;
# Get the HTML. I'm reading it from a file, you might
# need to make an HTTP request.
my $html = path('table.html')->slurp;
# Parse the HTML.
my $wq = wq($html);
# Extract the text from all of the <th> elements.
# These will be the keys of our hash.
my @cols;
$wq->find('th')->each(sub { push @cols, $_->text });
# A hash to store our data
my %data;
# Find all of the <tr> elements in the HTML
$wq->find('tr')->each(sub {
    # Each table row will be a sub-hash in our hash
    my %rec;
    # Find each <td> element in the row.
    # For each element, get the text and match it with the column header.
    # Store the key/value pair in a hash.
    $_->find('td')->each(sub {
        my ($i, $elem) = @_;
        my $key = $cols[$i];
        $rec{$key} = $elem->text;
    });
    # If we have data, then store it in the main hash.
    $data{$rec{Name}} = \%rec if $rec{Name};
});
# Show what we've made.
say Dumper \%data;
# A sample query which extracts all of the values
# from the data structure.
for (keys %data) {
    say "$_ is $data{$_}{Values}";
}
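Once the table is in a hash of hashes like this, looking up a single name is just a hash access. The sketch below hard-codes the structure I would expect the code above to build for the table in the question (that shape is an assumption, with blank cells shown as empty strings), and `values_for` is a made-up helper name used only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Assumed shape of %data for the table in the question:
# one sub-hash per row, keyed by the row's Name.
my %data = (
    'E-mail' => {
        Name => 'E-mail', Type => 'text', Size => 60,
        Values => 'test@test.com', Description => '',
        Attributes => 'M', Default => 'test@test.com',
    },
    'Phone' => {
        Name => 'Phone', Type => 'text', Size => 20,
        Values => '01-250 481 00', Description => '',
        Attributes => '', Default => '',
    },
);

# Hypothetical helper: return the Values field for a given Name,
# or undef if there is no row with that name.
sub values_for {
    my ($data, $name) = @_;
    return exists $data->{$name} ? $data->{$name}{Values} : undef;
}

say values_for(\%data, 'Phone') // 'Not found';
say values_for(\%data, 'Fax')   // 'Not found';
```

The hash keys are case-sensitive, so "Phone" and "phone" are different keys here; normalise the names with `lc` when building the hash if you need a case-insensitive search.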