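Following my previous question, I tried several approaches for parsing table information from a website, such as HTML::TableExtract and HTML::Parser, but none of them worked for me. Here is my code: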
my $browser = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0, } );
my $url = 'http://reitdata.com/';
my $response = $browser->get($url);
die "Error at $url\n ", $response->status_line, "\n Aborting" unless $response->is_success;
my $te = HTML::TableExtract->new( headers => [qw(REIT PERIOD MKT DPU YIELD NAV GEARING ASSETS)]);
$te->parse($browser);
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
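The code above produces no output. What is wrong with the way it fetches the table information from the website? Also, can I print the information from the website in tabular form? Below is the HTML code of the table.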
<select name="ww" size="1" style="font-family: sans-serif; font-size: 9pt;" onchange="location.href = '/~sipesoft/cgi/sipesoft.cgi?report=ndashboard-'+ document.myform.family.value + ':' + document.myform.rpt.value + '*' + document.myform.ww.value"><option selected value="201730">201730 </option>
<option value="201729">201729 </option>
<option value="201728">201728 </option>
<option value="201727">201727 </option>
<option value="201726">201726 </option>
<option value="201725">201725 </option>
<option value="201724">201724 </option>
<option value="201723">201723 </option>
<option value="201722">201722 </option>
</tr>
<tr>
<td><hr color="#000000" size="2"></td>
</tr>
<tr>
<td>
<table border=0 align=center cellspacing=0 cellpadding=0>
<tr>
<td>
<table border=1 align=left cellspacing=3 cellpadding=2>
<tr>
<td align="center" valign="bottom" bgcolor="#C0C0C0" width="45"><b><font face="Tahoma" size="1">Name</font></b></td>
<td align="center" valign="bottom" bgcolor="#C0C0C0" width="60"><b><font face="Tahoma" size="1">Age</font></b></td>
<td align="center" valign="bottom" bgcolor="#C0C0C0" width="40"><b><font face="Tahoma" size="1">Mark<br>Count</font></b></td>
<td align="center" valign="bottom" bgcolor="#C0C0C0" width="40"><b><font face="Tahoma" size="1">Grade</font></b></td>
<td align="center" valign="bottom" bgcolor="#C0C0C0" width="40"><b><font face="Tahoma" size="1">Hobby</font></b></td>
<td align="center" valign="bottom" bgcolor="#C0C0C0" width="40"><b><font face="Tahoma" size="1">Attendence</font></b></td>
</tr>
</table>
Answer 0 (score: 3)
Just so that we are on the same page, here is how tables can be extracted from this very page:
use warnings;
use strict;
use feature 'say';
use LWP::UserAgent;
use HTML::TableExtract;
my $url = 'https://stackoverflow.com/q/45452726/4653379';
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
die "Error at $url\n ", $response->status_line if not $response->is_success;
my $page = $response->decoded_content;
my $te = HTML::TableExtract->new;
$te->parse($page);
foreach my $tbl ($te->tables) {
    say "Table (", join(',', $tbl->coords), ")";
}
with output
Table (1,0) ... Table (0,3)
Here is the table from the URL in the question, with a caveat that is explained at the end.
use warnings;
use strict;
use open ':std', ':encoding(UTF-8)';
use LWP::UserAgent;
use HTML::TableExtract;
use Text::Table;
my $url = q(http://reitdata.com/);
my $ua = LWP::UserAgent->new;
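my $response = $ua->get($url);
die "Error at $url\n ", $response->status_line if not $response->is_success;
my $page = $response->decoded_content; # the HTML text, which is what parse() needs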
my @headers = qw(REIT PERIOD MKT DPU YIELD NAV GEARING ASSETS);
my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($page);
my @data;
foreach my $tbl ( ($te->tables)[1] ) { # just the second one
    foreach my $row ($tbl->rows) {
        my @row = map { s{^\s*|\s*$}{}gr } @$row;
        push @data, \@row;
    }
}
my $tb = Text::Table->new( map { $_, \' ' } @headers ); #'
$tb->load( @data );
print $tb;
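The regex in the map block uses the non-destructive /r modifier, which returns the changed string and leaves the original unchanged. This needs v5.14.0 or later; on older perls use map { s{..}{}g; $_ } instead.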
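As a minimal sketch of the difference, the following trims whitespace both ways. The cell values are made up for illustration. The pre-5.14 variant here copies each value before substituting, which is a slight variation on the one-liner above, since that one-liner edits the original elements in place.

use strict;
use warnings;

# Hypothetical cell values with stray whitespace, standing in for one table row.
my @cells = ("  Sabana REIT ", "\t0.810\n", "7.222% ");

# Perl 5.14+: /r returns the modified copy and leaves @cells untouched.
my @trimmed = map { s{^\s+|\s+$}{}gr } @cells;

# Older perls: copy each value first, then substitute on the copy.
my @trimmed_old = map { my $c = $_; $c =~ s{^\s+|\s+$}{}g; $c } @cells;

print join('|', @trimmed), "\n";

Both lists end up with the same trimmed strings, and @cells still holds the original padded values afterwards.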
The table is printed using Text::Table. Good old printf can do the job as well.
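For the printf route, one possible sketch is to compute each column's width from the data and build a single format string from those widths. The three columns and two rows below are copied by hand from the output in this answer purely for illustration, and the variable names are not from the original code.

use strict;
use warnings;

my @headers = qw(REIT MKT YIELD);
# Hypothetical sample rows; in the real script these would come from @data.
my @data = (
    [ 'Sabana REIT', '$0.450', '7.222%' ],
    [ 'SPHREIT',     '$1.000', '5.520%' ],
);

# Width of each column is the longest cell (or header) in that column.
my @width = map { length $_ } @headers;
for my $row (@data) {
    for my $i (0 .. $#$row) {
        my $len = length $row->[$i];
        $width[$i] = $len if $len > $width[$i];
    }
}

# Build one left-aligned format and reuse it for the header and every row.
my $fmt = join('  ', map { "%-${_}s" } @width) . "\n";
printf $fmt, @headers;
printf $fmt, @$_ for @data;

This avoids the extra module at the cost of computing the widths by hand. Text::Table does that bookkeeping itself.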
For more on processing tables, see this post and this one, and the links in them.
This prints
REIT             PERIOD     MKT    DPU    YIELD  NAV     GEARING ASSETS
SoilbuildBizREIT Q2 – Jun17 $0.710 1.4660 8.259% $0.720  37.90%  Industrial (12) : Business Park 32% + Industrial 68% by NPI
Cache Log Trust  Q2 – Jun17 $0.885 1.8000 8.158% $0.770  43.40%  Industrial (19) : Singapore (83%) + Australia (16%) + China (1%) by Gross Revenue
Viva Ind Tr      Q2 – Jun17 $0.925 1.861  8.069% $0.790  39.10%  Industrial (9) : Biz Park (50.4%) + Light Industrial (23.4%) + Logistics (15.4%) + Hotel (10.8%) by NPI
EC World Reit    Q1 – Mar17 $0.775 1.5410 8.065% $0.900  28.60%  Port, Warehouse & e-Commerce Infrastructure in China
Lippo Malls Tr   Q1 – Mar17 $0.460 0.890  7.739% $0.374  32.20%  Retail (Indonesia) – 20
BHG Retail Reit  Q1 – Mar17 $0.735 1.3900 7.565% $0.820  32.50%  Retail (China) – 5
AIMSAMP Cap Reit Q1 – Jun17 $1.440 2.500  7.500% $1.386  36.30%  Industrial (27) : Singapore + Australia
IREIT Global     Q1 – Mar17 $0.790 1.4400 7.291% $0.672  42.10%  Offices : Germany (5)
Sabana REIT      Q2 – Jun17 $0.450 0.810  7.222% $0.560  37.00%  Industrial (21)
ManulifeREIT USD Q1 – Mar17 $0.920 1.6500 7.174% $0.830  34.20%  Offices : USA (3)
OUE Com Reit     Q1 – Mar17 $0.730 1.230  6.973% $0.860  36.20%  Office (82.6%) + Retail (17.4%) ; Singapore (79.9%) + China (20.1%) by Revenue
OUE Htrust       Q1 – Mar17 $0.755 1.3000 6.887% $0.760  38.10%  Hotel (78%) + Retail (22%) by NPI
Frasers Com Tr   Q3 – Jun17 $1.400 2.398  6.871% $1.520  35.90%  Singapore (52.7%) + Australia (47.3%) by NPI
ESR-REIT         Q2 – Jun17 $0.565 0.9560 6.768% $0.633  37.90%  Industrial (49)
Ascendas-hTrust  2H – Mar17 $0.840 3.010  6.762% $0.920  32.20%  Hotels (11) : Australia (51%) + Japan (29%) + Singapore (14%) + China (6%) by NPI
FHT              Q3 – Jun17 $0.740 1.2374 6.689% $0.749  34.10%  Hotel (9) + Serviced Apt (6) : Australia (38%) + Singapore (20%) + UK (17%) + Japan (14%) + Malaysia (6%) + Germany (5%) by NPI
Mapletree GCC Tr Q1 – Jun17 $1.110 1.851  6.614% $1.244  39.40%  Retail + Office : HK (69.4%) + China (30.6%) by NPI ; Retail (62%) + Office (36.5%) by NPI
Ascott Reit      1H – Jun17 $1.190 3.3560 6.511% $1.190  32.40%  Serviced Apts (73) : Asia Pacific (61.6%) + Europe (28.4%) + US (10%) by Assets
First REIT       Q2 – Jun17 $1.350 2.140  6.393% $1.004  31.00%  Hospitals (13 – 1 in S Korea) + Hotel (Indonesia – 2) + Nursing Home (Singapore – 3)
Mapletree Ind Tr Q1 – Jun17 $1.855 2.9200 6.296% $1.400  29.80%  Industrial (86)
Mapletree Log Tr Q1 – Jun17 $1.200 1.887  6.290% $1.020  39.00%  Industrial (127)
Far East HTrust  Q1 – Mar17 $0.670 0.9300 6.239% $0.903  32.30%  Hotels (65.2%) + Commercial (23.1%) + Serviced Apts (11.7%) by Revenue
CapitaR China Tr 1H – Jun17 $1.660 5.360  6.078% $1.520  35.30%  Retail (China) – 11
Frasers L&I Tr   Q3 – Jun17 $1.095 1.7500 6.076% $0.920  29.30%  Industrial (Australia) – 54
StarhillGbl Reit Q4 – Jun17 $0.780 1.180  6.064% $0.910  35.30%  Retail + Office : Singapore (62.5%) + Australia (23.0%) + Malaysia (12.5%) + Others (2.0%) by Revenue
CDL Htrust       1H – Jun17 $1.600 4.1000 6.031% $1.545  38.70%  Hotels : Singapore (58.1%) + Australia (10.2%) + Maldives (7.6%) + NZ (14.2%) + UK (6.1%) + Japan (3.7%) by NPI
Ascendas Reit    Q1 – Jun17 $2.700 4.049  5.811% $2.040  33.90%  Industrial (132) : Singapore (86%) + Australia (14%) by Valuation
Keppel DC REIT   1H – Jun17 $1.280 3.6300 5.672% $0.931  27.70%  Data Centres – 12 + 1 (Under Devt)
Frasers Cpt Tr   Q3 – Jun17 $2.100 3.000  5.593% $1.920  30.00%  Retail (6) + 31.17% of Hektar (MREIT)
CapitaMall Trust Q2 – Jun17 $2.010 2.7500 5.542% $1.910  34.70%  Retail (16) + Office
SPHREIT          Q3 – May17 $1.000 1.370  5.520% $0.940  25.60%  Retail (2)
Mapletree Com Tr Q1 – Jun17 $1.605 2.2300 5.495% $1.370  36.40%  Retail + Office
CapitaCom Trust  1H – Jun17 $1.720 4.590  5.337% $1.770  35.20%  Office (73%) + Retail (16%) + Hotel (11%) by Gross Rental Income
Suntec Reit      Q2 – Jun17 $1.900 2.4930 5.289% $2.094  36.10%  Office (69%) + Retail (28%) + Convention (3%) by Income
Fortune Reit HKD 1H – Jun17 $9.720 25.530 5.253% $13.390 28.40%  Retail (HK) – 17
Keppel Reit      Q2 – Jun17 $1.160 1.4200 4.897% $1.400  38.50%  Office (8) : Singapore (89%) + Australia (11%) by Asset Value
ParkwayLife Reit Q2 – Jun17 $2.710 3.320  4.576% $1.680  37.40%  Hospitals + Nursing Homes = 49 : Singapore 60% + Japan 40% by Gross Revenue
Saizen REIT      2H – Jun15 $0.033 2.930  0.000% $1.210  35.00%  Residential (Japan) – 136
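A caveat: this is not the second table on the page but the one under the "July 2017" section. The module sees only the first table and this one, which has something to do with the website itself. That is an issue I have to leave for now.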
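One way to check which tables the module actually matched is to list the coordinates, row count and first row of each before picking one by index. This is only a sketch: the HTML string is a made-up stand-in for the downloaded page, the two headers are a shortened list, and @report is a name introduced here for illustration.

use strict;
use warnings;
use feature 'say';
use HTML::TableExtract;

# Hypothetical stand-in for the downloaded page: two tables share the headers.
my $html = <<'HTML';
<table><tr><th>REIT</th><th>YIELD</th></tr><tr><td>Old Reit</td><td>1.0%</td></tr></table>
<h2>July 2017</h2>
<table><tr><th>REIT</th><th>YIELD</th></tr><tr><td>New Reit</td><td>2.0%</td></tr></table>
HTML

my $te = HTML::TableExtract->new( headers => [qw(REIT YIELD)] );
$te->parse($html);
$te->eof;

# One summary line per matched table: its coords, row count and first data row.
my @report;
for my $tbl ($te->tables) {
    my @rows  = $tbl->rows;
    my @first = map { $_ // '' } @{ $rows[0] // [] };
    push @report, sprintf '%s: %d row(s), first row: %s',
        join(',', $tbl->coords), scalar @rows, join(' | ', @first);
}
say for @report;

Running the same loop against the real page should show which entry of $te->tables is the wanted one, so the index does not have to be guessed.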