我尝试过使用Table Extractor但是我没有得到我需要的东西..
从这个(http://finance.yahoo.com/q/pr?s=yhoo)html文件我需要获得td中存在的公司地址,其中class =" yfnc_modtitlew1"在表中id =" yfncsumtab"
use LWP::Simple;
use HTML::Parser;
use HTML::TokeParser;
use HTML::TableExtract;
my $file= "Tickers\\EBAY\\EBAY_profile.html";
my $te = HTML::TableExtract->new( attribs => { id => "yfncsumtab" });
$te->parse_file($file);
$te->tables;
my @arr;
foreach my $row ($te->rows)
{
push @arr,@$row;
print @$row ;
}
提前致谢
答案 0 :(得分:1)
一种方法是通过出色的XML::LibXML module。虽然您可能认为该模块只接受格式良好的XML,但事实证明 - 正如 Grant McLean 中所解释的那样同样出色tutorial - ;
实际上,解析器有一个HTML模式,可以处理像<img>
和<br>
这样的未关闭标记,甚至能够从格式不正确的HTML导致的解析错误中恢复。
为此,您将一个true值传递给构造函数上的recover标志;
my $url = 'http://finance.yahoo.com/q/pr?s=yhoo' ;
my $dom = XML::LibXML->load_html(
location => $url ,
recover => 1 ,
);
say $dom->toStringHTML();
可以使用上面的代码段来验证是否成功获取了URL。这将在STDERR上产生任何错误,因此在验证您已获得数据后,再次运行它将输出重定向到空设备./script > /dev/null
,以便您可以看到LibXML在HTML中找到了哪些错误。
完成此操作并让自己感到满意之后,没有任何重要内容被传递过来,您可以将suppress_errors
标志添加到构造函数中并使用XPATH查询来提取您所追求的数据;
use v5.12;
use XML::LibXML;
my $url = 'http://finance.yahoo.com/q/pr?s=yhoo' ;
my $table_id = "yfncsumtab" ;
my $td_class = "yfnc_modtitlew1" ;
my $xpath = sprintf '//table[@id="%s"]//td[contains(@class, "%s")]' ,
$table_id , $td_class ;
my $dom = XML::LibXML->load_html(
location => $url ,
recover => 1 ,
suppress_errors => 1
);
# say $dom->toStringHTML();
for my $td ($dom->findnodes($xpath)) {
say $_->textContent for $td->childNodes ;
}
除了详细解释这一点之外,我还不能再次向您推荐格兰特的教程。
答案 1 :(得分:1)
我不认为您的表的设置方式使TableExtract成为正确的工具。 TableExtract实际上将表格转换为Excel电子表格,然后使用行和/或列号检索数据,例如cell(0, 4)
。您无法通过其id属性选择行。
TableExtract适用于以下表格:
Q1 Q2 Q3
teamA 10 20 30
teamB 40 50 60
teamC 70 80 90
事实上,雅虎html程序员滥用表格来表示非表格数据 - 但雅虎是老派。
但是,您定位的行是表格的第二行,因此您可以在您的html上使用TableExtract,如下所示:
use strict;
use warnings;
use 5.020;
use LWP::Simple;
use HTML::TokeParser::Simple;
use HTML::TableExtract;
use Data::Dumper;
my $url = "http://finance.yahoo.com/q/pr?s=yhoo";
my $html_string = get($url) or die "Couldn't download webpage!";
my $target_table_id = "yfncsumtab";
my $table_extractor = HTML::TableExtract->new(
attribs => { id => $target_table_id },
);
$table_extractor->parse($html_string);
my $table = $table_extractor->first_table_found()
or die "No matching tables!";
my $text = $table->cell(1, 0); #second row, first column
my @lines = split "\n", $text, 5;
for my $line (splice @lines, 0, 4) {
say $line;
}
__END__
Yahoo! Inc.
701 First Avenue
Sunnyvale,
CA 94089
答案 2 :(得分:0)
您需要遍历表格,然后遍历行:
<?php
/*Purchase (basic)
In the purchase example we require several variables (store_id, api_token, order_id, amount, pan, expdate, and
crypt_type). Please refer to Appendix A. Definition of Request Fields for variable definitions.*/
// ------ Requires the actual API file.
// ------ the proper path
This can be placed anywhere as long as you indicate
require "../mpgClasses.php";
// ------ Define all the required variables.
These can be passed by whatever means you wish
$store_id = ‘store1’;
$customerid = ‘student_number’;
$api_token = ‘yesguy’;
$orderid = ‘need_unique_orderid’;
$pan = ‘5454545454545454’;
$amount = ’12.00’;
$expirydate = ‘0612’;
$crypttype = ‘7’;
// ------ step 1) create transaction hash
$txnArray=array(‘type’=>'purchase',
‘order_id’=>$orderid,
‘cust_id’=>$customerid,
‘amount’=>$amount,
‘pan’=>$pan,
‘expdate’=>$expirydate,
‘crypt_type’=>$crypttype
);
// ------ step 2) create a transaction object passing the hash created in step 1.
$mpgTxn = new mpgTransaction($txnArray);
// ------ step 3) create a mpgRequest object passing the transaction object created in step 2
$mpgRequest = new mpgRequest($mpgTxn);
// ------ step 4) create mpgHttpsPost object which does an https post
$mpgHttpPost =new mpgHttpsPost($store_id,$api_token,$mpgRequest);
// ------ step 5) get an mpgResponse object
$mpgResponse=$mpgHttpPost->getMpgResponse();
// ------ step 6) retrieve data using get methods. Using these methods you can retrieve the
// ------ appropriate variables (getResponseCode) to check if the transactions is approved
// ------ (=>0 or <50) or declined (>49) or incomplete (NULL)
print ("\nCardType = " . $mpgResponse->getCardType());
print("\nTransAmount = " . $mpgResponse->getTransAmount());
print("\nTxnNumber = " . $mpgResponse->getTxnNumber());
print("\nReceiptId = " . $mpgResponse->getReceiptId());
print("\nTransType = " . $mpgResponse->getTransType());
print("\nReferenceNum = " . $mpgResponse->getReferenceNum());
print("\nResponseCode = " . $mpgResponse->getResponseCode());
print("\nISO = " . $mpgResponse->getISO());
print("\nMessage = " . $mpgResponse->getMessage());
print("\nAuthCode = " . $mpgResponse->getAuthCode());
print("\nComplete = " . $mpgResponse->getComplete());
print("\nTransDate = " . $mpgResponse->getTransDate());
print("\nTransTime = " . $mpgResponse->getTransTime());
print("\nTicket = " . $mpgResponse->getTicket());
print("\nTimedOut = " . $mpgResponse->getTimedOut());
/********* End of Code *********/
?>