如何使用perl在html中使用某些特定属性转到特定标记?

时间:2016-04-04 05:30:17

标签: perl

我尝试过使用Table Extractor但是我没有得到我需要的东西..

从这个(http://finance.yahoo.com/q/pr?s=yhoo)html文件我需要获得td中存在的公司地址,其中class =" yfnc_modtitlew1"在表中id =" yfncsumtab"

use LWP::Simple;
use HTML::Parser;
use HTML::TokeParser;
use HTML::TableExtract;

my $file= "Tickers\\EBAY\\EBAY_profile.html";

my $te = HTML::TableExtract->new( attribs => { id => "yfncsumtab" });
$te->parse_file($file);
$te->tables;
my  @arr;
foreach my $row ($te->rows) 
{
   push @arr,@$row;
   print @$row ;
}

提前致谢

3 个答案:

答案 0 :(得分:1)

一种方法是通过出色的XML::LibXML module。虽然您可能认为该模块只接受格式良好的XML,但事实证明 - 正如 Grant McLean 中所解释的那样同样出色tutorial - ;

实际上,解析器有一个HTML模式,可以处理像<img><br>这样的未关闭标记,甚至能够从格式不正确的HTML导致的解析错误中恢复。

为此,您将一个true值传递给构造函数上的recover标志;

my $url = 'http://finance.yahoo.com/q/pr?s=yhoo' ;
my $dom = XML::LibXML->load_html(
    location => $url ,
    recover  => 1  ,
);
say $dom->toStringHTML();

可以使用上面的代码段来验证是否成功获取了URL。这将在STDERR上产生任何错误,因此在验证您已获得数据后,再次运行它将输出重定向到空设备./script > /dev/null,以便您可以看到LibXML在HTML中找到了哪些错误。

完成此操作并让自己感到满意之后,没有任何重要内容被传递过来,您可以将suppress_errors标志添加到构造函数中并使用XPATH查询来提取您所追求的数据;

use v5.12;
use XML::LibXML;

my $url      = 'http://finance.yahoo.com/q/pr?s=yhoo' ;
my $table_id = "yfncsumtab" ;
my $td_class = "yfnc_modtitlew1" ;
my $xpath    = sprintf  '//table[@id="%s"]//td[contains(@class, "%s")]' ,
                        $table_id , $td_class ;

my $dom = XML::LibXML->load_html(
    location => $url ,
    recover  => 1  ,
    suppress_errors => 1
);
# say $dom->toStringHTML();

for my $td ($dom->findnodes($xpath)) {
    say $_->textContent  for $td->childNodes ;
}

除了详细解释这一点之外,我还不能再次向您推荐格兰特的教程。

答案 1 :(得分:1)

我不认为您的表的设置方式使TableExtract成为正确的工具。 TableExtract实际上将表格转换为Excel电子表格,然后使用行和/或列号检索数据,例如cell(0, 4)。您无法通过其id属性选择行。

TableExtract适用于以下表格:

            Q1      Q2    Q3
teamA       10      20    30
teamB       40      50    60
teamC       70      80    90

事实上,雅虎html程序员滥用表格来表示非表格数据 - 但雅虎是老派。

但是,您定位的行是表格的第二行,因此您可以在您的html上使用TableExtract,如下所示:

use strict;
use warnings; 
use 5.020;
use LWP::Simple;
use HTML::TokeParser::Simple;
use HTML::TableExtract;
use Data::Dumper;

my $url = "http://finance.yahoo.com/q/pr?s=yhoo";
my $html_string = get($url) or die "Couldn't download webpage!";

my $target_table_id = "yfncsumtab";

my $table_extractor = HTML::TableExtract->new(
    attribs => { id => $target_table_id },
);

$table_extractor->parse($html_string);

my $table = $table_extractor->first_table_found()
    or die "No matching tables!";

my $text = $table->cell(1, 0);  #second row, first column
my @lines = split "\n", $text, 5; 

for my $line (splice @lines, 0, 4) {
    say $line;
}

__END__
Yahoo! Inc.
701 First Avenue
Sunnyvale, 
        CA 94089

答案 2 :(得分:0)

您需要遍历表格,然后遍历行:

<?php
/*Purchase (basic)
In the purchase example we require several variables (store_id, api_token, order_id, amount, pan, expdate, and
crypt_type). Please refer to Appendix A. Definition of Request Fields for variable definitions.*/

// ------ Requires the actual API file.
// ------ the proper path
This can be placed anywhere as long as you indicate
require "../mpgClasses.php";
// ------ Define all the required variables.
These can be passed by whatever means you wish
$store_id = ‘store1’;
$customerid = ‘student_number’;
$api_token = ‘yesguy’;
$orderid = ‘need_unique_orderid’;
$pan = ‘5454545454545454’;
$amount = ’12.00’;
$expirydate = ‘0612’;
$crypttype = ‘7’;
// ------ step 1) create transaction hash
$txnArray=array(‘type’=>'purchase',
‘order_id’=>$orderid,
‘cust_id’=>$customerid,
‘amount’=>$amount,
‘pan’=>$pan,
‘expdate’=>$expirydate,
‘crypt_type’=>$crypttype
);
// ------ step 2) create a transaction object passing the hash created in step 1.
$mpgTxn = new mpgTransaction($txnArray);
// ------ step 3) create a mpgRequest object passing the transaction object created in step 2
$mpgRequest = new mpgRequest($mpgTxn);
// ------ step 4) create mpgHttpsPost object which does an https post
$mpgHttpPost =new mpgHttpsPost($store_id,$api_token,$mpgRequest);
// ------ step 5) get an mpgResponse object
$mpgResponse=$mpgHttpPost->getMpgResponse();
// ------ step 6) retrieve data using get methods. Using these methods you can retrieve the
// ------ appropriate variables (getResponseCode) to check if the transactions is approved
// ------ (=>0 or <50) or declined (>49) or incomplete (NULL)
print ("\nCardType = " . $mpgResponse->getCardType());
print("\nTransAmount = " . $mpgResponse->getTransAmount());
print("\nTxnNumber = " . $mpgResponse->getTxnNumber());
print("\nReceiptId = " . $mpgResponse->getReceiptId());
print("\nTransType = " . $mpgResponse->getTransType());
print("\nReferenceNum = " . $mpgResponse->getReferenceNum());
print("\nResponseCode = " . $mpgResponse->getResponseCode());
print("\nISO = " . $mpgResponse->getISO());
print("\nMessage = " . $mpgResponse->getMessage());
print("\nAuthCode = " . $mpgResponse->getAuthCode());
print("\nComplete = " . $mpgResponse->getComplete());
print("\nTransDate = " . $mpgResponse->getTransDate());
print("\nTransTime = " . $mpgResponse->getTransTime());
print("\nTicket = " . $mpgResponse->getTicket());
print("\nTimedOut = " . $mpgResponse->getTimedOut());
/********* End of Code *********/
?>