How can I access the contents of specific cells in this messy HTML table using Perl?

Asked: 2012-05-10 19:43:10

Tags: perl html-parsing

I have the following messy HTML table, which is used to display a list of records.

<table><tbody>                                                                                                                                                                                                                                                                       <tr id="RECORD_1">
    <td valign="top" class="summary_recnum"><input value="1" name="marked_list_candidates" type="checkbox">&nbsp;1. <div id="ml_indicator_1"> 
    </div>
    <div id="enw_link_1"> 
    </div>
    </td><td class="summary_data"><div>
    <span class="label">Title: </span><a class="smallV110" href="/full_record.do?product=UA&amp;search_mode=GeneralSearch&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=1" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">
    <value lang_id="">A Multitier System for the Verification, Visualization and Management of CHIMERA</value>
    </a>
    </div>
    <div>
    <span class="label">Author(s): </span>Lingerfelt E. J.; Messer O. E. B.; Osborne J. A.; et al.</div>
    <div>
    <span class="label">Editor(s): </span>Sato M; Matsuoka S; Sloot PMA; et al.</div>
    <div>
    <span class="label">Conference:
        </span> <span class="data_bold">
    <value>International Conference on Computational Science (ICCS) on the Ascent of Computational Excellence</value>
    </span>  <span class="label">Location: </span><span class="data_bold">Campus Nanyang Technolog Univ, Singapore, SINGAPORE</span>  <span class="label">Date: </span><span class="data_bold">2011</span>   
    <br>
    <span class="label">Sponsor(s): </span><span class="data_bold">Elsevier; Univ Tsukuba, Ctr Computat Sci</span>   
    </div>
    <span class="label">Source: </span>PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS)&nbsp;&nbsp;<span class="label">Book Series: </span><span class="data_bold">Procedia Computer Science</span> &nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">4</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">2076-2085</span> &nbsp;&nbsp;<span class="label">DOI: </span><span class="data_bold">10.1016/j.procs.2011.04.227</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">2011</span> 
    <div>
    <span class="label">Times Cited: </span><span class="data_bold">0</span> (from All Databases) </div>
    <br>
    <div style="display: inline-block" id="links_1">
    <nobr><span id="links_openurl_1">                                               <a href="javascript:;" onclick="return open_location('OutboundService.do?action=go&amp;mode=fastOpenUrl&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;product=UA&amp;qid=2&amp;doc=1&amp;publisher_id=Oak_Ridge_National_Lab_UT_Battelle_LLC_open&amp;recordID=','openurl');" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"> <img src="http://sfx.ornl.gov/ornl/sfx.gif" border="0" alt="Context Sensitive Links" title="Context Sensitive Links"> </a>  </span><span id="links_full_text_1"> </span><span id="links_doc_del_1"> </span><span id="links_patent_1"> </span></nobr>
    </div>
    <span style="display: inline" class="ViewAbstract1_text" id="ViewAbstract1_text">
        [
        <a title="View the abstract" alt="View the abstract" onclick="return hide_show_abstract('1', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="View the abstract" alt="View the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/expand.gif" id="ViewAbstract1_img">View abstract</a>
        ]
        </span><span style="display: none" class="HideAbstract1_text" id="HideAbstract1_text">
        [
        <a title="Hide the abstract" alt="Hide the abstract" onclick="return hide_show_abstract('1', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="Hide the abstract" alt="Hide the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif" id="HideAbstract1_img">Hide abstract</a>
        ]
        </span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=UA&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=1" id="ViewAbstract_Span1">
    <!----></span></td></tr><tr id="RECORD_2">
    <td valign="top" class="summary_recnum"><input value="2" name="marked_list_candidates" type="checkbox">&nbsp;2. <div id="ml_indicator_2"> 
    </div>
    <div id="enw_link_2"> 
    </div>
    </td><td class="summary_data"><div>
    <span class="label">Title: </span><a class="smallV110" href="/full_record.do?product=UA&amp;search_mode=GeneralSearch&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=2" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">
    <value lang_id="">Gravitational waves from core collapse supernovae</value>
    </a>
    </div>
    <div>
    <span class="label">Author(s): </span>Yakunin Konstantin N.; Marronetti Pedro; <span class="hitHilite">Mezzacappa Anthony</span>; et al.</div>
    <div>
    <span class="label">Conference:
        </span> <span class="data_bold">
    <value>14th Gravitational Wave Data Analysis Workshop (GWDAW-14)</value>
    </span>  <span class="label">Location: </span><span class="data_bold">Univ Rome, Rome, ITALY</span>  <span class="label">Date: </span><span class="data_bold">JAN 26-29, 2010</span>    
    </div>
    <span class="label">Source: </span>CLASSICAL AND QUANTUM GRAVITY&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">27</span> &nbsp;&nbsp;<span class="label">Issue: </span><span class="data_bold">19</span> &nbsp;&nbsp;<span class="label">Special Issue: </span><span class="data_bold">SI</span> &nbsp;&nbsp;&nbsp;&nbsp;<span class="label">Article Number: </span><span class="data_bold">194005</span> &nbsp;&nbsp;<span class="label">DOI: </span><span class="data_bold">10.1088/0264-9381/27/19/194005</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">OCT 7 2010</span> 
    <div>
    <span class="label">Times Cited: </span><a title="View all of the articles that cite this one" href="/CitingArticles.do?product=UA&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;search_mode=CitingArticles&amp;parentProduct=UA&amp;parentQid=2&amp;parentDoc=2&amp;REFID=337695000&amp;betterCount=7" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">7</a> (from All Databases) </div>
    <br>
    <div style="display: inline-block" id="links_2">
    <nobr><span id="links_openurl_2">                                               <a href="javascript:;" onclick="return open_location('OutboundService.do?action=go&amp;mode=fastOpenUrl&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;product=UA&amp;qid=2&amp;doc=2&amp;publisher_id=Oak_Ridge_National_Lab_UT_Battelle_LLC_open&amp;recordID=','openurl');" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"> <img src="http://sfx.ornl.gov/ornl/sfx.gif" border="0" alt="Context Sensitive Links" title="Context Sensitive Links"> </a>  </span><span id="links_full_text_2"> </span><span id="links_doc_del_2"> </span><span id="links_patent_2"> </span></nobr>
    </div>
    <span style="display: inline" class="ViewAbstract2_text" id="ViewAbstract2_text">
        [
        <a title="View the abstract" alt="View the abstract" onclick="return hide_show_abstract('2', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="View the abstract" alt="View the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/expand.gif" id="ViewAbstract2_img">View abstract</a>
        ]
        </span><span style="display: none" class="HideAbstract2_text" id="HideAbstract2_text">
        [
        <a title="Hide the abstract" alt="Hide the abstract" onclick="return hide_show_abstract('2', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="Hide the abstract" alt="Hide the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif" id="HideAbstract2_img">Hide abstract</a>
        ]
        </span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=UA&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=2" id="ViewAbstract_Span2">
    <!----></span></td></tr><tr id="RECORD_3">
    <td valign="top" class="summary_recnum"><input value="3" name="marked_list_candidates" type="checkbox">&nbsp;3. <div id="ml_indicator_3"> 
    </div>
    <div id="enw_link_3"> 
    </div>
    </td><td class="summary_data"><div>
    <span class="label">Title: </span><a class="smallV110" href="/full_record.do?product=UA&amp;search_mode=GeneralSearch&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=3" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">
    <value lang_id="">Protoneutron star evolution and the neutrino-driven wind in general relativistic neutrino radiation hydrodynamics simulations</value>
    </a>
    </div>
    <div>
    <span class="label">Author(s): </span>Fischer T.; Whitehouse S. C.; <span class="hitHilite">Mezzacappa A</span>.; et al.</div>
    <span class="label">Source: </span>ASTRONOMY &amp; ASTROPHYSICS&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">517</span> &nbsp;&nbsp;&nbsp;&nbsp;<span class="label">Article Number: </span><span class="data_bold">A80</span> &nbsp;&nbsp;<span class="label">DOI: </span><span class="data_bold">10.1051/0004-6361/200913106</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">JUL 2010</span> 
    <div>
    <span class="label">Times Cited: </span><a title="View all of the articles that cite this one" href="/CitingArticles.do?product=UA&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;search_mode=CitingArticles&amp;parentProduct=UA&amp;parentQid=2&amp;parentDoc=3&amp;REFID=336434672&amp;betterCount=40" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">40</a> (from All Databases) </div>
    <br>
    <div style="display: inline-block" id="links_3">
    <nobr><span id="links_openurl_3">                                               <a href="javascript:;" onclick="return open_location('OutboundService.do?action=go&amp;mode=fastOpenUrl&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;product=UA&amp;qid=2&amp;doc=3&amp;publisher_id=Oak_Ridge_National_Lab_UT_Battelle_LLC_open&amp;recordID=','openurl');" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"> <img src="http://sfx.ornl.gov/ornl/sfx.gif" border="0" alt="Context Sensitive Links" title="Context Sensitive Links"> </a>  </span><span id="links_full_text_3"> </span><span id="links_doc_del_3"> </span><span id="links_patent_3"> </span></nobr>
    </div>
    <span style="display: inline" class="ViewAbstract3_text" id="ViewAbstract3_text">
        [
        <a title="View the abstract" alt="View the abstract" onclick="return hide_show_abstract('3', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="View the abstract" alt="View the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/expand.gif" id="ViewAbstract3_img">View abstract</a>
        ]
        </span><span style="display: none" class="HideAbstract3_text" id="HideAbstract3_text">
        [
        <a title="Hide the abstract" alt="Hide the abstract" onclick="return hide_show_abstract('3', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="Hide the abstract" alt="Hide the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif" id="HideAbstract3_img">Hide abstract</a>
        ]
        </span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=UA&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=3" id="ViewAbstract_Span3">
    <!----></span></td></tr><tr id="RECORD_4">
    <td valign="top" class="summary_recnum"><input value="4" name="marked_list_candidates" type="checkbox">&nbsp;4. <div id="ml_indicator_4"> 
    </div>
    <div id="enw_link_4"> 
    </div>
    </td><td class="summary_data"><div>
    <span class="label">Title: </span><a class="smallV110" href="/full_record.do?product=UA&amp;search_mode=GeneralSearch&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=4" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">
    <value lang_id="">GENERATION OF MAGNETIC FIELDS BY THE STATIONARY ACCRETION SHOCK INSTABILITY</value>
    </a>
    </div>
    <div>
    <span class="label">Author(s): </span>Endeve Eirik; Cardall Christian Y.; Budiardja Reuben D.; et al.</div>
    <span class="label">Source: </span>ASTROPHYSICAL JOURNAL&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">713</span> &nbsp;&nbsp;<span class="label">Issue: </span><span class="data_bold">2</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">1219-1243</span> &nbsp;&nbsp;<span class="label">DOI: </span><span class="data_bold">10.1088/0004-637X/713/2/1219</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">APR 20 2010</span> 
    <div>
    <span class="label">Times Cited: </span><a title="View all of the articles that cite this one" href="/CitingArticles.do?product=UA&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;search_mode=CitingArticles&amp;parentProduct=UA&amp;parentQid=2&amp;parentDoc=4&amp;REFID=292857312&amp;betterCount=6" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">6</a> (from All Databases) </div>
    <br>
    <div style="display: inline-block" id="links_4">
    <nobr><span id="links_openurl_4">                                               <a href="javascript:;" onclick="return open_location('OutboundService.do?action=go&amp;mode=fastOpenUrl&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;product=UA&amp;qid=2&amp;doc=4&amp;publisher_id=Oak_Ridge_National_Lab_UT_Battelle_LLC_open&amp;recordID=','openurl');" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"> <img src="http://sfx.ornl.gov/ornl/sfx.gif" border="0" alt="Context Sensitive Links" title="Context Sensitive Links"> </a>  </span><span id="links_full_text_4"> </span><span id="links_doc_del_4"> </span><span id="links_patent_4"> </span></nobr>
    </div>
    <span style="display: inline" class="ViewAbstract4_text" id="ViewAbstract4_text">
        [
        <a title="View the abstract" alt="View the abstract" onclick="return hide_show_abstract('4', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="View the abstract" alt="View the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/expand.gif" id="ViewAbstract4_img">View abstract</a>
        ]
        </span><span style="display: none" class="HideAbstract4_text" id="HideAbstract4_text">
        [
        <a title="Hide the abstract" alt="Hide the abstract" onclick="return hide_show_abstract('4', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="Hide the abstract" alt="Hide the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif" id="HideAbstract4_img">Hide abstract</a>
        ]
        </span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=UA&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=4" id="ViewAbstract_Span4">
    <!----></span></td></tr><tr id="RECORD_5">
    <td valign="top" class="summary_recnum"><input value="5" name="marked_list_candidates" type="checkbox">&nbsp;5. <div id="ml_indicator_5"> 
    </div>
    <div id="enw_link_5"> 
    </div>
    </td><td class="summary_data"><div>
    <span class="label">Title: </span><a class="smallV110" href="/full_record.do?product=UA&amp;search_mode=GeneralSearch&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=5" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true">
    <value lang_id="">Understanding Core-Collapse Supernovae</value>
    </a>
    </div>
    <div>
    <span class="label">Author(s): </span>Hix W. R.; Lentz E. J.; Baird M.; et al.</div>
    <div>
    <span class="label">Conference:
        </span> <span class="data_bold">
    <value>10th International Conference on Nucleus-Nucleus Collisions (NN2009)</value>
    </span>  <span class="label">Location: </span><span class="data_bold">Beijing, PEOPLES R CHINA</span>  <span class="label">Date: </span><span class="data_bold">AUG 16-21, 2009</span>   
    <br>
    <span class="label">Sponsor(s): </span><span class="data_bold">China Inst Atom Energy</span>   
    </div>
    <span class="label">Source: </span>NUCLEAR PHYSICS A&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">834</span> &nbsp;&nbsp;<span class="label">Issue: </span><span class="data_bold">1-4</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">602C-607C</span> &nbsp;&nbsp;<span class="label">DOI: </span><span class="data_bold">10.1016/j.nuclphysa.2010.01.104</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">MAR 1 2010</span> 
    <div>
    <span class="label">Times Cited: </span><span class="data_bold">0</span> (from All Databases) </div>
    <br>
    <div style="display: inline-block" id="links_5">
    <nobr><span id="links_openurl_5">                                               <a href="javascript:;" onclick="return open_location('OutboundService.do?action=go&amp;mode=fastOpenUrl&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;product=UA&amp;qid=2&amp;doc=5&amp;publisher_id=Oak_Ridge_National_Lab_UT_Battelle_LLC_open&amp;recordID=','openurl');" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"> <img src="http://sfx.ornl.gov/ornl/sfx.gif" border="0" alt="Context Sensitive Links" title="Context Sensitive Links"> </a>  </span><span id="links_full_text_5"> </span><span id="links_doc_del_5"> </span><span id="links_patent_5"> </span></nobr>
    </div>
    <span style="display: inline" class="ViewAbstract5_text" id="ViewAbstract5_text">
        [
        <a title="View the abstract" alt="View the abstract" onclick="return hide_show_abstract('5', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="View the abstract" alt="View the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/expand.gif" id="ViewAbstract5_img">View abstract</a>
        ]
        </span><span style="display: none" class="HideAbstract5_text" id="HideAbstract5_text">
        [
        <a title="Hide the abstract" alt="Hide the abstract" onclick="return hide_show_abstract('5', 'http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif', 'http://images.webofknowledge.com/WOKRS56B5/images/expand.gif', 'View the abstract', 'Hide the abstract');" href="javascript:;" oncontextmenu="javascript:return IsAllowedRightClick(this);" hasautosubmit="true"><img align="absmiddle" title="Hide the abstract" alt="Hide the abstract" src="http://images.webofknowledge.com/WOKRS56B5/images/collapse.gif" id="HideAbstract5_img">Hide abstract</a>
        ]
        </span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=UA&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;doc=5" id="ViewAbstract_Span5">
    <!----></span></td></tr>
    <input type="hidden" name="all_summary_IDs" value=""><input type="hidden" name="viewAbstractUrl" value="http://apps.webofknowledge.com/ViewAbstract.do?product=UA&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=2&amp;SID=2DI1PEg5Ja24IHi95Fc&amp;page=1&amp;"> <input type="hidden" name="LinksAreAllowedRightClick" value="full_record.do"> <input type="hidden" name="LinksAreAllowedRightClick" value="CitingArticles.do"> <input type="hidden" name="LinksAreAllowedRightClick" value="CitedPatent.do">
     </tbody></table>

I am interested in the contents of td.summary_data in each row, and I tried to parse the table with HTML::TableExtract:

my $te = HTML::TableExtract->new(headers => ["Title"]);
$te->parse($html_string);
# Examine all matching tables
my $count = 1;
foreach my $ts ($te->tables) {
    #print "\n";
    #print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print "$count\n";
        for my $cell (@$row) {
           $cell =~ s/^\s+//;
           $cell =~ s/\s+\z/;/;
           $cell =~ s/\s+/ /g;
        }
        print join("|", @$row), "\n";
        print "\n";
        $count++;
    }
}

Result:

1
Use of uninitialized value $cell in substitution (s///) at test2.pl line 20.
Use of uninitialized value $cell in substitution (s///) at test2.pl line 21.
Use of uninitialized value $cell in substitution (s///) at test2.pl line 22.
Use of uninitialized value $row in join or string at test2.pl line 24.


2
Title: Extreme Scaling of Production Visualization Software on Diverse Architectures Author(s): Childs Hank; Pugmire David; Ahern Sean; et al. Source: IEEE COMPUTER GRAPHICS AND APPLICATIONS??Volume: 30 ??Issue: 3 ??Pages: 22-31 ??Published: MAY-JUN 2010 Times Cited: 2 (from All Databases);

3
Title: Coupling visualization and data analysis for knowledge discovery from multi-dimensional scientific data Author(s): Ruebel Oliver; Ahern Sean; Bethel E. Wes; et al. Book Author(s): Sloot, PMA; Albada, GDV; Dongarra, J Book Group Author(s): ICCS Conference: International Conference on Computational Science (ICCS) Location: Univ Amsterdam, Amsterdam, NETHERLANDS Date: MAY 31-JUN 02, 2010 Sponsor(s): NWO, Netherlands Org Sci Res; KNAW, Royal Netherlands Acad Arts & Sci; Elsevier B V; Univ Amsterdam Source: ICCS 2010 - INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, PROCEEDINGS??Book Series: Procedia Computer Science ??Volume: 1 ??Issue: 1 ??Pages: 1751-1758 ??DOI: 10.1016/j.procs.2010.04.197 ??Published: 2010 Times Cited: 0 (from All Databases) [ View abstract ] [ Hide abstract ];

How can I get the contents of td.summary_data in each row of this table, so that I can extract the information I am interested in?

1 answer:

Answer 0 (score: 3)

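Your table does not have headers. It is not really a table: the author of the page used a table for layout. You can still extract the information you need, though. It is just that the niceties of HTML::TableExtract are not available when a table is used for visual formatting rather than for a tabular display of data.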

#!/usr/bin/env perl

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(file => 'tt.html');

while (my $tag = $parser->get_tag('td')) {
    my $class = $tag->get_attr('class');
    next unless defined $class;
    next unless $class eq 'summary_data';

    my $text = $parser->get_text('/td');

    # do something with the contents of the table cell here
    process_record( \$text );
}

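sub process_record {
    my ($text_ref) = @_;    # reference to the text of one td.summary_data cell

    # extract the fields you are interested in from $$text_ref here
}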
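
As a rough illustration of what process_record could do, here is a minimal sketch that splits the cell text into label/value pairs. It is not part of the original answer. The @labels list is an assumption taken from the labels visible in the sample rows and is probably incomplete, and a title that happens to contain one of these labels followed by a colon would be split in the wrong place.

```perl
use strict;
use warnings;

# Assumption: labels observed in the sample rows; extend as needed.
my @labels = (
    'Title', 'Author(s)', 'Editor(s)', 'Conference', 'Location', 'Date',
    'Sponsor(s)', 'Source', 'Book Series', 'Volume', 'Issue', 'Special Issue',
    'Article Number', 'Pages', 'DOI', 'Published', 'Times Cited',
);

# Build one alternation, longest labels first, with metacharacters escaped.
my $label_re = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @labels;

sub process_record {
    my ($text_ref) = @_;

    # Collapse runs of whitespace, including non-breaking spaces (\xA0).
    (my $text = $$text_ref) =~ s/[\s\xA0]+/ /g;

    # Splitting on a captured "Label:" yields: prefix, label, value, label, value, ...
    my (undef, @pairs) = split /\b($label_re):\s*/, $text;

    my %field;
    while (@pairs) {
        my ($label, $value) = splice @pairs, 0, 2;
        $value = '' unless defined $value;   # split drops a trailing empty value
        $value =~ s/\s+\z//;
        $field{$label} = $value;
    }
    return \%field;
}
```

The returned hash reference can then be queried per record, for example $fields->{'DOI'} or $fields->{'Times Cited'}. The "Times Cited" value still carries the trailing text of the cell (such as "(from All Databases)"), so a leading-digits match is one way to pull out just the count.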

I took out the standard preamble because I am not sure what your input encoding is, but do make sure the streams are set up correctly before creating $parser.
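
For instance, if the saved page happens to be UTF-8 (an assumption; the question does not state the encoding), the streams could be set up roughly like this before the parser is created. The file name tt.html is taken from the answer's code; the choice of UTF-8 layers is only illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Assumption: tt.html is UTF-8 encoded; use a different layer if it is not.
open my $fh, '<:encoding(UTF-8)', 'tt.html'
    or die "Cannot open tt.html: $!";

# Encode anything printed later so non-ASCII characters survive on output.
binmode STDOUT, ':encoding(UTF-8)';

# Hand the already-decoded handle to the parser instead of a file name.
my $parser = HTML::TokeParser::Simple->new(handle => $fh);
```

Passing a handle rather than file => 'tt.html' keeps the decoding decision in one visible place, so the text returned by get_text is already a character string.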