Question

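I am parsing a website with some messy HTML. It has 130 subpages, and the only one that fails is the last one. The failing line is the one highlighted in the loop below (`table = doc.xpath("descendant::table[1]")[0]`). Where I should get 3 elements (the parent and 2 children), I get an empty list instead. All the pages have the same structure, so I don't know how to solve this.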

from lxml.html import parse
# get a list of the urls of the foods to parse
main_site = "http://www.whfoods.com/foodstoc.php"
doc = parse(main_site).getroot()
doc.make_links_absolute()
sites = doc.xpath('/html/body//div[@class="full3col"]/ul/li/a/@href')

for site in sites:
   doc = parse(site).getroot()
   table = doc.xpath("descendant::table[1]")[0]  # <-- fails on the last site: xpath() returns an empty list
   #food info list
   table.xpath("//tr/td/table/tr/td/b/text()")
   # food nutrients list
   table.xpath("//tr/td/table[1]/tr/td/text()")

Here is an excerpt of the HTML from the page that fails (click here if you want to see it in full):

<html>
    <head>
    <body>
        <div id="mainpage">
            <div id="subcontent">
                 (40+ <p> tags with things inside)
                 <p>
                     <table>
                         <tbody>
                             <tr>
                                 <td>
                                     <table>
                                         <tbody>
                                             <tr>
                                                 <td>
                                                     <b>Food's name<br>other things</b>
                                                 </td>
                                             </tr>
                                             <tr>
                                             Heads of the table(not needed)
                                             </tr>
                                             <tr>
                                                 <td>nutrient name</td>
                                                 <td>dv</td>
                                                 <td>density</td>
                                                 <td>rating</td>
                                             </tr>
                                         </tbody>
                                     </table>
                                     <table> Not needed
                                     ...
                            All  remaining closing tags

Answer 1

According to validator.w3.org, when pointed at http://www.whfoods.com/genpage.php?tname=foodspice&dbid=97:

Line 253, column 147: non SGML character number 150

  …ed mushrooms by Liquid Chromatography  Mass Spectroscopy. The 230th ACS Natio…

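The problem character is between "Chromatography" and "Mass". The page is declared as being encoded in ISO-8859-1 but, as often happens in such cases, it is lying: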

>>> import unicodedata as ucd
>>> ucd.name(b'\x96'.decode('cp1252'))  # byte 150 is 0x96
'EN DASH'
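
To make the mismatch concrete, here is a minimal sketch (Python 3, standard library only) that decodes the same byte under both encodings. The sample byte string is an illustrative reconstruction of the fragment quoted by the validator, not the page's actual source.

```python
import unicodedata

# Illustrative reconstruction of the fragment flagged by the validator;
# byte 150 (0x96) sits between "Chromatography" and "Mass".
raw = b"Liquid Chromatography \x96 Mass Spectroscopy"
pos = raw.index(b"\x96")

# Decode the same bytes under the declared encoding and under cp1252.
latin1_char = raw.decode("iso-8859-1")[pos]
cp1252_char = raw.decode("cp1252")[pos]

# ISO-8859-1 maps 0x96 to a C1 control character, which has no Unicode name.
print(hex(ord(latin1_char)), unicodedata.category(latin1_char))  # 0x96 Cc
# cp1252 maps the same byte to a printable punctuation character.
print(hex(ord(cp1252_char)), unicodedata.name(cp1252_char))      # 0x2013 EN DASH
```

Under the declared encoding the byte is a control character, which is what the validator reports as "non SGML character number 150".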

Maybe lxml is picky about this too (Firefox doesn't care).
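
If the encoding mismatch really is what trips lxml (this is a guess, not something confirmed here), one thing to try is to hand lxml the raw bytes together with a parser that overrides the declared encoding. The sketch below assumes Python 3 and lxml. The helper names `parse_html_bytes` and `first_table` are hypothetical, and the sample markup is made up to stand in for the failing page.

```python
from lxml import html


def parse_html_bytes(raw, encoding="windows-1252"):
    """Parse raw HTML bytes, overriding whatever encoding the page declares."""
    parser = html.HTMLParser(encoding=encoding)
    return html.fromstring(raw, parser=parser)


def first_table(doc):
    """Return the first descendant <table>, or None instead of raising IndexError."""
    tables = doc.xpath("descendant::table[1]")
    return tables[0] if tables else None


# Made-up sample standing in for the failing page (not the real markup).
sample = (
    b"<html><body>"
    b"<p>Liquid Chromatography \x96 Mass Spectroscopy</p>"
    b"<table><tr><td><b>Food's name</b></td></tr></table>"
    b"</body></html>"
)

doc = parse_html_bytes(sample)
table = first_table(doc)
print(table is not None)
print(table.xpath(".//b/text()"))
```

In the question's loop, the bytes could come from something like `urllib.request.urlopen(site).read()` rather than `parse(site)`. Checking the result of `first_table` for `None` also makes it easy to log which page failed instead of stopping on an `IndexError`.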

Can't parse table children using XPath
