Parsing HTML with XPath in Python

Time: 2015-04-29 10:01:14

Tags: python html xpath html-parsing

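I have an HTML document that I am trying to parse with XPath, but I only get empty results. Can anyone tell me where I am going wrong? I have tried everything I could think of without success.

XPath for the labels: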

divLbl=ch.xpath("//div[@class='left-container']/article/ul[@class='list-unstyled row']/li[@class='col-sm-6 mrg-bottom']/span[@class='text-light']")

XPath for the corresponding values:

divVal=ch.xpath("//div[@class='left-container']/article/ul[@class='list-unstyled row']/li[@class='col-sm-6 mrg-bottom']/span[@class='text-light']/strong")

HTML:

<div>
                        <h2 class="rowbreak"><strong>Information of the Car</strong></h2>
                        <ul class=" list-unstyled row">
                            <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-calendar text-light"></span> <span class=" text-light">Make Year:</span> <strong>Aug 2009</strong></li>
                            <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-road text-light"></span> <span class=" text-light">Kilometers:</span> <strong>127,553</strong></li>
                            <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-map-marker text-light"></span> <span class=" text-light">City:</span> 
                                <strong class="carCity_795606">  
                                                                        <a href="javascript:void(0);" onclick="javascript: $( &quot;#maplinkbtn&quot; ).trigger( &quot;click&quot; ); ">
                                    Sambalpur                                    </a>
                                                                    </strong>

                            </li>
                            <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-calendar text-light"></span> <span class=" text-light">Listing Date:</span> <strong>27 Apr 2015</strong></li>
                            <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-user text-light"></span> <span class=" text-light">No. of Owners:</span> <strong> First Owner</strong>
                            </li>
                            <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-tint text-light"></span> <span class=" text-light">Fuel Type:</span> <strong> Petrol</strong></li>
                              <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-user text-light"></span> <span class=" text-light">Posted by:</span> <strong> 
                                  Dealer</strong>
                            </li>
                        </ul>
           </div>

Edited HTML:

 <div>
                    <h2 class="rowbreak"><strong>Information of the Car</strong></h2>
                    <ul class=" list-unstyled row">
                        <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-calendar text-light"></span> <span class=" text-light">Make Year:</span> <strong>Aug 2009</strong></li>
                        <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-road text-light"></span> <span class=" text-light">Kilometers:</span> <strong>127,553</strong></li>
                        <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-map-marker text-light"></span> <span class=" text-light">City:</span> 
                            <strong class="carCity_795606">  
                                                                    <a href="javascript:void(0);" onclick="javascript: $( &quot;#maplinkbtn&quot; ).trigger( &quot;click&quot; ); ">
                                Sambalpur                                    </a>
                                                                </strong>

                        </li>
                        <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-calendar text-light"></span> <span class=" text-light">Listing Date:</span> <strong>27 Apr 2015</strong></li>
                        <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-user text-light"></span> <span class=" text-light">No. of Owners:</span> <strong> First Owner</strong>
                        </li>
                        <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-tint text-light"></span> <span class=" text-light">Fuel Type:</span> <strong> Petrol</strong></li>
                          <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-user text-light"></span> <span class=" text-light">Posted by:</span> <strong> 
                              Dealer</strong>
                        </li>
                    </ul>
       </div>

 <h2 class="rowbreak"></h2>
    <ul class=" list-unstyled row">
                            <li class="col-sm-6 mrg-bottom"><span class=" text-light">One Time Tax :</span> <strong>Individual</strong></li>
                            <li class="col-sm-6 mrg-bottom"><span class=" text-light">Registration No. :</span> <strong>OR03F3141</strong></li>
                            <li class="col-sm-6 mrg-bottom"><span class=" text-light"> Insurance &amp; Expiry :</span> <strong>No Insurance&nbsp;</strong></li>
                            <li class="col-sm-6 mrg-bottom"><span class=" text-light">Registration Place: </span> <strong> Sambalpur</strong></li>
                            <li class="col-sm-6 mrg-bottom"><span class=" text-light">Transmission :</span> <strong>Manual</strong></li>
                            <li class="col-sm-6 mrg-bottom"><span class=" text-light">Color :</span> <strong>Silver</strong></li>
                        </ul>

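1 Answer:

Answer 0 (score: 3)

The XPath expressions you are currently using are quite fragile: you check every element in the chain and rely on "layout-oriented" classes.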
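
To see why the original expressions come back empty, note two details of the posted HTML: the class attributes carry a leading space (`" list-unstyled row"`, `" text-light"`), so an exact comparison such as `@class='list-unstyled row'` does not match, and each `strong` element is a sibling of the `span`, not a child of it. The following is a minimal sketch, assuming lxml is installed and using a trimmed copy of the markup from the question; `sample` and the result variable names are made up for illustration.

```python
from lxml.html import fromstring

# Trimmed copy of the question's markup; note the leading space in the class values.
sample = """
<div>
  <ul class=" list-unstyled row">
    <li class="col-sm-6 mrg-bottom"><span class=" text-light">Color :</span> <strong>Silver</strong></li>
  </ul>
</div>
"""
ch = fromstring(sample)

# Exact string comparison fails because of the leading space in the attribute.
exact = ch.xpath("//ul[@class='list-unstyled row']")

# normalize-space() trims the attribute value before comparing.
normalized = ch.xpath("//ul[normalize-space(@class)='list-unstyled row']")

# <strong> is a sibling of <span>, so span/strong finds nothing...
as_child = ch.xpath("//li/span/strong")
# ...while the following-sibling axis reaches it.
as_sibling = ch.xpath("//li/span/following-sibling::strong/text()")

print(len(exact), len(normalized), len(as_child), as_sibling)
```

The question's path also starts at `div[@class='left-container']/article`, which does not appear in the posted HTML, so that part cannot be checked here.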

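I would start from the h2 element that contains a strong element with the text "Information of the Car", and then take the ul element that follows it. For example, to get all the labels: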

//h2[strong = 'Information of the Car']/following-sibling::ul/li/span/text()

Demo:

In [3]: ch = fromstring(data)

In [4]: ch.xpath("//h2[strong = 'Information of the Car']/following-sibling::ul/li/span/text()")
['Make Year:', 'Kilometers:', 'City:', 'No. of Owners:', 'Fuel Type:', 'Posted by:']
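
The session above does not show where `fromstring` and `data` come from. Here is a self-contained sketch, assuming `fromstring` is `lxml.html.fromstring` and `data` holds the HTML from the question (trimmed here to two list items):

```python
from lxml.html import fromstring  # assumption: the demo uses lxml's HTML parser

# Two items copied from the question's first list.
data = """
<div>
  <h2 class="rowbreak"><strong>Information of the Car</strong></h2>
  <ul class=" list-unstyled row">
    <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-calendar text-light"></span> <span class=" text-light">Make Year:</span> <strong>Aug 2009</strong></li>
    <li class="col-sm-4 mrg-bottom"><span class="glyphicon glyphicon-road text-light"></span> <span class=" text-light">Kilometers:</span> <strong>127,553</strong></li>
  </ul>
</div>
"""

ch = fromstring(data)

# Anchor on the heading text, then walk to the list that follows it.
labels = ch.xpath(
    "//h2[strong = 'Information of the Car']/following-sibling::ul/li/span/text()"
)
print(labels)
```

The empty glyphicon spans contribute no text nodes, so only the label spans show up in the result.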

Example (getting both names and values):

In [25]: for field in ch.xpath("//h2/following-sibling::ul/li"):
   ....:     name = ''.join(field.xpath(".//span/text()")).strip()
   ....:     value = ''.join(field.xpath(".//strong//text()")).strip()
   ....:     print(name, value)
   ....:
Make Year: Aug 2009
Kilometers: 127,553
City: Sambalpur
Listing Date: 27 Apr 2015
No. of Owners: First Owner
Fuel Type: Petrol
Posted by: Dealer
One Time Tax : Individual
Registration No. : OR03F3141
Insurance & Expiry : No Insurance
Registration Place: Sambalpur
Transmission : Manual
Color : Silver
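
If the pairs are needed as data rather than printed, the same loop can fill a dictionary. This is a sketch under the same assumptions (lxml installed, trimmed markup from the question). The helper name `extract_fields` is hypothetical, and stripping the trailing colon from each label is my own choice, not part of the original answer.

```python
from lxml.html import fromstring

def extract_fields(markup):
    """Return {label: value} for every <li> in a <ul> that follows an <h2>."""
    tree = fromstring(markup)
    fields = {}
    for item in tree.xpath("//h2/following-sibling::ul/li"):
        # Join all text nodes, then collapse runs of whitespace (including &nbsp;).
        name = " ".join("".join(item.xpath(".//span/text()")).split())
        value = " ".join("".join(item.xpath(".//strong//text()")).split())
        # Drop the trailing " :" decoration from the label.
        fields[name.rstrip(" :")] = value
    return fields

# Two items copied from the question's second list.
markup = """
<div>
  <h2 class="rowbreak"></h2>
  <ul class=" list-unstyled row">
    <li class="col-sm-6 mrg-bottom"><span class=" text-light">Registration No. :</span> <strong>OR03F3141</strong></li>
    <li class="col-sm-6 mrg-bottom"><span class=" text-light"> Insurance &amp; Expiry :</span> <strong>No Insurance&nbsp;</strong></li>
  </ul>
</div>
"""

fields = extract_fields(markup)
print(fields)
```

A dictionary silently keeps only the last value when two labels are identical; a list of tuples would be the safer choice if the page can repeat a label.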