lxml xpath() doesn't return what I expect

Time: 2017-03-14 11:41:28

Tags: python html web-scraping lxml

I'm following this guide (http://docs.python-guide.org/en/latest/scenarios/scrape/) to scrape a website (https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/) with the lxml package, and I can't figure out what's going wrong.

I have this code:

from lxml import html
import requests

page = requests.get('https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/')
tree = html.fromstring(page.content)

floor = tree.xpath('//div[@class="column floor"]/text()')
sf = tree.xpath('//div[@class="column rsf"]/text()')

floor and sf each come back as a list of '\n\t\t\t\t\t' values instead of the integers you would expect from the site's actual HTML ("20" and "5117" in the excerpt below):

<div class="availabilityWrap">
    <h3>Availabilities</h3>

    <div class="availabilityRow headerRow">
        <div class="column floor">

            <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
 target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>

    </div>
        <div class="column rsf">
            <p><b>5117</b></p>
        </div>
        <div class="column divisible">
            <p><b>yes</b></p>
        </div>
        <div class="column date">
            <p><b>05/01/2017</b></p>
        </div>
        <div class="column space">
            <p><b>Office</b></p>
        </div>
        <div class="column description">
            <p><b>model suite</b></p>
        </div>
        <div class="column rent">
            <p><b>$26.55</b></p>
        </div>
    </div>

Shouldn't it simply return all the text inside the "column floor" div? Any help would be great.

1 answer:

Answer 0: (score: 0)

floor = tree.xpath('normalize-space(//div[@class="column floor"])')

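The div contains \n and \t characters for its line breaks and indentation, and those are text nodes too. /text() selects only the text nodes that are direct children of the div, which here are just that whitespace; the "20" sits inside the nested <a> element. normalize-space() instead takes the div's string value (the concatenated text of all its descendants), strips leading and trailing whitespace, and collapses internal runs of whitespace into single spaces.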

In [14]: '''<div class="column floor">
...: 
...:             <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
...:  target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
...: 
...:     </div>'''
Out[14]: '<div class="column floor">\n\n            <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"\n target=\'blank\'><img src="/static/images/pdf.png" class="floorPDF" />20</a>\n\n    </div>'
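To make the difference concrete, here is a minimal sketch that runs the three queries against an inline copy of the question's markup instead of the live page. The shortened href and src values and the variable names (snippet, direct, descendant, normalized) are assumptions made for illustration only.

```python
from lxml import html

# Trimmed copy of the markup from the question; the attribute values are shortened.
snippet = '''<div class="availabilityRow headerRow">
<div class="column floor">

    <a href="/floor.pdf" target="blank"><img src="/pdf.png" class="floorPDF" />20</a>

</div>
<div class="column rsf">
    <p><b>5117</b></p>
</div>
</div>'''

tree = html.fromstring(snippet)

# Direct child text nodes only: just the whitespace around the <a> element.
direct = tree.xpath('//div[@class="column floor"]/text()')

# All descendant text nodes: the same whitespace plus the "20" inside <a>.
descendant = tree.xpath('//div[@class="column floor"]//text()')

# String value of the div with whitespace stripped and collapsed.
normalized = tree.xpath('normalize-space(//div[@class="column floor"])')

print(direct)
print([s.strip() for s in descendant if s.strip()])
print(normalized)
```

Using //text() and stripping the results in Python is an alternative to normalize-space() when you want each text node separately.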

Edit: to get the text of each matching div, loop over them:

for div in tree.xpath('//div[@class="column floor"]'):
    print(div.xpath('normalize-space(.)')) # `.` means current node
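The loop matters because, when normalize-space() is given a node set, XPath 1.0 converts only the first node in document order to a string. The sketch below shows that, then walks each row and converts the values to integers. It assumes the column divs are direct children of each availabilityRow div, as in the question's excerpt; the second row and the names sample, first_only and rows are made up for illustration, and the real page may contain header cells or number formats this does not handle.

```python
from lxml import html

# Markup modeled on the question's excerpt. The second row is invented sample data.
sample = '''<div class="availabilityWrap">
  <div class="availabilityRow headerRow">
    <div class="column floor">
      <a href="#"><img src="/static/images/pdf.png" class="floorPDF" />20</a>
    </div>
    <div class="column rsf"><p><b>5117</b></p></div>
  </div>
  <div class="availabilityRow">
    <div class="column floor">
      <a href="#"><img src="/static/images/pdf.png" class="floorPDF" />21</a>
    </div>
    <div class="column rsf"><p><b>4800</b></p></div>
  </div>
</div>'''

tree = html.fromstring(sample)

# On a node set, normalize-space() uses the first matching div only.
first_only = tree.xpath('normalize-space(//div[@class="column floor"])')

rows = []
for row in tree.xpath('//div[contains(@class, "availabilityRow")]'):
    # Relative paths (no leading //) look only at this row's child divs.
    floor = row.xpath('normalize-space(div[@class="column floor"])')
    rsf = row.xpath('normalize-space(div[@class="column rsf"])')
    # Skip rows whose cells are not plain digits, such as column titles.
    if floor.isdigit() and rsf.isdigit():
        rows.append((int(floor), int(rsf)))

print(first_only)
print(rows)
```

Reading both cells from the same row element keeps each floor paired with its own area, which two separate document-wide queries would not guarantee if some rows lacked a column.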