I'm following the guide (http://docs.python-guide.org/en/latest/scenarios/scrape/) to scrape a website (https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/) with the lxml package, and I can't figure out what is going wrong.
I have this code:
from lxml import html
import requests
page = requests.get('https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/')
tree = html.fromstring(page.content)
floor = tree.xpath('//div[@class="column floor"]/text()')
sf = tree.xpath('//div[@class="column rsf"]/text()')
But floor and sf each come back as a list of '\n\t\t\t\t\t' values, rather than the integers you would expect from the site's actual HTML ("20" and "5117" in the case below):
<div class="availabilityWrap">
<h3>Availabilities</h3>
<div class="availabilityRow headerRow">
<div class="column floor">
<a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>
<div class="column rsf">
<p><b>5117</b></p>
</div>
<div class="column divisible">
<p><b>yes</b></p>
</div>
<div class="column date">
<p><b>05/01/2017</b></p>
</div>
<div class="column space">
<p><b>Office</b></p>
</div>
<div class="column description">
<p><b>model suite</b></p>
</div>
<div class="column rent">
<p><b>$26.55</b></p>
</div>
</div>
Shouldn't it just return all the text inside the "column floor" div class? Any help would be great.
Answer 0: (score: 0)
floor = tree.xpath('normalize-space(//div[@class="column floor"])')
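The div contains \n\t for the newlines and indentation, and those are text too. text() selects only the text nodes that are direct children of the div, which here is just that whitespace; the "20" sits inside the <a> and the "5117" inside <p><b>. You can concatenate all the text and use the normalize-space() function to strip the whitespace: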
In [14]: '''<div class="column floor">
...:
...: <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
...: target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
...:
...: </div>'''
Out[14]: '<div class="column floor">\n\n <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"\n target=\'blank\'><img src="/static/images/pdf.png" class="floorPDF" />20</a>\n\n </div>'
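To make the difference concrete, here is a small sketch that runs the three selections against the same fragment, with no network access needed. The variable names (snippet, direct_text, all_text, cleaned) are only for illustration, and the exact whitespace strings you see may differ with your lxml/libxml2 version.

```python
from lxml import html

# Fragment copied from the question's HTML.
snippet = '''<div class="column floor">
    <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
       target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>'''

# A fragment with a single top-level element parses to that element.
div = html.fromstring(snippet)

# text() selects only text nodes that are direct children of the div,
# i.e. the whitespace around the <a> element.
direct_text = div.xpath('./text()')

# .//text() also descends into children, so it reaches the "20" inside <a>.
all_text = div.xpath('.//text()')

# normalize-space(.) takes the string value of the div (all descendant text
# concatenated), trims both ends, and collapses runs of whitespace.
cleaned = div.xpath('normalize-space(.)')

print(direct_text)
print(all_text)
print(cleaned)
```

The first list holds only whitespace, the second also includes "20", and cleaned is the plain string "20".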
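Edit: normalize-space() applied to a node-set uses only the first matching node, so to get a value for every row, loop over the divs:
for div in tree.xpath('//div[@class="column floor"]'):
    print(div.xpath('normalize-space(.)'))  # `.` means the current node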
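If you want the floor and square footage paired up per row as integers, something along these lines might work. This is only a sketch: parse_rows and SAMPLE are names made up for the example, the markup is a trimmed copy of the HTML in the question, and it assumes every data row has the "availabilityRow" class and purely numeric values (a value like "5,117" would be skipped by the isdigit() check).

```python
from lxml import html

# Trimmed copy of the markup from the question (one row, two columns).
SAMPLE = '''<div class="availabilityWrap">
<div class="availabilityRow headerRow">
<div class="column floor">
<a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>
<div class="column rsf">
<p><b>5117</b></p>
</div>
</div>
</div>'''


def parse_rows(tree):
    """Return a list of (floor, rsf) integer pairs, one per row."""
    rows = []
    # contains() is used because the class attribute holds several names.
    for row in tree.xpath('//div[contains(@class, "availabilityRow")]'):
        # Relative paths keep each lookup inside the current row.
        floor = row.xpath('normalize-space(div[@class="column floor"])')
        rsf = row.xpath('normalize-space(div[@class="column rsf"])')
        # Skip rows whose cells are not plain digits (e.g. label rows).
        if floor.isdigit() and rsf.isdigit():
            rows.append((int(floor), int(rsf)))
    return rows


tree = html.fromstring(SAMPLE)
result = parse_rows(tree)
print(result)  # [(20, 5117)]
```

Against the live page you would build tree from page.content as in the question instead of SAMPLE.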