I have a search-results source page that looks like this (only a small portion of it, of course):
<div class="search-results-list-item clearfix is-collapsed is-topad-list-item ">
  <div class="list-item-data">
    <h2 class="list-item-title">
      <a href="http://www.mylink.com" name="61492088">Description</a>
    </h2>
    <div class="list-item-location">
      <span>Rimini</span>
    </div>
  </div>
  <div class="list-item-price">
    <span>2.000 €</span>
  </div>
  <div class="list-item-actdate">
    <span>16 February</span>
  </div>
</div>
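My program should print only the links (in the example, the link inside the "list-item-data" div) of items whose "list-item-actdate" contains the word "Today". No other links should be printed, so in my example the only link in the snippet would not be printed.
I thought about using BeautifulSoup, but I don't know how to apply it to this problem.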
Answer 0 (score: 0)
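Here is one way to do this using lxml.html instead of BeautifulSoup. It uses XPath to search the document and extract the relevant parts. It should give you an idea of how to work with HTML (or XML) documents.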
import lxml.html

# Parse the HTML document. parse() expects a filename or file object,
# not the file's contents, so pass the path directly.
html = lxml.html.parse('/path/to/source/file')

# Find the div elements that have a child div with class='list-item-data'
for parent in html.xpath("//div[@class='list-item-data']/.."):
    # Get and check the date.
    # Note: xpath returns a list of elements; here we assume only the first match
    # is of interest (based on the stated structure of the document).
    date = parent.xpath("./div[@class='list-item-actdate']/span")[0].text
    # Skip items with an empty date or a date that does not contain "Today"
    if not date or "Today" not in date:
        continue
    # Print the link address
    href = parent.xpath(".//a")[0].attrib['href']
    print(href)
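
Since the question mentions BeautifulSoup, the same filter could also be sketched with it. This is only a minimal sketch under some assumptions: the bs4 package is installed, every result item carries the "search-results-list-item" class as in the snippet above, and the date text contains the literal word "Today". The function name links_dated_today, the second sample item, and its example.com URL are made up for illustration and do not come from the original page.

```python
from bs4 import BeautifulSoup


def links_dated_today(html_text):
    """Return the hrefs of result items whose date cell contains 'Today'."""
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    # class_ matches any element whose class list includes the given name,
    # so the extra classes on the outer div do not matter.
    for item in soup.find_all("div", class_="search-results-list-item"):
        date_div = item.find("div", class_="list-item-actdate")
        data_div = item.find("div", class_="list-item-data")
        if date_div is None or data_div is None:
            continue  # item does not have the expected structure
        if "Today" not in date_div.get_text():
            continue  # not dated today, skip it
        anchor = data_div.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


# Hypothetical sample: the first item mirrors the question's snippet,
# the second one is invented so that there is something to match.
SAMPLE = """
<div class="search-results-list-item clearfix is-collapsed is-topad-list-item ">
  <div class="list-item-data">
    <h2 class="list-item-title">
      <a href="http://www.mylink.com" name="61492088">Description</a>
    </h2>
  </div>
  <div class="list-item-actdate"><span>16 February</span></div>
</div>
<div class="search-results-list-item clearfix">
  <div class="list-item-data">
    <h2 class="list-item-title">
      <a href="http://www.example.com/today">Another description</a>
    </h2>
  </div>
  <div class="list-item-actdate"><span>Today 10:15</span></div>
</div>
"""

for href in links_dated_today(SAMPLE):
    print(href)
```

Returning a list instead of printing inside the loop keeps the filtering separate from the output, so the function can be pointed at a downloaded page or a saved file without changes.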