Question

I have the source of a search-results page that looks like this (only a small part of it, of course):

<div class="search-results-list-item clearfix is-collapsed is-topad-list-item ">


<div class="list-item-data">
    <h2 class="list-item-title">
        <a href="http://www.mylink.com" name="61492088">Description</a>
    </h2>

            <div class="list-item-location">
        <span>Rimini</span>
    </div>
        </div>

<div class="list-item-price">
    <span>2.000 &euro;</span>
</div>

<div class="list-item-actdate">
    <span>16 February</span>
</div>

</div>

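My program should print only the links (in the example, the link inside the div with class "list-item-data") whose "list-item-actdate" div contains the word "Today". No other links should be printed, so in my example the only link in the code would not be printed.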

I thought about using BeautifulSoup, but I don't know how to use it for my purpose.

Answer 1

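Here is one way to do this using lxml.html instead of BeautifulSoup... It uses XPath to search the document and extract the relevant parts. It should give you an idea of how to process HTML (or XML) documents...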

import lxml.html

# Parse the HTML document (parse() takes a filename or file object,
# not the file's contents)
html = lxml.html.parse('/path/to/source/file')

# find div elements which contain a div child with class='list-item-data'
for parent in html.xpath("//div[@class='list-item-data']/.."):

    # get and check the date
    # note xpath returns a list of elements, here we assume only the first match is of
    # interest (based on the stated structure of the document)
    dates = parent.xpath("./div[@class='list-item-actdate']/span")
    if not dates or "Today" not in (dates[0].text or ""):
        continue

    # print the link address
    href = parent.xpath(".//a")[0].attrib['href']
    print(href)
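
Since the question mentions BeautifulSoup, here is a rough sketch of the same filter with that library, in case it helps. It is only an illustration under a few assumptions: the function name links_dated_today and the SAMPLE_HTML snippet are made up for this example, the "Today 10:15" date text is a guess at what the site shows for today's items, and each result is assumed to sit in a div carrying the "search-results-list-item" class, as in the question's excerpt.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: two result items, only the first is dated "Today".
SAMPLE_HTML = """
<div class="search-results-list-item clearfix is-collapsed">
  <div class="list-item-data">
    <h2 class="list-item-title"><a href="http://www.example.com/a">First</a></h2>
  </div>
  <div class="list-item-actdate"><span>Today 10:15</span></div>
</div>
<div class="search-results-list-item clearfix is-collapsed">
  <div class="list-item-data">
    <h2 class="list-item-title"><a href="http://www.example.com/b">Second</a></h2>
  </div>
  <div class="list-item-actdate"><span>16 February</span></div>
</div>
"""


def links_dated_today(html_text):
    """Return the hrefs of result items whose date block contains 'Today'."""
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    # Each search result is assumed to be one div with this class.
    for item in soup.find_all("div", class_="search-results-list-item"):
        date = item.find("div", class_="list-item-actdate")
        if date is None or "Today" not in date.get_text():
            continue
        # Only look for the link inside the "list-item-data" block.
        data = item.find("div", class_="list-item-data")
        link = data.find("a", href=True) if data is not None else None
        if link is not None:
            links.append(link["href"])
    return links


if __name__ == "__main__":
    for href in links_dated_today(SAMPLE_HTML):
        print(href)
```

One difference from the XPath version: class_ matches a single class even when the element carries several, so extra classes such as "clearfix" or "is-collapsed" do not get in the way, whereas @class='...' in XPath compares against the whole attribute value.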

Selecting search results by date from a site using Python
