Question

I have an HTML document that contains several different (but related) div classes. For example:

<div class="title_dep1"></div>
<div class="title_dep2"></div>
<div class="title_dep3"></div>

I want response.xpath to return all of them. I was thinking of something like a regular expression on the number, such as

response.xpath('//div[@class="title_dep[\d]+"]').extract()

But you can't do that. What is the best way to retrieve all divs of this form?

Answer 1

You can use contains():

from lxml import html

HTML = """<div class="title_dep1">Hi Dervin</div>
<div class="title_dep2">This is the way to grab</div>
<div class="title_dep3">Different divs with the same prefix in @class attribute</div>"""

data = html.fromstring(HTML)
print(data.xpath('//div[contains(@class,"title_dep")]/text()'))

Or you can use re (regular expressions) inside the XPath:

print(data.xpath(r'//div[re:match(@class, "title_dep\d+")]/text()', namespaces={"re": "http://exslt.org/regular-expressions"}))

You need to supply the namespace mapping so that lxml knows what the "re" prefix in the XPath expression stands for.

Both produce the same output:

['Hi Dervin', 'This is the way to grab', 'Different divs with the same prefix in @class attribute']
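
Note that contains() also accepts class names such as "subtitle_dep3" or "title_department". If only "title_dep" followed by digits should match, an anchored pattern with the EXSLT re:test function is stricter. The following is a minimal sketch, assuming lxml is installed; the sample markup and the helper name select_title_deps are made up for illustration:

```python
from lxml import html

# Hypothetical sample markup: only the first two inner divs should match.
HTML = """<div>
<div class="title_dep1">one</div>
<div class="title_dep22">two</div>
<div class="title_department">three</div>
<div class="subtitle_dep3">four</div>
</div>"""

# Prefix-to-URI mapping for the EXSLT regular-expressions extension.
RE_NS = {"re": "http://exslt.org/regular-expressions"}


def select_title_deps(markup):
    """Return the text of divs whose class is exactly title_dep plus digits."""
    tree = html.fromstring(markup)
    # ^ and $ anchor the pattern so the whole attribute value must match.
    return tree.xpath(
        r'//div[re:test(@class, "^title_dep\d+$")]/text()',
        namespaces=RE_NS,
    )


matches = select_title_deps(HTML)
print(matches)  # ['one', 'two']
```

In Scrapy the "re" prefix should already be registered on selectors, so the same expression can be passed to response.xpath without a namespaces argument.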
