XPath: matching text between two similar tags

Time: 2018-10-21 21:37:13

Tags: xpath web-scraping scrapy lxml text-parsing

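I am trying to scrape a website with a messy structure. The text I need sits between the first run of 5 consecutive br tags (no more and no fewer, exactly 5) and the run of 2 consecutive br tags that follows it. It looks like this: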

<p class="A">
"Some text"
<br>
"Some text"
<br>
<br>
"Some text"
<br>
<br>
<br>
<br>
<br>
"Required text"
<br>
"Required text"
<br>
"Required text"
<br>
<br>
</p>

1 answer:

Answer 0: (score: 1)

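With this markup, the <br> tags end up as newlines in the text that Scrapy extracts (each tag sits on its own line in the source), so you can simply extract all of the text and split it on five consecutive newlines: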

> text = sel.xpath('//text()').extract()
['\n"Some text"\n', '\n"Some text"\n', ...]
> values = ''.join(text).split('\n\n\n\n\n')[1]
'\n"Required text"\n\n"Required text"\n\n"Required text"\n\n\n'
> values.strip().split('\n\n')
['"Required text"', '"Required text"', '"Required text"']
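The split above depends on the page source having a newline around every <br>. If the whitespace differs, a variant that counts the <br> tags themselves may be more robust. The following is only a sketch of that idea, not part of the original answer: it uses the standard-library html.parser instead of Scrapy selectors, and the names BrRunParser and extract_required are made up for illustration. It assumes the wanted text starts after a run of exactly 5 <br> tags and stops at the next run of 2 or more.

```python
from html.parser import HTMLParser


class BrRunParser(HTMLParser):
    """Collect text after a run of exactly 5 <br> tags, up to the next run of 2+."""

    def __init__(self):
        super().__init__()
        self.br_run = 0         # consecutive <br> tags seen since the last real text
        self.capturing = False  # True while inside the wanted region
        self.done = False       # True once the wanted region has ended
        self.required = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.br_run += 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.done:
            return  # whitespace between tags does not interrupt a <br> run
        if self.capturing and self.br_run >= 2:
            # Text that follows 2 or more <br> tags closes the region.
            self.capturing = False
            self.done = True
        elif not self.capturing and self.br_run == 5:
            # Exactly 5 <br> tags (not 4, not 6) open the region.
            self.capturing = True
        if self.capturing:
            self.required.append(text)
        self.br_run = 0


def extract_required(html):
    """Hypothetical helper: return the text pieces found in the wanted region."""
    parser = BrRunParser()
    parser.feed(html)
    parser.close()
    return parser.required


# Same shape as the markup in the question.
SAMPLE = """<p class="A">
"Some text"
<br>
"Some text"
<br>
<br>
"Some text"
<br>
<br>
<br>
<br>
<br>
"Required text"
<br>
"Required text"
<br>
"Required text"
<br>
<br>
</p>"""

print(extract_required(SAMPLE))
```

For the sample markup this prints ['"Required text"', '"Required text"', '"Required text"']. Inside a Scrapy callback you would pass response.text (or the HTML of the matching p element) to the helper.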