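Using Scrapy 0.24 Selectors, I want to extract the content of a paragraph, including the text of any elements nested inside it (in the example below, the anchor <a>).
How can I achieve this?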
>>> from scrapy import Selector
>>> html = """
<html>
<head>
<title>Test</title>
</head>
<body>
<div>
<p>Hello, can I get this paragraph content without this <a href="http://google.com">Google link</a>?
</div>
</body>
</html>
"""
>>> sel = Selector(text=html, type="html")
>>> sel.xpath('//p/text()').extract()
[u'Hello, can I get this paragraph content without this ', u'?']
Desired output:
[u'Hello, can I get this paragraph content without this Google link?']
Answer 0 (score: 0)
I would recommend BeautifulSoup. While Scrapy is a complete crawling framework, BS is a powerful parsing library (see "Difference between BeautifulSoup and Scrapy crawler?").
Doc: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Install: pip install beautifulsoup4
For your case:
# 'html' is the string you provided in the question
from bs4 import BeautifulSoup
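# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")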
res = [p.get_text().strip() for p in soup.find_all('p')]
Result:
[u'Hello, can I get this paragraph content without this Google link?']
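The reason //p/text() returns two pieces is that it selects only the text nodes that are direct children of <p>, so the text inside <a> is skipped. get_text() instead joins every descendant text node. As a rough illustration of that idea without any third-party dependency, here is a minimal sketch using only Python 3's standard html.parser module. The ParagraphText class and the paragraph_texts helper are hypothetical names introduced for this example, not part of Scrapy or BeautifulSoup, and treating </div> or </body> as the end of an unclosed <p> is a simplifying assumption made for the sample markup.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the full text of each <p>, including text inside nested tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._parts = None    # text pieces of the open <p>, or None outside one

    def _flush(self):
        # Close the current paragraph, if any, and store its joined text.
        if self._parts is not None:
            self.paragraphs.append("".join(self._parts).strip())
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly ends an unclosed one
            self._parts = []

    def handle_endtag(self, tag):
        # Assumption: the sample markup never closes its <p>, so the
        # enclosing </div> (or </body>) also ends the paragraph.
        if tag in ("p", "div", "body"):
            self._flush()

    def handle_data(self, data):
        # Keep text from the <p> itself and from any element nested in it.
        if self._parts is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def paragraph_texts(markup):
    """Return one stripped string per <p> found in the markup."""
    parser = ParagraphText()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs


sample = (
    "<div>\n"
    '<p>Hello, can I get this paragraph content without this '
    '<a href="http://google.com">Google link</a>?\n'
    "</div>"
)
print(paragraph_texts(sample))
# ['Hello, can I get this paragraph content without this Google link?']
```

Within Scrapy itself, the analogous idea would be to select descendant text nodes rather than child text nodes, for example ''.join(sel.xpath('//p//text()').extract()), or to use the XPath string() function as in sel.xpath('string(//p)').extract(). Neither comes from the answer above, and the result may still need .strip() to drop trailing whitespace.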