Using a Scrapy Selector to extract paragraph text, including the content of other elements

Question

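Using Scrapy 0.24 Selectors, I want to extract the content of a paragraph, including the content of other elements nested inside it (in the example below, an anchor <a>). How can I achieve this?

Code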

>>> from scrapy import Selector
>>> html = """
        <html>
            <head>
                <title>Test</title>
            </head>
            <body>
                <div>
                    <p>Hello, can I get this paragraph content without this <a href="http://google.com">Google link</a>?</p>
                </div>
            </body>
        </html>
        """
>>> sel = Selector(text=html, type="html")
>>> sel.xpath('//p/text()').extract()
[u'Hello, can I get this paragraph content without this ', u'?']

Output

[u'Hello, can I get this paragraph content without this ', u'?']

Expected output

[u'Hello, can I get this paragraph content without this Google link?']

Answer 1

I would recommend BeautifulSoup. Scrapy is a complete crawling framework, whereas BS is a powerful parsing library (see "Difference between BeautifulSoup and Scrapy crawler?").

Docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Install: pip install beautifulsoup4

For your case:

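# 'html' is the string you provided
from bs4 import BeautifulSoup
# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(html, "html.parser")
# get_text() concatenates all descendant text, including the text inside <a>
res = [p.get_text().strip() for p in soup.find_all('p')]

Result: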

[u'Hello, can I get this paragraph content without this Google link?']
