Question

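I want to extract text from HTML files using Python. The output should be essentially what I would get if I copied the text from a browser and pasted it into Notepad. I need to use a framework to solve this. For example, start from the page https://en.wikipedia.org/wiki/Main_Page and extract the text of 100 pages without leaving the en.wikipedia.org domain.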

Answer 1

Here is a simple, basic code example that does what you need.

from scrapy import Spider


class Foo(Spider):
    # start urls executed at the beginning
    # with default callback "parse"
    start_urls = ["https://en.wikipedia.org/wiki/Main_Page"]
    name = "basic_spider"

    def parse(self, response):
        # use CSS or XPath selectors to extract text;
        # "::text" returns every text node on the page as a list of strings
        print(response.css("::text").getall())

Save the above as spider.py and run it with:

scrapy runspider spider.py

Start with the Scrapy tutorial. If you find anything in it unclear or in need of improvement, feel free to improve the docs; they are hosted on GitHub.

Of course, you have to learn Python first, so if you are not familiar with it, start there.
