Question

I'm new to Python and am trying to build a script that will eventually extract page titles and […]s from specified URLs into a .csv in a format I specify.

I managed to get the spider working in CMD using the following:

response.xpath("/html/head/title/text()").get()

So the XPath must be correct.

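Unfortunately, when I run the file containing my spider, it never seems to work properly. I think the problem is in the last block of code, and all the guides I've followed seem to use CSS. I'm more comfortable with XPath because you can copy and paste it directly from Dev Tools.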

import scrapy
class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com",
        "http://www.example.com/blog"]

def parse(self, response):
    for title in response.xpath("/html/head/title/text()"):
        yield {
        "title": sel.xpath("Title a::text").extract_first()
        }

I'm hoping that, once it works, it will give me the page titles of the URLs above.

Answer 1

First, the second URL in your self.start_urls is invalid and returns a 404, so you will only end up extracting one title.

Second, you need to read more about selectors. You already extracted the title in your test in the shell, but got confused when using it in the spider.

Scrapy will call the parse method for each URL in self.start_urls, so you don't need to iterate through titles: there is only one per page.

You can also access the <title> tag directly by using // at the start of the XPath expression. See this text copied from W3Schools:

/   Selects from the root node
//  Selects nodes in the document from the current node that match the selection no matter where they are

Here is the fixed code:

import scrapy

class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com"
    ]

    def parse(self, response):
        yield {
            "title": response.xpath('//title/text()').extract_first()
        }

How to: get Python Scrapy to run a simple XPath retrieval
