Question

I want to grab the title of an event.

To do this, I wrote the following XPath expressions, none of which works:

response.xpath('//h1/@title').extract()

response.xpath('//id/class/h1/@title').extract()

response.xpath('//*[@class ="pd-lr-10 span9"]/h1/@title').extract()

response.xpath('//*[@class = "banner-container"]/h2').extract()

response.xpath('//*[@class = "overlay-h1"]/@title').extract()

All of the commands above return an empty list.

Answer 1

Try this XPath to get the title:

response.xpath("//h1[@class='overlay-h1']/text()").extract_first()

Here is how you can run it from any IDE:

import scrapy
from scrapy.crawler import CrawlerProcess

class AlleventsTestSpider(scrapy.Spider):
    name = 'titlegrabber'
    start_urls = ['https://allevents.in/kolkata/gourmet-cookies-workshop-on-21st-april/1649973561753390']

    def parse(self, response):
        title_one = response.xpath("//h1[@class='overlay-h1']/@title").extract_first()
        title_two = response.xpath("//h1[@class='overlay-h1']/text()").extract_first()

        yield {
            "TitleOne": title_one,
            "TitleTwo": title_two,
        }

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(AlleventsTestSpider)
c.start()

Answer 2

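These all work. The empty results were probably caused by a 503 error. In scrapy shell, use view(response) to check whether you actually received the page. After that, you can use any one of these selectors.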

response.xpath('//*[@class ="pd-lr-10 span9"]/h1/@title').extract()

response.xpath('//*[@class = "overlay-h1"]/@title').extract()

response.xpath('//h1/@title').extract()

Note: if you have not set a USER_AGENT in your settings file, doing so may help. Alternatively, you can change your IP address.
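
For a project-based spider, the user agent would go in settings.py. The fragment below is only a sketch: the User-Agent string is the one Answer 1 passes to CrawlerProcess, while the delay and retry values are assumptions that may need tuning for the site.

```python
# settings.py of a Scrapy project (values below are assumptions).

# Browser-like User-Agent, the same value Answer 1 passes to CrawlerProcess.
USER_AGENT = "Mozilla/5.0"

# Seconds to wait between requests, to reduce the chance of being throttled.
DOWNLOAD_DELAY = 1

# Retry transient server errors such as 503 a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504]
```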

Data points cannot be scraped with scrapy and python
