Question

I wrote the following script to scrape data from this site:

import scrapy

class MySpider(scrapy.Spider):
    name = 'jobs'
    start_urls = ['https://www.freelancer.in/jobs/python_web-scraping_web-crawling/']

    def parse(self, response):

        for title in response.xpath('//div[@class = "JobSearchCard-primary-heading"]//a'):
            yield{
                'title' : title.xpath('a/text()').extract_first()
            }

However, when I run it, I get an empty file that contains nothing but the header. Why is that?

Answer 1

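Your inner XPath matches nothing, so extract_first() returns None. The loop variable title is already the a element, so a/text() looks for another a nested inside it. It should probably be: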

'title' : title.xpath('text()').extract_first()

You can also strip the surrounding whitespace:

'title' : title.xpath('text()').extract_first(default='').strip()

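default='' makes extract_first() return an empty string instead of None when the selector matches nothing, so the .strip() call does not raise an exception.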
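
To see why the default matters, here is a plain-Python sketch with no Scrapy involved. The helper first_or_default is a hypothetical stand-in, not part of Scrapy; it only mimics how extract_first() hands back None (or the supplied default) when nothing matched. The sample title string is made up.

```python
def first_or_default(values, default=None):
    """Hypothetical stand-in for extract_first(): first item, else `default`."""
    return values[0] if values else default


no_match = []  # what a selector that matched nothing would extract

# Without a default the result is None, and None has no .strip() method.
try:
    first_or_default(no_match).strip()
    failed = False
except AttributeError:
    failed = True

# With default='' the chain is safe and simply yields an empty string.
safe_title = first_or_default(no_match, default='').strip()

# When something did match, the default is ignored and the text is cleaned.
found_title = first_or_default(['   Build a crawler   '], default='').strip()
```

So the default changes nothing for matching nodes; it only turns the no-match case into an empty title instead of a crash.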

Answer 2

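Give this a try; it should return the expected titles from that page. The XPath you defined is faulty. Also, each string is padded with a lot of whitespace, so you need to .strip() it as well. The script below should give you clean output.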

import scrapy

class MySpider(scrapy.Spider):
    name = 'jobs'
    start_urls = ['https://www.freelancer.in/jobs/python_web-scraping_web-crawling/']

    def parse(self, response):

        for title in response.xpath('//*[@class="JobSearchCard-primary-heading-link"]/text()').extract():
            yield{
                'title' : title.strip()
            }

Answer 3

Try this:

import scrapy

class MySpider(scrapy.Spider):
    name = 'jobs'
    start_urls = ['https://www.freelancer.in/jobs/python_web-scraping_web-crawling/']

    def parse(self, response):
        for title in response.xpath('//div[@class = "JobSearchCard-primary-heading"]//a'):
            yield {
                'title' : title.xpath('./text()').extract_first(default='').strip()
            }

The inner XPath should be relative to the node being looped over.

Scrapy is not scraping the data
