Question

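I built a spider with Scrapy. It crawls a website and collects links. **Technologies used:** Python, Scrapy. **Problem:** the spider returns relative URLs, so it cannot follow them to crawl the pages. I want it to return only absolute URLs. Please help!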

import scrapy
import os
class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'file1.csv'
    }
    filePath = 'file1.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Cannot delete the file because it does not exist")
    start_urls = ['https://www.jamoona.com/']

    def parse(self, response):
        titles = response.xpath("//a/@href").extract()
        for  title in titles:
            yield {'title': title}

Answer 1

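Here is the answer: call response.urljoin() on each extracted href so that relative links are resolved against the URL of the page they came from.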

import os

import scrapy

class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'file1.csv'
    }
    filePath = 'file1.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Cannot delete the file because it does not exist")
    start_urls = ['https://www.jamoona.com/']

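    def parse(self, response):
        # Extract every href on the page; these may be relative paths.
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            # Resolve the href against the page URL to get an absolute URL.
            abs_url = response.urljoin(url)
            yield {'title': abs_url}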
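
To see what the urljoin step does without running a crawl, here is a minimal standalone sketch built on the standard library's `urllib.parse.urljoin`, which Scrapy's `response.urljoin()` wraps. The helper name `to_absolute` and the sample hrefs are made up for illustration and are not taken from the site. Unlike Scrapy's method on HTML responses, this sketch does not look at a `<base>` tag in the page.

```python
from urllib.parse import urljoin


def to_absolute(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on.

    Hypothetical helper for illustration only; inside a spider,
    response.urljoin(href) plays this role.
    """
    return [urljoin(page_url, href) for href in hrefs]


# Sample values are assumptions, not real links from the site.
page_url = "https://www.jamoona.com/"
hrefs = [
    "/collections/all",            # root-relative path
    "pages/about",                 # path relative to the current page
    "https://example.com/other",   # already absolute, left unchanged
]

for absolute in to_absolute(page_url, hrefs):
    print(absolute)
```

If the goal is to keep crawling the extracted links rather than only exporting them, `response.follow(href, callback=self.parse)` should also work, since it accepts relative URLs and builds the request for you.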
