I'm new to Python, but I need to do some scraping for work-related reasons. I've spent a week or two with Scrapy and am finally fairly happy with it, except that the code below, instead of outputting one row of data, repeats it five times. Here's an example (using just 1 URL):
import scrapy
class AdamSmithInstituteSpider(scrapy.Spider):
    name = "adamsmithinstitute"
    start_urls = [
        "https://www.adamsmith.org/research?month=March-2018",
    ]

    def parse(self, response):
        for quote in response.css('div.post'):
            yield {
                'author': response.css('post-author::text').extract(),
                'pdfs': response.selector.xpath('//div/div/div/div/div/div/div/p/a').extract(),
            }

        next_page = response.css("div.older a::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
The output in the scrapy shell looks like this:
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:13 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
I know the data is messy, since I really only want the href links, but I can sort that out myself. What I can't get rid of is the repetition.
Any help would be greatly appreciated.
Answer 0 (score: 0)
Scrapy only deduplicates URLs; when the same item is scraped more than once, it is up to the developer to drop the duplicates.
Scrapy's documentation covers this with a duplicates-filter item pipeline; see the duplicates filter example in the Item Pipeline docs.
In that example they use the item's id as the unique key; in your case the unique field may well be something else.
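For reference, here is a minimal sketch of such a pipeline, assuming the 'pdfs' field is what makes an item unique (that choice, and the module path in the settings note below, are assumptions, not part of the original answer):

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Drop items whose unique key has already been seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Assumption: the 'pdfs' field identifies an item uniquely;
        # swap in whichever field is actually unique in your data.
        key = tuple(item.get('pdfs', []))
        if key in self.seen:
            raise DropItem("Duplicate item found: %r" % (item,))
        self.seen.add(key)
        return item

To enable it, register the class under ITEM_PIPELINES in settings.py, e.g. {'myproject.pipelines.DuplicatesPipeline': 300}, where 'myproject.pipelines' stands in for your own module path.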
Answer 1 (score: 0)
You are using ABSOLUTE paths in your CSS and XPath expressions, so every expression searches the WHOLE document (from the root) rather than the current post. You need to apply the expressions to quote instead:
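As a rough sketch, the parse method could look like this (the '.post-author' class selector and the relative XPath are assumptions based on the output above; adjust them to the actual page markup):

    def parse(self, response):
        for quote in response.css('div.post'):
            yield {
                # Selectors called on `quote` search only within the current post.
                'author': quote.css('.post-author::text').extract(),
                # './/' keeps the XPath relative to this post, and '@href' grabs just the link.
                'pdfs': quote.xpath('.//p/a/@href').extract(),
            }

        next_page = response.css("div.older a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

This way each div.post yields exactly one item built from that post alone, instead of one identical item per post built from the whole page.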