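Scrapy repeatedly scrapes duplicate data

Time: 2018-07-10 11:00:57

Tags: python web-scraping scrapy

I'm new to Python, but I need to do some scraping for work-related reasons. I spent a week or two with Scrapy and am finally fairly happy with it, except that the code below, instead of outputting one row of data, repeats it five times. Here is an example (using just one URL):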

import scrapy


class AdamSmithInstituteSpider(scrapy.Spider):
    name = "adamsmithinstitute"
    start_urls = [
        "https://www.adamsmith.org/research?month=March-2018",
    ]

    def parse(self, response):
        for quote in response.css('div.post'):
            yield {
                'author': response.css('post-author::text').extract(),
                'pdfs': response.selector.xpath('//div/div/div/div/div/div/div/p/a').extract(),
            }

        next_page = response.css("div.older a::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The output in the scrapy shell is as follows:

2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:13 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}

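I know the data is messy, since I only want the href link, but I can sort that out myself. What I'm stuck on is the repetition.

Any help would be greatly appreciated.

2 answers: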

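Answer 0: (score: 0)

Scrapy only filters duplicate URLs (requests). It does not deduplicate items, so when an item is repeated the developer has to remove the duplicates.

Scrapy documents a duplicates filter pipeline; see the Item Pipeline section of the Scrapy documentation.

In that example they use an id field as the unique key; in your case the unique key will probably be a different field.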

Answer 1: (score: 0)

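You are using ABSOLUTE paths in your selector expressions, so every expression searches the WHOLE document (from the start) on each pass through the loop, and each div.post yields the same result. You need to apply your expressions to quote instead of response.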