I'm new to Python, but I need to do some scraping for work-related reasons. I've spent a week or two with Scrapy and am finally fairly happy with it, except that the code below, instead of outputting one row of data, repeats it five times. Here's an example (using just 1 URL):
import scrapy
class AdamSmithInstituteSpider(scrapy.Spider):
    name = "adamsmithinstitute"
    start_urls = [
        "https://www.adamsmith.org/research?month=March-2018",
    ]

    def parse(self, response):
        for quote in response.css('div.post'):
            yield {
                'author': response.css('post-author::text').extract(),
                'pdfs': response.selector.xpath('//div/div/div/div/div/div/div/p/a').extract(),
            }

        next_page = response.css("div.older a::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
The output in the scrapy shell looks like this:
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
2018-07-10 11:53:13 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read
the full paper</a>']}
I know the data is messy, since I really only want the href links, but I can sort that out myself. What I can't get rid of is the repetition.
Any help would be greatly appreciated.
Answer 0 (score: 0)
Scrapy only deduplicates URLs; when the same item is scraped more than once, it is up to the developer to drop the duplicates.
Scrapy's documentation covers this with a duplicates-filter item pipeline; see the duplicates filter example in the Item Pipeline docs.
In that example they use the item's id as the unique key; in your case the unique field may well be something else.
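For reference, here is a minimal sketch of such a pipeline, assuming the 'pdfs' field is what makes an item unique (that choice, and the module path in the settings note below, are assumptions, not part of the original answer):

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Drop items whose unique key has already been seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Assumption: the 'pdfs' field identifies an item uniquely;
        # swap in whichever field is actually unique in your data.
        key = tuple(item.get('pdfs', []))
        if key in self.seen:
            raise DropItem("Duplicate item found: %r" % (item,))
        self.seen.add(key)
        return item

To enable it, register the class under ITEM_PIPELINES in settings.py, e.g. {'myproject.pipelines.DuplicatesPipeline': 300}, where 'myproject.pipelines' stands in for your own module path.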
Answer 1 (score: 0)
You are using ABSOLUTE paths in your CSS and XPath expressions, so every expression searches the WHOLE document (from the root) rather than the current post. You need to apply the expressions to quote instead:
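As a rough sketch, the parse method could look like this (the '.post-author' class selector and the relative XPath are assumptions based on the output above; adjust them to the actual page markup):

    def parse(self, response):
        for quote in response.css('div.post'):
            yield {
                # Selectors called on `quote` search only within the current post.
                'author': quote.css('.post-author::text').extract(),
                # './/' keeps the XPath relative to this post, and '@href' grabs just the link.
                'pdfs': quote.xpath('.//p/a/@href').extract(),
            }

        next_page = response.css("div.older a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

This way each div.post yields exactly one item built from that post alone, instead of one identical item per post built from the whole page.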