Scraping papers from the ICML proceedings

Time: 2019-06-18 07:17:27

Tags: python scrapy

I want to use Scrapy to retrieve papers from the ICML proceedings, and my code is

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

from scrapy.item import Item, Field


class PapercrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    pdf = Field()
    sup = Field()

spider.py

from scrapy import Spider
from scrapy.selector import Selector
from PaperCrawler.items import PapercrawlerItem


class PaperCrawler(Spider):
    name = "PaperCrawler"
    allowed_domains = ["proceedings.mlr.press"]
    start_urls = ["http://proceedings.mlr.press/v97/", ]

    def parse(self, response):
        papers = Selector(response).xpath('//*[@id="content"]/div/div[2]')

        titles = Selector(response).xpath('//*[@id="content"]/div/div[2]/p[1]')
        pdfs = Selector(response).xpath('//*[@id="content"]/div/div[2]/p[3]/a[2]')
        sups = Selector(response).xpath('//*[@id="content"]/div/div[2]/p[3]/a[3]')

        for title, pdf, sup in zip(titles, pdfs, sups):
            item = PapercrawlerItem()
            item['title'] = title.xpath('text()').extract()[0]
            item['pdf'] = pdf.xpath('@href').extract()[0]
            item['sup'] = sup.xpath('@href').extract()[0]
            yield item

However, it only returns the content of the first paper. I want to scrape all of the papers at the link. How can I fix this?

[
{"title": "AReS and MaRS Adversarial and MMD-Minimizing Regression for SDEs", "pdf": "http://proceedings.mlr.press/v97/abbati19a/abbati19a.pdf", "sup": "http://proceedings.mlr.press/v97/abbati19a/abbati19a-supp.pdf"}
]

1 answer:

Answer 0 (score: 1)

The problem lies in div/div[2]. Because you specified a particular div index, the crawler does not iterate. Instead, use a selector that matches every paper's div, e.g. div[@class="paper"]. With that change the code works correctly.

Here is the corrected code:

class PaperCrawler(Spider):
    name = "PaperCrawler"
    allowed_domains = ["proceedings.mlr.press"]
    start_urls = ["http://proceedings.mlr.press/v97/", ]

    def parse(self, response):
        papers = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]')

        titles = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]/p[1]')
        pdfs = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]/p[3]/a[2]')
        sups = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]/p[3]/a[3]')

        for title, pdf, sup in zip(titles, pdfs, sups):
            item = PapercrawlerItem()
            item['title'] = title.xpath('text()').extract()[0]
            item['pdf'] = pdf.xpath('@href').extract()[0]
            item['sup'] = sup.xpath('@href').extract()[0]
            yield item

The remaining problem (papers without a supplementary file) can be solved by iterating over the papers and checking the length of sup.