Duplicated data with Scrapy

Date: 2016-02-03 19:14:20

Tags: python xpath scrapy

I am trying to get the articles, authors, links, ISSN/ISBN and year of the publications listed on this site:

http://eprints.bbk.ac.uk/view/subjects/csis.html

using a Scrapy spider, quoted piecemeal in the answers below, that loops with an iterator `i` over selectors obtained from an absolute XPath, collects the titles into a `publicaciones` array, and fills fields such as `publicacion['anio_publicacion']` by applying a regex to `response.xpath("//div...")`.

This works, but for some reason the year and the ISSN/ISBN are always the same:

[screenshot: scraped items with the same year and ISSN/ISBN repeated for every publication]

Apart from that problem, as you can see, the ISBN/ISSN has a different format in some publications, or is missing entirely. How can I define the XPath so that it handles all the formats?

2 Answers:

Answer 0 (score: 0)

After the part with if i == 0:, you no longer need to repeat the xpath on the response: the results are already stored in the publicaciones array. So don't apply your regex on publicacion['anio_publicacion'] = response.xpath("//div...; apply it on publicacion['anio_publicacion'] = publicaciones[i]..., since you declare the iterator but never use it for the publication. In a case like this I would also suggest building an array of your years and then iterating over it; as it stands you never build such an array, so you cannot really index into it.
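The pattern this answer describes can be sketched with plain lists standing in for the extract() results (the names publicaciones and anios mirror the question's Spanish variable names and are assumptions):

```python
import re

# Stand-ins for what response.xpath(...).extract() would return once,
# before the loop: parallel lists of titles and year strings.
publicaciones = ['Title A', 'Title B', 'Title C']
anios = ['(2014)', '(2012)', '(2016)']

items = []
for i, titulo in enumerate(publicaciones):
    # index with i instead of re-querying the whole response:
    # each publication gets its own year, not always the first one
    match = re.search(r'\d{4}', anios[i])
    items.append({'titulo': titulo,
                  'anio_publicacion': match.group() if match else None})
```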

Try to get rid of the repetition: your code will be easier to read, follow and correct, and with too much repetition it is easy to get frustrated. Another tip: you don't need to start your xpath from the root of the document, which makes it shorter and, again, easier to debug (start using ./ expressions instead of // in your xpaths). So, for example, you can declare at the start:

storedResponse = response.xpath("//div[@class='ep_view_page ep_view_page_view_subjects']")

and then do, for example:

for sel in storedResponse:
    publicaciones = sel.xpath("./p/a/text()").extract() #publicacion

If you process ISBN and ISSN the same way, be careful: the regex with dashes (r'\d\d\d\d-\d\d\d\d') matches only your ISSNs, not your ISBNs.
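One way to cover both formats is an alternation of two patterns, sketched here with the stdlib re module (the sample strings are invented stand-ins for the page's text):

```python
import re

# alternation: an ISBN followed by plain digits, or a dash-separated ISSN
ISXN_RE = re.compile(r'(ISBN\s+\d+|ISSN\s+\d{4}-\d{4})')

samples = [
    'London: Springer. ISBN 9783642289965',
    'Neural Networks 54, pp. 17-37. ISSN 0893-6080',
    'Technical report, no identifier here',
]

# None for publications that carry no ISxN at all
matches = [m.group(1) if (m := ISXN_RE.search(s)) else None for s in samples]
```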

Answer 1 (score: 0)

You are using absolute XPath expressions on sel, which is itself an element selected with another absolute XPath expression (//...). This means they will always select the same elements on every iteration, instead of selecting relative to sel.

Relative XPath expressions (.//...) are the way to go in loops like this. That is the first thing to fix.
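The scoping behaviour can be illustrated with the stdlib xml.etree.ElementTree (Scrapy selectors behave analogously for relative paths; the markup is a made-up miniature of the listing page):

```python
import xml.etree.ElementTree as ET

# a made-up miniature of the page: two publication blocks
doc = ET.fromstring(
    '<div>'
    '<p><a>First title</a></p>'
    '<p><a>Second title</a></p>'
    '</div>'
)

# relative paths ('./a') are evaluated against each <p>, so every
# iteration sees only its own publication; in Scrapy, an absolute
# sel.xpath('//a') inside the loop would instead search the whole
# document on every pass and return the same nodes each time
titles = [p.find('./a').text for p in doc.findall('./p')]
```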

As a bonus, and for illustration, here is a commented, simple spider doing what you need (I think):

import scrapy

class PublicationSpider(scrapy.Spider):

    name = 'pubspider'
    start_urls = ('http://eprints.bbk.ac.uk/view/subjects/csis.html',)

    def parse(self, response):

        # each publication is within a <p> element, let's loop on those
        for publication in response.css('div > div.ep_tm_page_content > div.ep_view_page.ep_view_page_view_subjects > p'):

            # the publication title is inside a <a> link, which also contains a URL
            for title in publication.xpath('./a'):
                pubtitle = title.xpath('normalize-space(.)').extract_first()
                publink = title.xpath('@href').extract_first()
                break
            # use a regex to find year digits inside brackets
            pubyear = publication.xpath('./text()').re_first(r'\((\d+)\)')

            # get text nodes from <span> before the link
            authors = publication.xpath('./span[@class="person_name"][./following-sibling::a]/text()').extract()

        # get text nodes after the link and use a regex matching either ISBN or ISSN;
        # take the first result with re_first()
        # this is None if there is no ISxN
            isxn = publication.xpath('./a/following-sibling::text()').re_first(r'(ISBN\s+\d+|ISSN\s+\d+-\d+)')
            yield {
                'title': pubtitle,
                'link': publink,
                'year':  pubyear,
                'authors': authors,
                'isxn': isxn
            }
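As a quick sanity check, the year regex from the spider can be exercised in isolation with stdlib re (the sample string is an invented stand-in for a publication's text node; re_first returns the first captured group or None, which re.search reproduces outside Scrapy):

```python
import re

# year digits inside brackets, as matched on './text()' of a publication <p>
sample = 'Adam, S.P. and Karras, D.A. (2014) Solving the linear interval tolerance problem.'
m = re.search(r'\((\d+)\)', sample)
pubyear = m.group(1) if m else None
```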

You can run this spider with scrapy runspider:

$ scrapy runspider 35185701.py 
2016-02-03 22:16:25 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-03 22:16:26 [scrapy] DEBUG: Crawled (200) <GET http://eprints.bbk.ac.uk/view/subjects/csis.html> (referer: None)
2016-02-03 22:16:26 [scrapy] DEBUG: Scraped from <200 http://eprints.bbk.ac.uk/view/subjects/csis.html>
{'authors': [u'Adam, S.P.', u'Karras, D.A.', u'Magoulas, George D.', u'Vrahatis, M.N.'], 'year': u'2014', 'link': u'http://eprints.bbk.ac.uk/13757/', 'isxn': u'ISSN 0893-6080', 'title': u'Solving the linear interval tolerance problem for weight initialization of neural networks.'}
...
2016-02-03 22:16:27 [scrapy] DEBUG: Scraped from <200 http://eprints.bbk.ac.uk/view/subjects/csis.html>
{'authors': [u'Zuccon, G.', u'Azzopardi, L.', u'Zhang, Dell', u'Wang, J.'], 'year': u'2012', 'link': u'http://eprints.bbk.ac.uk/7099/', 'isxn': u'ISBN 9783642289965', 'title': u'Top-k retrieval using facility location analysis.'}
2016-02-03 22:16:27 [scrapy] INFO: Closing spider (finished)
2016-02-03 22:16:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 238,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 434181,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 3, 21, 16, 27, 657239),
 'item_scraped_count': 794,
 'log_count/DEBUG': 796,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 2, 3, 21, 16, 25, 985416)}
2016-02-03 22:16:27 [scrapy] INFO: Spider closed (finished)