I am trying to get the articles, authors, links, ISSN/ISBN and year published on this site:
http://eprints.bbk.ac.uk/view/subjects/csis.html
using the following code:
<system.serviceModel>
  <bindings>
    <basicHttpBinding>
      <binding name="WSBinding" messageEncoding="Mtom" maxReceivedMessageSize="100000000">
        <security mode="Transport">
          <transport clientCredentialType="None" proxyCredentialType="None" realm="" />
          <message clientCredentialType="Certificate" algorithmSuite="Default" />
        </security>
      </binding>
    </basicHttpBinding>
  </bindings>
  <client>
    <endpoint address="http://192.168.40.149:80/WSJava/WS"
              binding="basicHttpBinding" bindingConfiguration="WSBinding"
              contract="WSGesAl.WS" name="WSContract" />
  </client>
</system.serviceModel>
which works, but for some reason the year next to the ISSN/ISBN is always the same.
Besides that problem, as you can see, the ISBN/ISSN has a different format in some publications, or is even missing. How can I define an XPath that handles all the formats?
Answer 0 (score: 0)
After the section with `if i == 0:` you no longer need to repeat the xpath on the response, since the results are already stored in the `publicaciones` array. So don't apply the regular expression on `publicacion['anio_publicacion'] = response.xpath("//div...` but on `publicacion['anio_publicacion'] = publicaciones[o]...`, because you declare an iterator but never use it for the publication. In a case like this I would suggest building an array of your years and then iterating over it; as it stands you never build that iterator, so you cannot really use it. A minimal sketch of the idea follows below.
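To illustrate, here is a small, self-contained sketch of that idea. The variable names (`publicaciones`, `publicacion`, `anio_publicacion`) follow the question's code, and the sample strings are made up:

import re

# Made-up strings standing in for the already-extracted publication lines.
publicaciones = [
    'Solving the linear interval tolerance problem. (2014) ISSN 0893-6080',
    'Top-k retrieval using facility location analysis. (2012) ISBN 9783642289965',
]

year_re = re.compile(r'\((\d{4})\)')
for i, entry in enumerate(publicaciones):
    publicacion = {}
    match = year_re.search(entry)
    # take the year from this iteration's entry, not from the whole response
    publicacion['anio_publicacion'] = match.group(1) if match else None
    print(i, publicacion)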
Try to get rid of the repetition: your code will be much easier to read, follow and fix. With that much duplication you will get frustrated quickly. Another tip: you don't need to start every xpath from the root of the document, which makes the expressions shorter and, again, easier to debug (start using `//` and `./` in your xpaths). So, for example, you can declare at the start:
storedResponse = response.xpath("//div[@class='ep_view_page ep_view_page_view_subjects']")
and then do, for example:
for sel in storedResponse:
    publicaciones = sel.xpath("./p/a/text()").extract()  # publicacion
Be careful if you handle ISBN and ISSN the same way: a regex with a dash such as `r'\d\d\d\d-\d\d\d\d'` only matches your ISSNs, not the ISBNs. A small illustration follows below.
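As a quick illustration (the sample strings here are invented), an alternation covers both shapes, since ISSNs carry a dash while ISBNs are a plain digit run:

import re

samples = ['ISSN 0893-6080', 'ISBN 9783642289965', 'no identifier here']

# ISBN: "ISBN" followed by digits; ISSN: "ISSN" followed by 4 digits, dash, 4 digits
isxn_re = re.compile(r'ISBN\s+\d+|ISSN\s+\d{4}-\d{4}')
for text in samples:
    match = isxn_re.search(text)
    print(match.group(0) if match else None)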
Answer 1 (score: 0)
You are using absolute XPath expressions on `sel`, which is already an element selected by another absolute XPath expression (`//...`).
This means they will always select the same elements on every iteration, instead of selecting relative to `sel`.
Relative XPath expressions (`.//...`) are the way to go in loops like this. That is the first thing to fix; the sketch below shows the difference.
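Here is a minimal demonstration of absolute versus relative XPath inside a loop, using a tiny made-up HTML snippet:

from scrapy.selector import Selector

html = '<div><p><a>first</a></p><p><a>second</a></p></div>'
sel = Selector(text=html)

for p in sel.xpath('//p'):
    # absolute: searches the whole document again, same result on every iteration
    print(p.xpath('//a/text()').extract())   # ['first', 'second'] both times
    # relative: searches only inside the current <p>
    print(p.xpath('./a/text()').extract())   # ['first'], then ['second']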
As a bonus, and for illustration, here is a commented, simple spider doing what you need (I think):
import scrapy


class PublicationSpider(scrapy.Spider):
    name = 'pubspider'
    start_urls = ('http://eprints.bbk.ac.uk/view/subjects/csis.html',)

    def parse(self, response):
        # each publication is within a <p> element, let's loop on those
        for publication in response.css('div > div.ep_tm_page_content > div.ep_view_page.ep_view_page_view_subjects > p'):

            # the publication title is inside an <a> link, which also contains a URL
            for title in publication.xpath('./a'):
                pubtitle = title.xpath('normalize-space(.)').extract_first()
                publink = title.xpath('@href').extract_first()
                break

            # use a regex to find the year digits inside brackets
            pubyear = publication.xpath('./text()').re_first(r'\((\d+)\)')

            # get text nodes from the <span> elements before the link
            authors = publication.xpath('./span[@class="person_name"][./following-sibling::a]/text()').extract()

            # get text nodes after the link and use a regex matching either ISBN or ISSN;
            # take the first result with re_first()
            # (this can be None if there is no ISxN)
            isxn = publication.xpath('./a/following-sibling::text()').re_first(r'(ISBN\s+\d+|ISSN\s+\d+-\d+)')

            yield {
                'title': pubtitle,
                'link': publink,
                'year': pubyear,
                'authors': authors,
                'isxn': isxn,
            }
You can run this spider with `scrapy runspider`:
$ scrapy runspider 35185701.py
2016-02-03 22:16:25 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-03 22:16:26 [scrapy] DEBUG: Crawled (200) <GET http://eprints.bbk.ac.uk/view/subjects/csis.html> (referer: None)
2016-02-03 22:16:26 [scrapy] DEBUG: Scraped from <200 http://eprints.bbk.ac.uk/view/subjects/csis.html>
{'authors': [u'Adam, S.P.', u'Karras, D.A.', u'Magoulas, George D.', u'Vrahatis, M.N.'], 'year': u'2014', 'link': u'http://eprints.bbk.ac.uk/13757/', 'isxn': u'ISSN 0893-6080', 'title': u'Solving the linear interval tolerance problem for weight initialization of neural networks.'}
...
2016-02-03 22:16:27 [scrapy] DEBUG: Scraped from <200 http://eprints.bbk.ac.uk/view/subjects/csis.html>
{'authors': [u'Zuccon, G.', u'Azzopardi, L.', u'Zhang, Dell', u'Wang, J.'], 'year': u'2012', 'link': u'http://eprints.bbk.ac.uk/7099/', 'isxn': u'ISBN 9783642289965', 'title': u'Top-k retrieval using facility location analysis.'}
2016-02-03 22:16:27 [scrapy] INFO: Closing spider (finished)
2016-02-03 22:16:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 238,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 434181,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 2, 3, 21, 16, 27, 657239),
'item_scraped_count': 794,
'log_count/DEBUG': 796,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 2, 3, 21, 16, 25, 985416)}
2016-02-03 22:16:27 [scrapy] INFO: Spider closed (finished)
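If you also want to keep the scraped items, `scrapy runspider` can write them to a feed file with the `-o` option; for example (the output filename here is just an illustration):

$ scrapy runspider 35185701.py -o publications.json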