Question

I'm trying to use Scrapy to get information from an arXiv page, but I can't select the "item" elements from the XML page:

from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector

class arXivSpider(BaseSpider):
    name = "arxiv"
    allowed_domains = ["arxiv.org"]
    start_urls = ["http://export.arxiv.org/rss/hep-th/recent"]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        papers = xxs.select('//item')
        print papers

The item elements are very simple, if only I could extract them...

<item rdf:about="http://arxiv.org/abs/1112.5754">
<title>blah blah ... blah</title>
<link>http://arxiv.org/abs/1112.5754</link>
<description rdf:parseType="Literal"><p>...</p></description>
<dc:creator>blah, blah blah</dc:creator>
</item>

The script runs without errors, except that papers = [], so the spider collects no items. It probably has something to do with namespaces...

Answer 1

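"It probably has something to do with namespaces..."

Yes.

The feed declares http://purl.org/rss/1.0/ as its default namespace, so an unprefixed //item matches nothing. XmlXPathSelector can handle this if you register the namespace under a prefix (see the examples in the documentation). In your case: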

$ scrapy shell http://export.arxiv.org/rss/hep-th/recent
In [1]: xxs.register_namespace('g', 'http://purl.org/rss/1.0/')

In [2]: xxs.namespaces
Out[2]: {'g': 'http://purl.org/rss/1.0/'}

In [3]: xxs.select('//item')
Out[3]: []

In [4]: xxs.select('//g:item')
Out[4]:
[<XmlXPathSelector xpath='//g:item' data=u'<item xmlns="http://purl.org/rss/1.0/" x'>,
 <XmlXPathSelector xpath='//g:item' data=u'<item xmlns="http://purl.org/rss/1.0/" x'>,
...
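
The same behaviour can be reproduced outside Scrapy. Here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in (it is not XmlXPathSelector) on a hand-written sample modeled on the item from the question. The sample feed, the `ns` mapping and the `g` and `dc` prefixes are assumptions made for illustration; the prefix names are arbitrary, only the namespace URIs matter.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modeled on the question's item (not real arXiv output).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://arxiv.org/abs/1112.5754">
    <title>blah blah ... blah</title>
    <link>http://arxiv.org/abs/1112.5754</link>
    <dc:creator>blah, blah blah</dc:creator>
  </item>
</rdf:RDF>
"""

root = ET.fromstring(SAMPLE)

# Prefix-to-URI mapping; 'g' plays the same role as in the shell session above.
ns = {
    "g": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Unqualified name: <item> lives in the default namespace, so nothing matches.
unqualified = root.findall(".//item")

# Qualified name: the prefix is resolved to the namespace URI, so the item is found.
qualified = root.findall(".//g:item", ns)

# Child elements need their own prefixes as well.
titles = [item.findtext("g:title", namespaces=ns) for item in qualified]
creators = [item.findtext("dc:creator", namespaces=ns) for item in qualified]

print(len(unqualified), len(qualified), titles, creators)
```

Note that every step of the path needs a prefix, including children such as title and dc:creator, which is easy to forget once //g:item starts returning results.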

Answer 2

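I think you should try experimenting in the scrapy shell. Start it with:

scrapy shell 'http://export.arxiv.org/rss/hep-th/recent'

Then strip the namespaces and query with plain element names:

sel.remove_namespaces()
a = sel.xpath('//title/text()')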
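
To show what stripping namespaces buys you, here is a rough sketch of the same idea using only the standard library. The `strip_namespaces` helper and the sample feed are hypothetical, written for this illustration; they are not Scrapy's implementation of remove_namespaces() and may differ from it in details.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modeled on the question's item (not real arXiv output).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://arxiv.org/abs/1112.5754">
    <title>blah blah ... blah</title>
    <dc:creator>blah, blah blah</dc:creator>
  </item>
</rdf:RDF>
"""


def strip_namespaces(root):
    """Hypothetical helper: drop the '{uri}' part from every element tag, in place."""
    for element in root.iter():
        # ElementTree stores qualified tags as '{namespace-uri}localname'.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))

# With the namespaces gone, plain element names match again.
items = root.findall(".//item")
titles = [title.text for title in root.iter("title")]
creators = [creator.text for creator in root.iter("creator")]

print(len(items), titles, creators)
```

The trade-off is that elements from different namespaces that share a local name can no longer be told apart, so registering the namespace is the safer choice when a feed mixes vocabularies.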

