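I need to partition elements that are separated by headings. I'm struggling to come up with an XPath expression or a simple parser that groups my items into the sections given by the heading tags.
I understand how to scrape lists where the elements all sit at the same level, or where the grouping is given by a container element, but I can't work out how to parse data where the sections are delimited by sibling elements instead. For example: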
<div>
<h1>section a</h1>
<item>221</item>
<item>453</item>
<item>473</item>
<h1>section b</h1>
<item>430</item>
<item>493</item>
<h1>section c</h1>
<item>694</item>
<item>931</item>
</div>
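Is there an idiomatic way to pick up on this structure with XPath? Is there a way to iterate over a Scrapy selector so that I get a DOM-like view and can detect where each section starts and stops?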
Answer 0: (score: 2)
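One solution using XPath is to count the preceding h1 siblings of each node under the div, keeping only the nodes that are not themselves an h1. Every node in the Nth section has exactly N h1 elements before it among its siblings: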
$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
Type "copyright", "credits" or "license" for more information.
IPython 1.2.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""
<div>
<h1>section a</h1>
<item>221</item>
<item>453</item>
<item>473</item>
<h1>section b</h1>
<item>430</item>
<item>493</item>
<h1>section c</h1>
<item>694</item>
<item>931</item>
</div>""")
In [3]: for i, header in enumerate(selector.xpath('.//div/h1'), start=1):
   ...:     print header.xpath('normalize-space()').extract()
   ...:     between = selector.xpath(""".//div/node()[count(preceding-sibling::h1)=%d]
   ...:                                 [not(self::h1)]""" % i)
   ...:     print between.extract()
   ...:
[u'section a']
[u'\n', u'<item>221</item>', u'\n', u'<item>453</item>', u'\n', u'<item>473</item>', u'\n']
[u'section b']
[u'\n', u'<item>430</item>', u'\n', u'<item>493</item>', u'\n']
[u'section c']
[u'\n', u'<item>694</item>', u'\n', u'<item>931</item>', u'\n']
Answer 1: (score: 0)
var header = null
var items = []
for each element in div
    if element is an h1
        if header is not null
            process previous header, items
        end
        header = the element text
        items = []
    else
        items append element text
    end
end
if header is not null
    process last header, items
end
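The pseudocode above could be sketched in Python roughly as follows. This is only one possible reading of it: it uses the standard library's xml.etree.ElementTree rather than Scrapy, so that it runs without extra dependencies, and the names SAMPLE and group_by_heading are made up for illustration. Instead of calling a "process" step, it collects the sections into a list.

```python
import xml.etree.ElementTree as ET

# Same sample markup as in the question.
SAMPLE = """<div>
<h1>section a</h1>
<item>221</item>
<item>453</item>
<item>473</item>
<h1>section b</h1>
<item>430</item>
<item>493</item>
<h1>section c</h1>
<item>694</item>
<item>931</item>
</div>"""


def group_by_heading(root, heading_tag="h1"):
    """Walk the children in document order and start a new section at each heading."""
    sections = []
    current = None  # (title, items) of the section being filled, if any
    for child in root:
        text = (child.text or "").strip()
        if child.tag == heading_tag:
            current = (text, [])
            sections.append(current)
        elif current is not None:
            # Elements that appear before the first heading are ignored.
            current[1].append(text)
    return sections


sections = group_by_heading(ET.fromstring(SAMPLE))
for title, items in sections:
    print(title, items)
```

A single pass over the children is enough here, since each heading simply closes the previous section and opens the next one. The same loop should carry over to Scrapy by iterating over selector.xpath('.//div/*') and checking each element's tag name.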