What is the appropriate nokogiri XPath to get a range of rows?

Time: 2016-09-16 22:27:29

Tags: ruby xpath nokogiri

I have a table in the following format:

<tr class="style6"><td>SomeStuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr class="style6"><td>SomeStuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>

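I want to split the rows into blocks that I can iterate over, where each block runs from a row with class style6 to the last row before the next style6 row. Is there a way to split the table into blocks like this? I know about the XPath position function, but I'm not sure it makes sense in this case.

Any ideas?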

1 answer:

Answer 0: (score: -1)

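A useful pattern is to count the preceding <tr class="style6"> sibling rows: every row in the n-th group has exactly n such rows before it.

For the first group in your example, that would be: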

//tr[not(@class="style6")][count(preceding-sibling::tr[@class="style6"])=1]

For the second group:

//tr[not(@class="style6")][count(preceding-sibling::tr[@class="style6"])=2]

I don't use nokogiri, so here is an example using Python and lxml:

>>> import lxml.html
>>> from pprint import pprint

>>> doc = lxml.html.fromstring('''<tr class="style6"><td>SomeStuff</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr class="style6"><td>SomeStuff</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr class="style6"><td>SomeStuff</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>''')

>>> pprint(list(lxml.html.tostring(row)
...            for row in doc.xpath('''
...                 //tr[not(@class="style6")]
...                     [count(preceding-sibling::tr[@class="style6"])=1]''')))
[b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n']
>>> pprint(list(lxml.html.tostring(row)
...            for row in doc.xpath('''
...                 //tr[not(@class="style6")]
...                     [count(preceding-sibling::tr[@class="style6"])=2]''')))
[b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n']
>>> pprint(list(lxml.html.tostring(row)
...            for row in doc.xpath('''
...                 //tr[not(@class="style6")]
...                     [count(preceding-sibling::tr[@class="style6"])=3]''')))
[b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>']
>>>
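
If the goal is to iterate over every group without knowing in advance how many there are, an alternative to evaluating one XPath expression per group is a single pass over the rows that opens a new group at each style6 row. The following is a minimal sketch of that idea using only Python's standard library rather than nokogiri or lxml. The sample markup, the wrapping <table> element, and the group_rows helper are assumptions made for illustration; they are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, wrapped in <table> so it parses as well-formed XML.
MARKUP = """<table>
<tr class="style6"><td>SomeStuff</td></tr>
<tr><td>Some other stuff group 1</td></tr>
<tr><td>Some other stuff group 1</td></tr>
<tr class="style6"><td>SomeStuff</td></tr>
<tr><td>Some other stuff group 2</td></tr>
</table>"""


def group_rows(table, header_class="style6"):
    """Split the table's <tr> children into lists, one list per header row."""
    groups = []
    for row in table.findall("tr"):
        if row.get("class") == header_class:
            # A header row opens a new (initially empty) group.
            groups.append([])
        elif groups:
            # A body row joins the most recently opened group.
            groups[-1].append(row)
        # Rows before the first header belong to no group, which matches
        # the XPath version (their preceding-header count is 0).
    return groups


table = ET.fromstring(MARKUP)
groups = group_rows(table)
for number, rows in enumerate(groups, start=1):
    print(number, [row.findtext("td") for row in rows])
```

Each entry of groups holds the rows that the XPath predicate count(preceding-sibling::tr[@class="style6"])=n would select for the same n. The same loop should carry over to nokogiri by iterating the tr nodes and checking each node's class attribute; in Ruby, Enumerable#slice_before expresses a similar grouping, although that variant keeps the header row inside each group.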