I'm new to XPath. I'm trying to use XPath in Python to parse some data.
I'm parsing the following HTML:
<table>
<tr>
<td class="DT">29-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="SomeClass">Some other text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="SomeOtherClass">Some more text</td>
</tr>
<tr>
<td class="DT">22-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsAm">more text</td>
</tr>
<tr>
<td class="DT">30-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsBr">Some other Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsBr">More Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsBr">Some different text</td>
</tr>
</table>
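I need all the <tr> following siblings of a <tr> that has values in its <td> cells, but only up to the next <tr> that has values in all of its <td> cells.
E.g. assume my current position is the first <tr>; I need these table cells:
<td class="SomeClass">Some other text</td>
<td class="SomeOtherClass">Some more text</td>
Assuming my current position is the 4th row of the table:
<tr>
<td class="DT">22-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
I only need:
<td class="OmsAm">more text</td>
This is the XPath I was using to get the siblings, but it returns all the following siblings instead of stopping at the next row that has values:
./following-sibling::tr/td[1][not(text()[1])]/..
I think I have to implement the Kayessian method, but I don't understand how to apply it in my case. Any help would be much appreciated!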
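Answer 0 (score: 0)
I may be misunderstanding the question, but if, for each <tr><td class="DT">xx-xx-xx</td> row, you want all the <tr> rows after it and before the next <tr><td class="DT">xx-xx-xx</td> row, one pattern is to loop over these "boundary" <tr><td class="DT">xx-xx-xx</td> elements and select the following sibling rows, with a condition on how many "boundaries" were found before.
Let's illustrate with lxml. First, we create a document from your sample input:
>>> import lxml.html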
>>> t = '''<table>
... <tr>
... <td class="DT">29-04-14</td>
... <td class="Regio">Text</td>
... <td class="Md">Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="SomeClass">Some other text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="SomeOtherClass">Some more text</td>
... </tr>
... <tr>
... <td class="DT">22-04-14</td>
... <td class="Regio">Text</td>
... <td class="Md">Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsAm">more text</td>
... </tr>
... <tr>
... <td class="DT">30-04-14</td>
... <td class="Regio">Text</td>
... <td class="Md">Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsBr">Some other Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsBr">More Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsBr">Some different text</td>
... </tr>
... </table>'''
>>> doc = lxml.html.fromstring(t)
With the document created from the sample input, let's now count these <tr><td class="DT">xx-xx-xx</td> rows:
>>> doc.xpath('//table/tr[td/@class="DT"]')
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00638>]
>>> doc.xpath('count(//table/tr[td/@class="DT"])')
3.0
>>> list(enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1))
[(1, <Element tr at 0x7f948ab00548>), (2, <Element tr at 0x7f948ab005e8>), (3, <Element tr at 0x7f948ab00638>)]
We can loop over these rows and select the rows that come after them in the document (we select the text nodes to "see" what these rows are):
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('./following-sibling::tr/td/text()') )
...
['Some other text', 'Some more text', '22-04-14', 'Text', 'Text', 'more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['Some other Text', 'More Text', 'Some different text']
We're selecting too many rows in each iteration: every row up to the end of the <table>. We need an extra "end" condition on the following rows.
We're counting the tr[td/@class="DT"] rows in the loop, so we can check how many tr[td/@class="DT"] rows precede each candidate row.
For the first group:
row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=1]')
For the second group:
row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=2]')
and so on. In a loop, the current count can be supplied instead of a hard-coded number:
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]', count=cnt) )
...
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ec02f98>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab00638>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00688>]
>>>
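To make it easier to see what the count predicate compares, here is a small sketch that tabulates, for each row, how many "boundary" rows come before it. It is not part of the original answer: it uses the standard-library xml.etree.ElementTree instead of lxml, a shortened copy of the table, and two hypothetical helper names (is_boundary, preceding_boundary_counts). It also assumes the markup is well-formed enough for an XML parser, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's table (assumption: well-formed markup).
HTML = """<table>
<tr><td class="DT">29-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="SomeClass">Some other text</td></tr>
<tr><td class="DT">22-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsAm">more text</td></tr>
</table>"""


def is_boundary(row):
    """Return True if the row contains a <td class="DT"> cell."""
    return any(td.get("class") == "DT" for td in row.findall("td"))


def preceding_boundary_counts(table):
    """Pair each row's first non-empty cell text with the number of
    boundary rows that precede that row."""
    counts = []
    seen = 0
    for row in table.findall("tr"):
        label = next(td.text for td in row.findall("td") if td.text)
        counts.append((label, seen))
        if is_boundary(row):
            seen += 1
    return counts


counts = preceding_boundary_counts(ET.fromstring(HTML))
for label, n in counts:
    print(n, label)
```

In this table, the 22-04-14 row and the "Some other text" row both have exactly one boundary row before them, so a predicate that only compares that count cannot tell them apart.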
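Here $count is an XPath variable (an underrated XPath feature that lxml supports); lxml takes its value from the count=cnt keyword argument.
Hm, we're selecting one row too many in the first two iterations.
That's because the next "boundary" row (for example <tr><td class="DT">30-04-14</td> in the second iteration) has the same number of preceding <tr><td class="DT"> rows, so it satisfies the count predicate too.
We can add an extra predicate to select only the rows that have no <td class="DT">:
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):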
...     print( row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...         [not(td/@class="DT")]''', count=cnt) )
...
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>]
[<Element tr at 0x7f948ab00548>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00688>]
>>>
The number of results for each iteration looks right. A final check with text nodes:
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...         [not(td/@class="DT")]
...         /td/text()''', count=cnt) )
...
['Some other text', 'Some more text']
['more text']
['Some other Text', 'More Text', 'Some different text']
>>>
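For comparison, the same grouping can be done in plain Python with a single pass over the rows, with no XPath count at all. This is a different technique from the one shown in the answer, offered only as a sketch: it uses the standard-library xml.etree.ElementTree rather than lxml, the helper name group_by_boundary is hypothetical, and it assumes the table is well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET

# Compact copy of the question's table (assumption: well-formed markup).
HTML = """<table>
<tr><td class="DT">29-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="SomeClass">Some other text</td></tr>
<tr><td></td><td></td><td class="SomeOtherClass">Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsAm">more text</td></tr>
<tr><td class="DT">30-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some other Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">More Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some different text</td></tr>
</table>"""


def group_by_boundary(table):
    """Return a list of (date, texts) pairs: each boundary row's date,
    followed by the cell texts of the rows up to the next boundary row."""
    groups = []
    for row in table.findall("tr"):
        cells = row.findall("td")
        date_cell = next((td for td in cells if td.get("class") == "DT"), None)
        if date_cell is not None:
            # A boundary row starts a new group.
            groups.append((date_cell.text, []))
        elif groups:
            # A non-boundary row belongs to the most recent boundary.
            groups[-1][1].extend(td.text for td in cells if td.text)
    return groups


groups = group_by_boundary(ET.fromstring(HTML))
for date, texts in groups:
    print(date, texts)
```

The XPath approach is handy when you already have a specific row as the context node, while a single pass like this avoids re-counting preceding siblings for every candidate row.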