Question

I have this HTML (simplified):

<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
</td>

I want to get a dict structure like this (each "row" stands for the contents of one table, grouped by the dates in the main table):

{'04.09.2013': [1 row, 2 row],
 '05.10.2013': [1 row, 2 row, 3 row, 4 row]}

I can extract all the 'div' elements with:

dt = s.xpath('//div[contains(@class, "button-left")]')

I can extract all the 'table' elements with:

tables = s.xpath('//table[contains(@class, "record generic schedule margin-4")]')

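But I don't know how to associate 'dt' with the corresponding 'tables' in the Scrapy parser. Is it possible to put a condition into the scraping process, such as: if you find a 'div', extract all the following 'table' elements until another 'div' is found?

Using Chrome, I get these two example XPaths for the elements: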

//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/div[2]
//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/table[1]

Maybe this helps to picture the full structure of the table.

Solution (thanks to @marven):

    from scrapy.selector import Selector

    s = Selector(response)

    table = {}
    current_key = None
    # Walk the direct children of the cell in document order.
    for e in s.xpath('//td[@class="pad10"]/*'):
        # A boolean XPath expression is extracted as the string '1' or '0'.
        if bool(int(e.xpath('@class="button-left"').extract()[0])):
            # A date div starts a new group.
            current_key = e.xpath('text()').extract()[0]
        elif bool(int(e.xpath('@class="record generic schedule margin-4"').extract()[0])):
            # A table belongs to the most recently seen date div.
            table.setdefault(current_key, []).append(e.extract())

Answer 1

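With that specific format, you could do this:

Get the parent element: t = s.xpath('//div[contains(@class, "button-left")]/..')

Get the first div: t.xpath('./div[1]') - you may have to use position()=1

Get the first two rows: t.xpath('./table[position() < 3]')

Get the second div: t.xpath('./div[2]')

Get the remaining tables: t.xpath('./table[position() > 2]')

This is very brittle: if the HTML changes, this code will stop working. It is hard to answer the question from the simplified HTML you provided, without knowing whether the structure is static or will change in the future. I would ask these questions in a comment, but I don't have enough rep :P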

Sources:

How to read attribute of a parent node from a child node in XSLT

What is the xpath to select a range of nodes?

https://stackoverflow.com/a/2407881/2368836

Answer 2

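See whether this approach works for your case: XPATH get all nodes between text_1 and text_2

Using the same approach as in the linked question above, we can filter the <table> elements down to those that have the specific <div> both as a preceding sibling and as a following sibling. For example (using the XPath conditions you posted for getting the <table> and <div> elements):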

//table
    [contains(@class, "record generic schedule margin-4")]
    [
        preceding-sibling::div[contains(@class, "button-left")] 
            and 
        following-sibling::div[contains(@class, "button-left")]
    ]
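
Note that the expression above only matches tables that sit between two date divs, so tables after the last div (which has no following date div) are not selected, and it does not say which date a table belongs to. One possible variation, sketched here under the assumption that the date divs and the tables are siblings as in the question, is to count the date divs that precede each table. The helper name `tables_after_nth_div` is hypothetical; the class names are the ones from the question.

```python
# Predicates taken from the question's own XPath conditions.
DIV = 'div[contains(@class, "button-left")]'
TABLE = 'table[contains(@class, "record generic schedule margin-4")]'


def tables_after_nth_div(n):
    """Build an XPath for the tables that follow the n-th date div (1-based)
    and precede the next one, by counting preceding date-div siblings."""
    return '//{table}[count(preceding-sibling::{div}) = {n}]'.format(
        table=TABLE, div=DIV, n=n
    )


first_group = tables_after_nth_div(1)
second_group = tables_after_nth_div(2)
print(first_group)
print(second_group)
```

Each string can be passed to `s.xpath(...)` in the same way as the expressions in the question, for example `s.xpath(tables_after_nth_div(2))` for the tables under the second date.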

Answer 3

What you can do is select all the nodes and loop through them, checking whether the current node is a div or a table.

Using this as my test case,

<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>

I use the following to iterate over the nodes and keep track of the div that the current node is "under".

currdiv = None
mydict = {}
for e in sel.xpath('//div[@class="asdf"]/*'):
    if bool(int(e.xpath('@class="button-left"').extract()[0])):
        currdiv = e.xpath('text()').extract()[0]
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv] += e.xpath('text()').extract()

This results in:

{u'04.09.2013': [u'1', u'2'], u'05.10.2013': [u'3', u'4', u'5', u'6']}
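
To try the same grouping logic without a Scrapy response, here is a minimal sketch that swaps Scrapy's selector for the standard library's `xml.etree.ElementTree`. It assumes the markup is well-formed XML, which holds for the test case above but often not for real pages; the function name `group_tables_by_date` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The test case from this answer; well-formed, so the stdlib XML parser can read it.
MARKUP = """
<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>
"""


def group_tables_by_date(markup):
    """Map the text of each date div to the text of the tables that follow it."""
    root = ET.fromstring(markup.strip())
    grouped = {}
    current = None
    for child in root:  # direct children, in document order
        classes = child.get("class", "").split()
        if child.tag == "div" and "button-left" in classes:
            # A date div starts a new group.
            current = child.text
            grouped[current] = []
        elif child.tag == "table" and current is not None:
            # A table belongs to the most recently seen date div.
            grouped[current].append(child.text)
    return grouped


result = group_tables_by_date(MARKUP)
print(result)
```

The loop has the same shape as the Scrapy version: one pass over the siblings, with `current` remembering the last date seen, so tables before the first date div are simply skipped.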

How to scrape tables with different XPaths at the same level using Scrapy?
