Question

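I am trying to scrape text from a website that contains a lot of tables. Eventually I want it to work across multiple pages that share the same layout. The problem is that the XPath to the table can change: on one page the information I need might be at table 3, row 4, and on another page it might be at table 2, row 5. How do I write an XPath that selects the table if it contains a certain text, then the row if it contains a certain text, and finally the text at the end?

For example:

The HTML snippet looks like this: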

<table>
    <thead>
        <tr>
            <th colspan="2">
                <b>Table Blah</b>
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th width="133" id="sub">
                <p align="right">
                    <b>Row Blah</b>
                </p>
            </th>
            <td>Get Me!</td>
        </tr>
    </tbody>
</table>

If the <thead> contains the text Table Blah, and a <tr> within the <tbody> contains the text Row Blah, then grab the text Get Me! within Row Blah's <tr>.

Answer 1

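"If the <thead> contains the text Table Blah, and a <tr> within the <tbody> contains the text Row Blah, then grab the text Get Me! within Row Blah's <tr>."

Translating the above description into XPath (formatted for readability):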

//table[thead[contains(., 'Table Blah')]]
    /tbody/tr[contains(., 'Row Blah')]
        /td/text()
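To make the matching logic in that description concrete, here is a minimal sketch of the same three checks in plain Python, using only the standard library rather than XPath predicates. It assumes the markup is well-formed enough to parse as XML (true for the snippet in the question, not for HTML in general), and `find_cells` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, which happens to be well-formed XML.
SAMPLE = """<table>
    <thead>
        <tr>
            <th colspan="2">
                <b>Table Blah</b>
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th width="133" id="sub">
                <p align="right">
                    <b>Row Blah</b>
                </p>
            </th>
            <td>Get Me!</td>
        </tr>
    </tbody>
</table>"""


def find_cells(root, table_label, row_label):
    """Collect <td> texts from rows mentioning row_label, but only in
    tables whose <thead> mentions table_label."""
    results = []
    for table in root.iter("table"):  # iter() also yields root itself
        thead = table.find("thead")
        # Step 1: keep the table only if its header contains the label.
        if thead is None or table_label not in "".join(thead.itertext()):
            continue
        for row in table.findall("tbody/tr"):
            # Step 2: keep the row only if any of its text contains the label.
            if row_label in "".join(row.itertext()):
                # Step 3: take the text of the row's <td> cells.
                results.extend((td.text or "").strip() for td in row.findall("td"))
    return results


root = ET.fromstring(SAMPLE)
print(find_cells(root, "Table Blah", "Row Blah"))  # ['Get Me!']
```

Because the table and row are chosen by their text rather than by position, the result does not depend on where the table sits on the page.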

Answer 2

You can write a single XPath expression to reach Get Me!:

//table[contains(thead/tr/th/b, 'Table Blah')]/tbody/tr[contains(th/p/b, 'Row Blah')]/td/text()

Demo from the shell (index.html contains the same data as in the question):

$ scrapy shell index.html
In [1]: response.xpath("//table[contains(thead/tr/th/b, 'Table Blah')]/tbody/tr[contains(th/p/b, 'Row Blah')]/td/text()").extract()
Out[1]: [u'Get Me!']

Finding the correct data in a table using scrapy