Question

I am trying to parse some content out of a table like the one below:

<table class="dataTbl col-4">
                        <tr>
                            <th scope="row">Rent</th>
                            <td>5.5</td>
                            <th scope="row">Management</th>
                            <td>3.3</td>
                        </tr>
                        <tr>
                            <th scope="row">Deposit</th>
                            <td>No</td>
                            <th scope="row">Other</th>
                            <td>No</td>
                        </tr>
                        <tr>
                            <th scope="row">Other2</th>
                            <td>No</td>
                            <th scope="row">Insurance</th>
                            <td>Yes</td>
                        </tr>
</table>

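My goal is to find a specific row header (for example, Rent) and, if it matches, extract the content of the <td> tag that follows it (for example, 5.5).

How can I do this in Python?

I am using Python 3 / Scrapy 1.3.0.

Thanks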

Answer 1

In [9]: Selector(text=html).xpath('//th[text()="Rent"]/following-sibling::td[1]').extract()
Out[9]: ['<td>5.5</td>']

Use text()="Rent" to identify the th element.
Use following-sibling::td to select the td siblings that come after it, and [1] to take the first one.

Answer 2

Use Python regular expressions.

r'\>text\<.+\n +\<td\>(\d+\.\d+)'

In your case, replace text with Rent. Also, this is a useful web page for debugging regular expressions.
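
As a rough illustration of how that pattern could be applied with the standard re module, here is a sketch under a few assumptions: the helper name value_after and the sample string are invented for the example, and the markup is assumed to keep each <th> and <td> on its own line, indented with spaces, as in the question. Note that the capture group (\d+\.\d+) only matches decimal numbers such as 5.5, so cells like "No" are not captured.

```python
import re

# Illustrative sample laid out like the question's HTML:
# one tag per line, indented with spaces.
html = (
    '<tr>\n'
    '    <th scope="row">Rent</th>\n'
    '    <td>5.5</td>\n'
    '    <th scope="row">Management</th>\n'
    '    <td>3.3</td>\n'
    '</tr>\n'
    '<tr>\n'
    '    <th scope="row">Deposit</th>\n'
    '    <td>No</td>\n'
    '</tr>\n'
)


def value_after(html_text, label):
    """Return the decimal number in the <td> on the line after `label`'s <th>.

    Hypothetical helper built on the pattern above; returns None when the
    label is missing or the cell is not a decimal number.
    """
    # re.escape guards against labels containing regex metacharacters.
    pattern = r'>' + re.escape(label) + r'<.+\n +<td>(\d+\.\d+)'
    match = re.search(pattern, html_text)
    return match.group(1) if match else None


print(value_after(html, "Rent"))
print(value_after(html, "Deposit"))
```

Because the pattern depends on the exact whitespace and line breaks of the page, it is more fragile than the XPath approach and will stop matching if the markup is reformatted.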
