Question

I am using scrapy to scrape a website with a structure similar to the following:

<table>
    <td>
        <p>Some text</p>
    </td>
    <td>
        <p>
            <strong>More Text</strong>
            <br />Another Text
        </p>
    </td>
    ...
</table>

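I can use something like //p//text().extract() to scrape all the text inside the different <p> tags. The problem is that this splits the elements within a single tag in the result: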

'text': ['Some text', 'More Text', 'Another Text']

Ideally, I need this:

'text': ['Some text', 'More Text Another Text']

Is it possible to get a result like this?

Answer 1

In these cases, I use the following trick:

    Private Sub notificate(msg As String, Optional title As String)
        ofEvent = False

        lbl_notification.Caption = msg
        If Not IsEmpty(title) Then
            frm_notification.Caption = title
        Else
            frm_notification.Caption = ""
        End If

        frm_notification.Visible = True
        frm_notification.Transparency = 0.5

        For i = 1 To 1000000
            DoEvents
            If ofEvent Then
                GoTo GoOn
            End If
        Next i

    GoOn:

        frm_notification.Visible = False

        frm_notification.Caption = ""
        lbl_notification.Caption = ""
    End Sub

This will give you what you want.
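One way to get a single string per <p>, sketched here as an assumption rather than as the answerer's exact code, is to collect the text nodes of each <p> separately and join them. The helper name join_text_nodes and the sample node lists are hypothetical; the lists mimic what .//text() would return for the two <p> elements in the question, including the whitespace-only nodes.

```python
def join_text_nodes(nodes):
    """Join the text nodes of one element into a single string.

    Splitting on whitespace and re-joining collapses the newlines and
    indentation that surround the nodes in the HTML source.
    """
    return " ".join(" ".join(nodes).split())


# Hypothetical text nodes for each <p> in the question's markup.
per_paragraph = [
    ["Some text"],
    ["\n            ", "More Text", "\n            ", "Another Text\n        "],
]

text = [join_text_nodes(nodes) for nodes in per_paragraph]
print(text)  # ['Some text', 'More Text Another Text']
```

Inside a spider callback, the same helper could be applied per element, for example [join_text_nodes(p.xpath('.//text()').extract()) for p in response.xpath('//p')].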

Answer 2

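Another approach is to use XPath string() (you may need to call strip() on the results afterwards). Note that string() applied to a node-set converts only its first node, so apply it to each <p> separately:

text = [p.xpath('string(.)').extract_first() for p in response.xpath('//p')]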

Extracting all elements from a p tag with scrapy
