Question

       <div class="jokeContent">
            <h2 style="color:#369;">Can I be Frank</h2>
            What did Ellen Degeneres say to Kathy Lee? 
           <p></p> <p>Can I be Frank with you? </p> 
           <p>Submitted by Calamjo</p> 
           <p>Edited by Curtis</p>      
       <div align="right" style="margin-top:10px;margin-bottom:10px;">#joke <a href="http://www.jokesoftheday.net/tag/short-jokes/">#short</a> </div>
       <div style="clear:both;"></div>
    </div>

So I'm trying to extract all the text after the </h2> and before the [div align="right" style=...] node. What I've tried so far:

    jokes = response.xpath('//div[@class="jokeContent"]')
    for joke in jokes:
        text = joke.xpath('text()[normalize-space()]').extract()
        if len(text) > 0:
            yield text

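This works to some extent, but the site's HTML is inconsistent: sometimes the text is wrapped as <p>text</p>, sometimes as <br>text</br>, and sometimes it is just bare text. So I thought it might make sense to simply extract everything after the heading and before the styled div node, and then filter the text afterwards.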

Answer 1

If you are looking for a literal XPath for what you described, it could be something like this:

    In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
    Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

But there may be a more logical, cleaner approach:

    In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
    Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

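This selects only the paragraph tags. You said the text may be in tags other than paragraphs; you can use the self::tag test to match several different tags: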

    In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
    Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
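One caveat, offered as a hedged aside: `<br>` is a void element, so text written after it ends up as a sibling text node rather than a child of the `<br>`, and a `self::br` test followed by `//text()` would not pick it up. A minimal sketch of selecting every node between the heading and the right-aligned div, whatever it is wrapped in, is shown below. It assumes lxml is installed (it is used here only so the example runs outside a Scrapy shell), and the sample markup, `BETWEEN`, and `text_between` are hypothetical names made up for illustration.

```python
from lxml import html  # third-party; assumed to be installed

# Hypothetical sample mixing the three layouts mentioned in the question:
# bare text, <p>-wrapped text, and text following a <br>.
SAMPLE = """
<div class="jokeContent">
    <h2>Can I be Frank</h2>
    What did Ellen Degeneres say to Kathy Lee?
    <p>Can I be Frank with you? </p>
    <br>Submitted by Calamjo
    <div align="right">#joke <a href="#">#short</a> </div>
    <div style="clear:both;"></div>
</div>
"""

# Every sibling node (element or text) after the <h2> that still has the
# right-aligned <div> somewhere after it, then the non-blank text inside it.
BETWEEN = (
    './h2/following-sibling::node()'
    '[following-sibling::div[@align="right"]]'
    '/descendant-or-self::text()[normalize-space()]'
)


def text_between(block):
    """Return the stripped text found between the <h2> and the right-aligned <div>."""
    return [t.strip() for t in block.xpath(BETWEEN)]


doc = html.fromstring(SAMPLE)
for block in doc.xpath('//div[@class="jokeContent"]'):
    print(text_between(block))
```

The same expression should work unchanged with `joke.xpath(...)` inside the question's loop, since the path is relative to each `jokeContent` div.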

Edit: apparently I missed the text directly under the div itself. This can be fixed with the | (union) operator:

    In [3]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
    Out[3]: 
    [u'\n            What did Ellen Degeneres say to Kathy Lee? \n           ',
     u'Can I be Frank with you? ',
     u'Submitted by Calamjo',
     u'Edited by Curtis']

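The normalize-space(.) predicate is only there to drop text nodes that contain no actual text (e.g. '\n'); it does not trim the nodes that are kept.
You can append the first part of this XPath to any of the expressions above and you will get similar results.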
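Since the extracted strings still carry their surrounding whitespace, the "filter the text afterwards" step from the question could look roughly like the sketch below. It is plain Python with no Scrapy dependency. The helper name `clean_fragments` and the idea of dropping the "Submitted by" / "Edited by" credit lines are assumptions for illustration, not something the original answer specifies.

```python
import re

# Hypothetical credit prefixes; adjust to whatever the real pages use.
CREDIT_PREFIXES = ("Submitted by", "Edited by")


def clean_fragments(fragments, drop_prefixes=CREDIT_PREFIXES):
    """Collapse whitespace, then drop empty fragments and credit lines."""
    cleaned = []
    for fragment in fragments:
        # Turn runs of spaces/newlines into one space and trim the ends.
        text = re.sub(r"\s+", " ", fragment).strip()
        if text and not text.startswith(drop_prefixes):
            cleaned.append(text)
    return cleaned


# The list below mirrors the output of the last XPath shown above.
raw = [
    "\n            What did Ellen Degeneres say to Kathy Lee? \n           ",
    "Can I be Frank with you? ",
    "Submitted by Calamjo",
    "Edited by Curtis",
]
print(clean_fragments(raw))
```

Passing `drop_prefixes=()` keeps the credit lines and only normalizes whitespace.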

Using XPath for web scraping: extracting all the text between two nodes?
