Question

I'm trying to scrape the title from the following HTML:

<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally risqu&eacute; remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>

I tried using:

def parse_article(self, response):
                for href in response.xpath('//font[@color="#5FA505"]/'):

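But the title (Coed makes unintentionally ...) isn't actually wrapped in any tag, so I can't get at that content. Is there a way to grab content that isn't enclosed in a <p> or any other kind of tag?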

Edit: //font[b = "Claim:"]/following-sibling::text() works, but it also grabs and displays this HTML further down the page:

<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends

Answer 1

Assuming you know the Claim: text beforehand, locate the font element by the text of its b child, then get the following text siblings:

//font[b = 'Claim:']/following-sibling::text()

Demo from the Scrapy Shell:

In [1]: "".join(map(unicode.strip, response.xpath("//font[b = 'Claim:']/following-sibling::text()").extract()))
Out[1]: u'Coed makes unintentionally risqu\xe9 remark about professor\'s "little quizzies."'

Note that these join and strip calls should ideally be replaced by the appropriate input or output processors used in Item Loaders.

How to scrape an untagged paragraph with Scrapy
