I am trying to scrape the title from the following HTML:
<FONT COLOR=#5FA505><B>Claim:</B></FONT> Coed makes unintentionally risqué remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>
I tried using:
def parse_article(self, response):
    for href in response.xpath('//font[@color="#5FA505"]/'):
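But the title (Coed makes unintentionally...) is not actually wrapped in any tag, so I cannot get at that content. Is there a way to extract content that is not enclosed in a <p> or any other kind of tag?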
Edit: //font[b = "Claim:"]/following-sibling::text()
works, but it also grabs and displays this HTML further down the page:
<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends
Answer 0 (score: 1)
Assuming you know the Claim: text beforehand, locate the font element by the text of its b child, then get the following text siblings:
//font[b = 'Claim:']/following-sibling::text()
Demo from the Scrapy Shell:
In [1]: "".join(map(unicode.strip, response.xpath("//font[b = 'Claim:']/following-sibling::text()").extract()))
Out[1]: u'Coed makes unintentionally risqu\xe9 remark about professor\'s "little quizzies."'
Note that these join and strip calls should ideally be replaced by the corresponding input or output processors used in Item Loaders.