Question

Here is the HTML string in question.

<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>

With BeautifulSoup, this code

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text

gives me

a book of grammar rules: 

which is exactly what I want.

How can I get the same result with Scrapy?

from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()

This code gives me

['a ', ' of grammar ', ': ']

How should I fix this?

Answer 1

You can use the following code to get all the text inside the div and its children:

text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)

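Your selector returns only the text that sits directly inside the div, but part of the text lives in child elements (a). That is why you need to add a space before ::text: with the space, the text of descendant elements is included in the result.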
