Question

The HTML I am parsing and scraping contains the following code:

<li> <span> 929</span> Serve Returned </li>

How can I extract only the text node of the <li>, "Serve Returned" in this case, using BeautifulSoup?

.string does not work because the <li> has a child element, and .text also returns the text inside the <span>.
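To make the problem concrete, here is a minimal sketch of the behaviour described above. It assumes Python 3 with the bs4 package and the built-in "html.parser"; the variable names string_value and text_value are only illustrative.

```python
from bs4 import BeautifulSoup

html = "<li> <span> 929</span> Serve Returned </li>"
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# .string is None here because <li> has more than one child node
string_value = li.string

# .text concatenates every descendant string, including the one inside <span>
text_value = li.text

print(repr(string_value))
print(repr(text_value))
```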

Answer 1

I used the str.replace method:

>>> li = soup.find('li') # or however you need to drill down to the <li> tag 
>>> # remove the <span> text from the full text, then trim the leftover whitespace
>>> mytext = li.text.replace(li.find('span').text, "").strip()
>>> print(mytext)
Serve Returned

Answer 2

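import bs4
html = r"<li> <span> 929</span> Serve Returned </li>"
# name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(html, "html.parser")
# string=True matches text nodes; recursive=False limits the search to direct children of <li>
print(soup.li.find_all(string=True, recursive=False))

This gives:

[' ', ' Serve Returned ']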

The first element is the "text" that comes before the span. This method can help you find text before and after (and in between) any child elements.
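If a single clean string is wanted rather than a list, the direct text nodes can be stripped and joined. This is only a sketch, assuming Python 3 and a bs4 version that accepts the string= argument (4.4.0 or later); own_text is a hypothetical name chosen for illustration.

```python
from bs4 import BeautifulSoup

html = "<li> <span> 929</span> Serve Returned </li>"
soup = BeautifulSoup(html, "html.parser")

# Collect only the text nodes that are direct children of <li>
pieces = soup.li.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and trim the rest before joining
own_text = " ".join(p.strip() for p in pieces if p.strip())

print(own_text)
```

Joining with a single space keeps text that appears before and after a child element readable as one string; a different separator may suit other markup better.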

Extracting the text node inside a tag that has a child element in beautifulsoup4
