I am trying to uppercase all (user-visible) text in an HTML file. Here is the obvious approach:
from bs4 import BeautifulSoup

def upcaseAll(str):
    soup = BeautifulSoup(str)
    for tag in soup.find_all(True):
        for s in tag.strings:
            s.replace_with(unicode(s).upper())
    return unicode(soup)
It crashes:
  File "/Users/malvolio/flip.py", line 23, in upcaseAll
    for s in tag.strings:
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 827, in _all_strings
    for descendant in self.descendants:
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1198, in descendants
    current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'
Every variation I can think of crashes the same way. BS4 does not seem to like it when I replace a lot of NavigableStrings. How can I do this?
Answer 0 (score: 2)
You should not use str as the function parameter name, because it shadows the Python builtin of the same name.
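To make the shadowing point concrete, here is a small sketch with two made-up functions (shadowed and renamed are illustrative names only). Inside the first one, str refers to the argument, so the builtin type can no longer be called:

```python
def shadowed(str):
    # Here `str` is the argument (a string), not the builtin type.
    try:
        return str(42)  # tries to call a string object
    except TypeError as exc:
        return "TypeError: %s" % exc


def renamed(html):
    # With a different parameter name, the builtin `str` is still reachable.
    return str(42)


print(shadowed("<p>hi</p>"))
print(renamed("<p>hi</p>"))
```

The question's function never calls str itself, so this is not what causes the crash, but renaming the parameter avoids surprises later.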
Also, you should be able to simply use prettify with a formatter to convert the visible elements:
    ...
    return soup.prettify(formatter=lambda x: unicode(x).upper())
I have now tested this and it works:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.stackoverflow.com')
soup = BeautifulSoup(r.content)
print soup.prettify(formatter=lambda x: unicode(x).upper())[:200]
<!DOCTYPE html>
<html>
<head>
<title>
STACK OVERFLOW
</title>
<link href="//CDN.SSTATIC.NET/STACKOVERFLOW/IMG/FAVICON.ICO?V=00A326F96F68" rel="SHORTCUT ICON"/>
<link href="//CDN.SSTATIC.NE
...
See the "Output formatters" section of the Beautiful Soup documentation for more details.
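Note that the formatter in the output above also upper-cased attribute values such as href. If only the text nodes should change, one option is to collect them into a list first and replace them afterwards. The original crash appears to come from replace_with changing the tree while tag.strings is still walking it, and iterating over a list avoids that. This is a minimal sketch, not a tested drop-in fix: it assumes Python 3 (so str takes the place of unicode), the html.parser backend, and that script and style contents do not count as user-visible. The names upcase_visible_text and SKIPPED_PARENTS are illustrative and not part of bs4.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: text inside these tags is not user-visible.
SKIPPED_PARENTS = {"script", "style"}


def upcase_visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so the tree is not being traversed
    # while the nodes are replaced.
    for node in soup.find_all(string=True):
        # Skip comments, doctypes and other special string subclasses.
        if type(node) is not NavigableString:
            continue
        if node.parent is not None and node.parent.name in SKIPPED_PARENTS:
            continue
        node.replace_with(node.upper())
    return str(soup)


sample = '<p title="keep me">Hello <b>world</b></p><script>var x = "y";</script>'
print(upcase_visible_text(sample))
```

Unlike the prettify approach, this leaves attribute values and the original whitespace untouched.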
Hope this helps.