I have an HTML document and I want to replace all of the smart quotes with regular quotes. I tried this:
    for text_element in html.findAll():
        content = text_element.string
        if content:
            new_content = content \
                .replace(u"\u2018", "'") \
                .replace(u"\u2019", "'") \
                .replace(u"\u201c", '"') \
                .replace(u"\u201d", '"') \
                .replace("e", "x")
            text_element.string.replaceWith(new_content)
(The e/x replacement is only there to make it easy to see whether anything is working.)
But this is my output:
<p>
This amount of investment is producing results: total final consumption in IEA countries is estimated to be
<strong>
60% lowxr
</strong>
today because of energy efficiency improvements over the last four decades. This has had the effect of
<strong>
avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
</strong>
.
</p>
It seems that BS only drills down to the innermost tags, but I need to reach all of the text in the entire page.
Answer 0 (score: 0)
This works, but maybe there is a cleaner way:
    for text_element in html.findAll():
        for child in text_element.contents:
            if child:
                content = child.string
                if content:
                    new_content = remove_smart_quotes(content)
                    child.string.replaceWith(new_content)
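A likely reason the original loop skipped most of the text: `.string` on a tag is `None` whenever the tag has more than one child, so only tags wrapping a single string were touched. Here is a minimal sketch of that behaviour, assuming Beautiful Soup 4 and the built-in `html.parser` (the sample markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has three children (text, <strong>, text).
html_doc = "<p>one <strong>two</strong> three</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# <p> has several children, so .string gives up and returns None.
p_string = soup.p.string

# <strong> wraps exactly one string, so .string returns it.
strong_string = soup.strong.string

# Iterating .contents exposes the direct text children of <p> as well.
p_children = [str(child) for child in soup.p.contents]
```

Iterating `.contents` works because each direct text child is itself a string node, and `.string` on a string node returns that node.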
Answer 1 (score: 0)
Rather than selecting and filtering all elements/tags, you can select the text nodes directly by passing True for the string argument:
    for text_node in soup.find_all(string=True):
        pass  # do something with each text node
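As the documentation states, the string argument is new in version 4.4.0, which means you may need to use the text argument instead, depending on your version: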
    for text_node in soup.find_all(text=True):
        pass  # do something with each text node
Here is the relevant code for replacing the values:
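    from bs4 import BeautifulSoup

    def remove_smart_quotes(text):
        # Map curly single and double quotes to their ASCII equivalents.
        return text.replace(u"\u2018", "'") \
                   .replace(u"\u2019", "'") \
                   .replace(u"\u201c", '"') \
                   .replace(u"\u201d", '"')

    # `html` is the markup to clean; the 'lxml' parser must be installed.
    soup = BeautifulSoup(html, 'lxml')
    for text_node in soup.find_all(string=True):
        text_node.replace_with(remove_smart_quotes(text_node))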
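One caveat worth considering: `find_all(string=True)` also returns comments and the contents of `<script>`/`<style>` tags, and replacing a comment node with a plain string turns it into visible text. The following is only a sketch of a more defensive variant, not the answerer's code; the name `straighten_quotes` and the sample markup are made up, and it uses `str.translate` in place of chained `replace` calls:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Translation table: each smart quote code point maps to its ASCII equivalent.
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def straighten_quotes(soup):
    """Replace smart quotes in visible text nodes, modifying soup in place."""
    for text_node in soup.find_all(string=True):
        if isinstance(text_node, Comment):
            continue  # leave comments untouched so they stay comments
        if text_node.parent.name in ("script", "style"):
            continue  # leave embedded code and CSS alone
        text_node.replace_with(text_node.translate(SMART_QUOTES))
    return soup

# Hypothetical sample input for illustration.
sample = "<p>\u201cHi\u201d <b>it\u2019s</b> <!-- \u2018note\u2019 --></p>"
result = str(straighten_quotes(BeautifulSoup(sample, "html.parser")))
```

Whether to skip comments and scripts depends on the page; drop those checks if every string in the document should be rewritten.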
As a side note, the Beautiful Soup documentation actually has a section on smart quotes.
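For reference, that section covers `UnicodeDammit`, which can convert Microsoft smart quotes while decoding raw bytes. The sketch below follows the documented example and assumes the input is Windows-1252 encoded bytes; it does not apply to text that has already been decoded to Unicode, which is the case the loop above handles:

```python
from bs4 import UnicodeDammit

# In Windows-1252, 0x93/0x94 are curly double quotes and 0x92 is a curly
# apostrophe, as Microsoft Word tends to emit them.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# smart_quotes_to="ascii" asks for plain ASCII quotes during decoding.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup
```

Passing `smart_quotes_to="html"` instead produces HTML entities such as `&ldquo;`.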