Question

I have an HTML document and I want to replace all smart quotes with regular quotes. I tried this:

for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)

(The e/x substitution is only there to make it easy to see whether the replacement is working.)

But this is my output:

<p>
 This amount of investment is producing results: total final consumption in IEA countries is estimated to be
   <strong>
      60% lowxr
   </strong>
 today because of energy efficiency improvements over the last four decades. This has had the effect of
   <strong>
      avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
   </strong>
 .
</p>

It seems BS is only reaching the most deeply nested tags, but I need to get at all of the text in the entire page.

Answer 1

This works, but perhaps there is a cleaner way:

for text_element in html.findAll():
    for child in text_element.contents:
        if child:
            content = child.string
            if content:
                new_content = remove_smart_quotes(content)
                child.string.replaceWith(new_content)

Answer 2

Rather than selecting and filtering all elements/tags, you can select the text nodes directly by passing True for the string argument:

for text_node in soup.find_all(string=True):
    pass  # do something with each text node

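As the documentation states, the string argument is new in version 4.4.0, which means you may need to use the text argument instead, depending on your version: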

for text_node in soup.find_all(text=True):
    pass  # do something with each text node
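
To see why the loop in the question skips part of the text, here is a minimal sketch. The markup in sample_html is made up for illustration, and html.parser is used only so the example does not depend on lxml. A tag with several children has .string set to None, whereas find_all(string=True) returns the text nodes at every depth:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: text sits both directly inside <p> and inside <strong>.
sample_html = "<p>\u201cOuter\u201d text <strong>\u2018inner\u2019</strong> tail</p>"
soup = BeautifulSoup(sample_html, "html.parser")

# <p> has three children (text, <strong>, text), so .string is None
# and a loop that checks "if content:" silently skips it.
p_string = soup.p.string

# <strong> has exactly one child, so .string returns that text.
strong_string = soup.strong.string

# Selecting text nodes directly picks up all three pieces of text.
text_nodes = [str(node) for node in soup.find_all(string=True)]
print(p_string, repr(strong_string), text_nodes)
```

This matches the output in the question, where only the text inside the strong tags was changed.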

Here is the relevant code for replacing the values:

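from bs4 import BeautifulSoup

def remove_smart_quotes(text):
    # Map curly single and double quotes to their straight ASCII equivalents.
    return text.replace(u"\u2018", "'") \
               .replace(u"\u2019", "'") \
               .replace(u"\u201c", '"') \
               .replace(u"\u201d", '"')

# html is assumed to hold the document markup as a string.
soup = BeautifulSoup(html, 'lxml')

# Replace every text node in place, at any nesting depth.
for text_node in soup.find_all(string=True):
    text_node.replace_with(remove_smart_quotes(text_node))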
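
If the chained replace calls feel noisy, one possible alternative (assuming Python 3) is a translation table, which handles all four characters in a single pass. The names SMART_QUOTES and remove_smart_quotes_translate are made up for this sketch and are not part of Beautiful Soup:

```python
# Hypothetical mapping from smart-quote characters to straight ASCII quotes.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def remove_smart_quotes_translate(text):
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(SMART_QUOTES)

cleaned = remove_smart_quotes_translate("\u201cIt\u2019s fine,\u201d she said.")
print(cleaned)
```

It can be dropped into the loop over text nodes in place of the replace-based function, since it takes and returns a string in the same way.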

As a side note, the Beautiful Soup documentation actually has a section on smart quotes.
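
For reference, that section is about UnicodeDammit and its smart_quotes_to argument. As far as I can tell it only applies while decoding bytes from a Windows-1252 style encoding, so it would not touch smart quotes that are already Unicode characters in a parsed document, which is the case in the question. A short sketch along the lines of the documentation example, with made-up sample bytes:

```python
from bs4 import UnicodeDammit

# Hypothetical Windows-1252 bytes containing Microsoft smart quotes.
markup = b"I just \x93love\x94 Microsoft Word\x92s smart quotes"

# Decode as Windows-1252 and convert the smart quotes to ASCII quotes.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup
print(converted)
```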

Replacing all smart quotes in Beautiful Soup