Replace all smart quotes in Beautiful Soup

Time: 2017-02-24 17:15:49

Tags: python beautifulsoup

I have an HTML document, and I want to replace all smart quotes with regular quotes. I tried this:

for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)

(The e / x replacement is only there to make it easy to see whether the code works.)

But this is my output:

<p>
 This amount of investment is producing results: total final consumption in IEA countries is estimated to be
   <strong>
      60% lowxr
   </strong>
 today because of energy efficiency improvements over the last four decades. This has had the effect of
   <strong>
      avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
   </strong>
 .
</p>

It seems BS only changes the text of the innermost tags, but I need to reach all the text in the whole page.

2 answers:

Answer 0: (score: 0)

This works, but maybe there is a cleaner way:

for text_element in html.findAll():
    for child in text_element.contents:
        if child:
            content = child.string
            if content:
                new_content = remove_smart_quotes(content)
                child.string.replaceWith(new_content)
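
For context, the loop in the question seems to skip most of the text because `.string` on a tag is `None` whenever that tag has more than one child, so only tags that wrap a single string get rewritten. A minimal sketch of that behaviour, using a made-up snippet and the `html.parser` backend (both are assumptions for illustration, not taken from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: <p> has three children, <strong> has one.
soup = BeautifulSoup("<p>a <strong>b</strong> c</p>", "html.parser")

p_string = soup.p.string            # None: <p> holds text, a tag, and more text
strong_string = soup.strong.string  # the single text child of <strong>
child_count = len(soup.p.contents)  # number of direct children of <p>

print(p_string, strong_string, child_count)
```

Iterating over `.contents`, as in this answer, reaches the loose text nodes that sit next to child tags.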

Answer 1: (score: 0)

Instead of selecting and filtering all elements/tags, you can select the text nodes directly by passing True for the string argument:

for text_node in soup.find_all(string=True):
  pass  # do something with each text node

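As the documentation notes, the string argument is new in version 4.4.0, which means you may need to use the text argument instead, depending on your version: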

for text_node in soup.find_all(text=True):
  pass  # do something with each text node
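
To make the difference concrete, here is a small sketch of what `find_all(string=True)` returns. The sample markup and the `html.parser` backend are assumptions chosen for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with text both inside and around a child tag.
soup = BeautifulSoup("<p>a <strong>b</strong> c</p>", "html.parser")

# Every text node in document order, regardless of nesting depth.
text_nodes = [str(node) for node in soup.find_all(string=True)]
print(text_nodes)
```

The text on either side of `<strong>` shows up as its own node, which is what the tag-based loop in the question misses.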

Here is the relevant code for replacing the values:

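from bs4 import BeautifulSoup

def remove_smart_quotes(text):
  # Map curly single and double quotes to their ASCII equivalents.
  return text.replace(u"\u2018", "'") \
             .replace(u"\u2019", "'") \
             .replace(u"\u201c", '"') \
             .replace(u"\u201d", '"')

# html is assumed to hold the page markup as a string
soup = BeautifulSoup(html, 'lxml')

for text_node in soup.find_all(string=True):
  text_node.replace_with(remove_smart_quotes(text_node))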
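
One caveat with this approach: `find_all(string=True)` also yields comments and the contents of script/style tags, and replacing a comment with a plain string would likely turn it into visible text. Below is a hedged variant that skips those nodes. The function name `replace_smart_quotes_in_soup`, the `SMART_QUOTES` table, and the sample markup are all made up for this sketch, and it uses `html.parser` rather than `lxml` only to avoid the extra dependency:

```python
from bs4 import BeautifulSoup, Comment

# Translation table from smart quote code points to plain ASCII quotes.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

def replace_smart_quotes_in_soup(soup):
    """Replace smart quotes in ordinary text nodes, modifying soup in place."""
    for text_node in soup.find_all(string=True):
        # Leave comments untouched so they stay comments.
        if isinstance(text_node, Comment):
            continue
        # Leave script and style contents untouched as well.
        parent = text_node.parent
        if parent is not None and parent.name in ("script", "style"):
            continue
        cleaned = str(text_node).translate(SMART_QUOTES)
        # Only touch the tree when something actually changed.
        if cleaned != str(text_node):
            text_node.replace_with(cleaned)
    return soup

# Hypothetical sample input containing smart quotes in text and in a comment.
sample = "<p>\u201cHello\u201d <strong>it\u2019s</strong> fine<!-- \u2018note\u2019 --></p>"
soup = replace_smart_quotes_in_soup(BeautifulSoup(sample, "html.parser"))
print(soup)
```

Using `str.translate` with a single table is a stylistic choice; the chained `.replace` calls above should produce the same text.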

As a side note, the Beautiful Soup documentation actually has a section on smart quotes.