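python - Struggling with string encoding and accented quotes/apostrophes

Time: 2013-02-25 21:17:53

Tags: python unicode character-encoding beautifulsoup

I have a simple RSS feed script that fetches the content of each article and runs it through some simple processing before saving it to a database.

The problem is that after the text has been run through the code below, all the accented (curly) apostrophes and quotation marks are stripped from it.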

# this is just an example string, I use feed_parser to download the feeds
string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

text = BeautifulSoup(string)
# does some simple soup processing

string = text.renderContents()
string = string.decode('utf-8', 'ignore')
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print "".join([x for x in string if ord(x)<128])

The result is:

> <p>  </p><p>This is a sentence. This is a sentence. I'm a programmer. Im a programmer, however I dont graphic design.</p>

All of the quotes/apostrophes written as HTML entities have been removed. How can I fix this?
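
For what it's worth, the loss can be reproduced without BeautifulSoup. The sketch below is Python 3 using only the standard library (an assumption, since the code in the question is Python 2), and `html.unescape` merely stands in for whatever entity decoding the parser performs. It suggests that the final ASCII filter, not the parsing, is what drops the apostrophes: NFKD turns the no-break space into a plain space, but it leaves the curly apostrophe untouched, so the `ord(x) < 128` filter throws it away.

```python
import html
import unicodedata

raw = "&#160;I&#8217;m a programmer"

# Decode the numeric character references into real Unicode characters.
decoded = html.unescape(raw)

# NFKD maps the no-break space (U+00A0) to a plain space, but the right
# single quotation mark (U+2019) has no decomposition and stays as it is.
normalized = unicodedata.normalize("NFKD", decoded)

# Keeping only code points below 128 therefore drops the apostrophe.
ascii_only = "".join(ch for ch in normalized if ord(ch) < 128)

print(repr(decoded))
print(repr(normalized))
print(repr(ascii_only))
```

In the Python 2 code the filter runs over UTF-8 bytes rather than code points, but the effect should be the same, because every byte of the encoded U+2019 is above 127.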

1 Answer:

Answer 0 (score: 1)

The following code works for me. You probably missed the convertEntities argument of the BeautifulSoup constructor:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, which provides convertEntities

string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

text = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES)  # note the convertEntities argument
# does some simple soup processing

string = text.renderContents()
string = string.decode('utf-8')
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
# I don't know why you are doing this
#string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print string
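
If the database really does need plain ASCII punctuation, one option is to map the typographic characters to their ASCII look-alikes instead of filtering them out. This is only a rough sketch in Python 3 with the standard library (an assumption; the code above is Python 2 with BeautifulSoup 3). The names `ASCII_PUNCTUATION` and `to_plain_quotes` are made up for illustration, and the mapping covers just a handful of characters.

```python
import html

# Hypothetical mapping; extend it with whatever punctuation the feeds contain.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # no-break space
})


def to_plain_quotes(text):
    """Decode character references, then swap curly quotes for ASCII ones."""
    return html.unescape(text).translate(ASCII_PUNCTUATION)


print(to_plain_quotes("I&#8217;m a programmer, however I don&#8217;t graphic design."))
```

Note that `html.unescape` also decodes references such as `&lt;` and `&amp;`, so this is better suited to extracted text than to markup that will be parsed again.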