Question

I'm trying to scrape some Chinese text from a website using Python. When I get it, it is surrounded by HTML tags, like this:

我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.

(I had to format it as code to keep the HTML tags from disappearing.) However, once I use slicing to get rid of the HTML tags, I get:

我今天的心情ㅈヘ好。

Why does this strange character appear in the second-to-last position? Thanks for your help!

Answer 1

With the third-party regex module (pip install regex), you can filter Chinese characters using the Unicode script property \p{Han}:

>>> text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
>>> import regex
>>> print(u''.join(regex.findall(r'\p{Han}+', text, flags=regex.UNICODE)))
我今天的心情不好
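
If installing regex is not an option, a rough equivalent can be sketched with the standard re module. This is only an approximation: it assumes that matching the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough, whereas \p{Han} also covers the rarer extension blocks. The names HAN_RUN and keep_han are made up for this illustration.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as "Chinese"
# here; characters in the extension blocks are not matched by this range.
HAN_RUN = re.compile(u'[\u4e00-\u9fff]+')


def keep_han(text):
    """Return only the characters of text that fall in U+4E00..U+9FFF."""
    return u''.join(HAN_RUN.findall(text))


text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
result = keep_han(text)
print(result)
```

As with the regex version, the full stop 。 (U+3002) is dropped because it is punctuation, not an ideograph.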

Alternatively, use unicodedata.name:

>>> import unicodedata
>>> unicodedata.name(u'a')
'LATIN SMALL LETTER A'
>>> unicodedata.name(u'我')
'CJK UNIFIED IDEOGRAPH-6211'
>>> unicodedata.name(u'今')
'CJK UNIFIED IDEOGRAPH-4ECA'

>>> text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
>>> # Pass a default of '' so characters without a Unicode name do not raise ValueError.
>>> print(u''.join(c for c in text if unicodedata.name(c, '').startswith('CJK')))
我今天的心情不好
