string = "Special $#! characters spaces 888323 Kek ཌི ༜ 郭 ༜ དྀ "
The result should be: "Specialcharactersspaces888323Kek郭"
I tried:
print ''.join(c for c in string.decode('utf-8') if u'\u4e00' <= c <= u'\u9fff')
but it raises an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u90ed' in position 49: ordinal not in range(128)
My question is the same as the title: how do I remove special characters and spaces, but keep the Chinese characters?
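A note on the traceback: in Python 2, calling .decode('utf-8') on a value that is already a unicode object first encodes it implicitly with the ASCII codec, which is what raises UnicodeEncodeError here. Separately, the range test keeps only characters in U+4E00 to U+9FFF, so it would drop the ASCII letters and digits as well. Below is a minimal sketch of the same generator approach with both points addressed, written for Python 3. The helper name keep_alnum_and_cjk and the ASCII_ALNUM set are illustrative and not part of the original post.

```python
import string

# Assumption: "alphanumeric" means ASCII letters and digits only.
ASCII_ALNUM = set(string.ascii_letters + string.digits)


def keep_alnum_and_cjk(text):
    """Keep ASCII letters/digits and characters in U+4E00..U+9FFF; drop the rest."""
    return ''.join(
        c for c in text
        if c in ASCII_ALNUM or '\u4e00' <= c <= '\u9fff'
    )


# The text is already a str (unicode) in Python 3, so no .decode() call is needed.
sample = "Special $#! characters spaces 888323 Kek ཌི ༜ 郭 ༜ དྀ "
filtered = keep_alnum_and_cjk(sample)
print(filtered)
```

The range U+4E00 to U+9FFF covers only the basic CJK Unified Ideographs block; characters from the extension blocks would need additional ranges.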
Answer 0 (score: 1)
A solution using the re.compile and re.sub functions:
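# -*- coding: utf-8 -*-
import re
# unicode literal, so that on Python 2 the pattern is matched against characters rather than raw bytes
string = u"Special $#! characters spaces 888323 Kek ཌི ༜ 郭 ༜ དྀ "
# define a pattern that matches every character except alphanumerics and Chinese characters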
pattern = re.compile(u'[^a-z0-9⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE | re.IGNORECASE)
result = pattern.sub('', string)
print(result)  # the parenthesized form works on both Python 2 and Python 3
Output:
Specialcharactersspaces888323Kek郭
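
For Python 3, the same idea can be sketched with escaped code point ranges instead of literal characters in the class, which avoids depending on the source file's encoding. This is a sketch under a stated assumption: "Chinese" is taken to mean CJK Unified Ideographs (U+4E00 to U+9FFF) plus Extension A (U+3400 to U+4DBF), which is a narrower set than the character class in the answer above. The names NOT_ALNUM_OR_CJK and strip_special are illustrative.

```python
import re

# Assumption: keep ASCII letters and digits, CJK Unified Ideographs
# (U+4E00-U+9FFF) and CJK Extension A (U+3400-U+4DBF); remove everything else.
NOT_ALNUM_OR_CJK = re.compile(r'[^A-Za-z0-9\u3400-\u4dbf\u4e00-\u9fff]')


def strip_special(text):
    """Return text with every character outside the allowed set removed."""
    return NOT_ALNUM_OR_CJK.sub('', text)


cleaned = strip_special("Special $#! characters spaces 888323 Kek 郭 ")
print(cleaned)
```

Listing A-Z explicitly instead of using re.IGNORECASE keeps the allowed set easy to read off the pattern itself.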