Question

Possible duplicate:
Character reading from file in Python

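I want to strip all special characters from an input string read from a file, keeping only actual letters (Cyrillic letters must not be removed either). The solution I found manually declares the string as unicode and compiles the pattern with the re.UNICODE flag, so that letters from different languages are recognized.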

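# -*- coding: utf-8 -*-
import re

# Raw string so the backslashes reach the regex engine unchanged.
# \w already covers digits, so \d is redundant but harmless here.
pattern = re.compile(r"[^\w\d]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'   # byte string (Python 2 str)
uni = u'ähm whatßs äüöp ×äØü'    # unicode string
words = pattern.split(n_uni)    # doesn't work
u_words = pattern.split(uni)    # works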

So if I write the string directly in the source code and manually mark it as Unicode, I get the desired output, while the non-Unicode string only gives me garbage:

"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters

My question now is: how do I define the input from a file as Unicode?

Answer 1

My question now is: how do I define the input from a file as Unicode?

Straight from the docs.

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
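
To connect this back to the question, here is a minimal sketch that decodes the file while reading it and then splits each line on non-word characters. It assumes the file is UTF-8 encoded; `read_words` and `NON_WORD` are hypothetical names chosen for illustration, and `io.open` is used in place of `codecs.open` because it behaves the same way on Python 2.6+ and Python 3.

```python
import io
import re

# \W matches anything that is not a letter, digit, or underscore;
# re.UNICODE makes that classification cover non-ASCII letters as well.
NON_WORD = re.compile(r"\W+", re.UNICODE)


def read_words(path, encoding="utf-8"):
    """Return the words found in a text file (hypothetical helper).

    The encoding is an assumption: pass the one the file was saved with.
    """
    words = []
    # io.open decodes every line to a Unicode string as it is read,
    # so the regex sees characters rather than raw bytes.
    with io.open(path, encoding=encoding) as f:
        for line in f:
            # Splitting can yield empty strings at the edges; drop them.
            words.extend(w for w in NON_WORD.split(line) if w)
    return words
```

Because the lines are already Unicode when they reach `split`, letters such as ä, ß, Ø, and Cyrillic characters are kept, while symbols such as × act as separators. On Python 3 the built-in `open` accepts the same `encoding` argument.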
