Question

I want to process a sentence like this: "The gift costs nearly £100."

The sentence is in a text file. I read it in Python, and when I print it, I get:

print "text",text
text The gift costs nearly Â£100.

I tried to replace it with a placeholder code (when I finish processing, I use another function, unmapstrangechars, to get the original data back):

def mapstrangechars(text):
    text = text.replace("Â£","1pound1 ")
    return text 

def unmapstrangechars(text):
    text = text.replace("1pound1 ","Â£")    
    return text

But I get an error saying that Â£ is not an ASCII character. How can I solve this?

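It would help to at least understand how to replace a non-ASCII character with a specific placeholder, so that I can restore the original character afterwards. For example: original: The gift costs nearly £100. copy1: The gift costs nearly 1pound1 100. output: The gift costs nearly £100.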

The output is actually:

print text

The whole code (the txt file contains "The gift costs nearly £100."):

if 1==1:     
    import os
    script_dir = os.path.dirname(os.path.realpath(__file__))
    rel_path = "results/article.txt"
    abs_file_path = os.path.join(script_dir, rel_path)       
    thefile = open(abs_file_path)
    text = thefile.read()


    print "text",text


    def mapstrangechars(text):
        #text = text.replace("fdfdsfds","1pound1 ")
        return text

    def unmapstrangechars(text):
        #text = text.replace("1pound1 ","fdfdsfds")    
        return text  

    text = mapstrangechars(text)

    #process the text

    text = unmapstrangechars(text)    
    print "text",text  #this is output

Answer 1

This is because your text file is encoded as 'utf-8', but your terminal/IDE is probably using the Windows-1252 encoding.

In UTF-8, the pound sign is encoded as two bytes: 0xc2 0xa3. This is what you would see if you opened the file in a hex editor.

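When you print it, your terminal/IDE interprets 0xc2 0xa3 as Windows-1252. Like other 8-bit code pages, Windows-1252 expects each byte to map to one character. So when 0xc2 0xa3 is interpreted as Windows-1252 and each byte is mapped to its own character, the following happens: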

0xc2 is displayed as Â
0xa3 is displayed as £
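
To see the byte-by-byte mapping directly, the two bytes can be decoded once with each codec. This is a minimal sketch, not part of the original code; the variable names are made up for illustration.

```python
# The two bytes that UTF-8 uses for the pound sign.
raw = b"\xc2\xa3"

# Correct codec: the two bytes form ONE character, U+00A3 (pound sign).
as_utf8 = raw.decode("utf-8")

# Wrong codec: each byte becomes its own character, U+00C2 then U+00A3.
as_cp1252 = raw.decode("cp1252")

print(len(as_utf8))    # number of characters with the right codec
print(len(as_cp1252))  # number of characters with the wrong codec
```

The second result is exactly the "Â£" pair seen in the printed output.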

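The solution is to decode the text file into a special Python string type called a "Unicode string". Once you have a Python Unicode string, Python is able to re-encode it for your terminal type. That is, Python will decode the UTF-8 and then encode to Windows-1252.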
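
The decode-then-encode round trip can be written out by hand. This sketch assumes the terminal really is Windows-1252 (codec name "cp1252" in Python); `raw`, `text` and `for_terminal` are illustrative names only.

```python
# Bytes as they are stored in the UTF-8 file.
raw = b"The gift costs nearly \xc2\xa3100."

# Step 1: bytes -> Unicode string (what reading with an encoding does).
text = raw.decode("utf-8")

# Step 2: Unicode string -> bytes in the terminal's code page
# (what print does implicitly for the terminal).
for_terminal = text.encode("cp1252")

# In Windows-1252 the pound sign is the single byte 0xa3.
print(repr(for_terminal))
```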

To do this, use the io module's open() function and pass in the encoding argument:

import io
thefile = io.open(abs_file_path, encoding="utf-8")

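When you call read() on thefile, you will get a <type 'unicode'>. It behaves like a regular string. When you pass it to print, Python automatically encodes it so that it displays correctly on your terminal.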
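
A self-contained way to try this out, without the real results/article.txt, is to write the sentence to a temporary file first. The temporary file is only a stand-in assumed for this sketch; io.open with encoding="utf-8" is the call from the answer.

```python
import io
import os
import tempfile

# Hypothetical stand-in for results/article.txt: a UTF-8 file with the sentence.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "wb") as f:
    f.write(b"The gift costs nearly \xc2\xa3100.")

# Read it back decoded: read() returns a Unicode string, not raw bytes.
with io.open(path, encoding="utf-8") as thefile:
    text = thefile.read()

os.remove(path)

# The pound sign is now a single character, so ordinary string
# operations such as replace() see it as one unit.
print(u"\xa3" in text)
```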

You no longer need mapstrangechars() and unmapstrangechars().

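Note: this is specific to Python 2.x, where the built-in open() reads raw bytes without decoding them. Python 3 opens files in text mode by default and, if no encoding is given, uses the locale/language settings to determine the encoding.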
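
For Python 3, a rough equivalent might look like the sketch below. It again uses a temporary file as a stand-in for the real one. The assumption that the default comes from locale.getpreferredencoding(False) holds for typical Python 3 setups, but it can differ (for example when UTF-8 mode is enabled), so passing the encoding explicitly is the safer choice.

```python
import locale
import os
import tempfile

# Hypothetical stand-in for results/article.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as f:
    f.write(b"The gift costs nearly \xc2\xa3100.")

# What open() would typically fall back to if no encoding were given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Explicit encoding: the result does not depend on the machine's locale.
with open(path, encoding="utf-8") as thefile:
    text = thefile.read()

os.remove(path)
print(type(text))
```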

Handling non-ASCII characters, such as the pound sign, in Python
