Question

When I try to read a text file with the following Python code:

    with open(file, 'r') as myfile:
        data = myfile.read()

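some strange characters starting with \x.... show up. What do they represent, and how do I get rid of them when reading the text file?

e.g.

...... \xc2\xa0 \xc2\xa0 chapter 1 tuesday 1984 \xe2\x80\x9chey, jack, your mom sent me to pick you up\xe2\x80\x9d jacob robbins knew better than to accept a ride from a stranger, but when his mom's friend ronny was waiting for him in front of school he reluctantly got in the car. \xe2\x80\x9cmy name is jacob ........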

Answer 1

This is UTF-8 encoded text. Open the file as UTF-8:

with open(file, 'r', encoding='utf-8') as myfile:
   ...

In Python 2.x:

import codecs

with codecs.open(file, 'r', encoding='utf-8') as myfile:
   ...

Unicode In Python, Completely Demystified
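
As a rough illustration of the advice above, here is a minimal Python 3 sketch. The sample bytes and the temporary file are assumptions made up for the example (modelled on the text in the question); only open(..., encoding='utf-8') comes from the answer itself.

```python
import os
import tempfile

# Raw bytes like those in the question: two no-break spaces, then a quoted phrase.
raw = b"\xc2\xa0\xc2\xa0 chapter 1 \xe2\x80\x9chey, jack\xe2\x80\x9d"

# Write the bytes to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    sample_path = tmp.name

try:
    # Decoding as UTF-8 turns each multi-byte sequence into one character.
    with open(sample_path, "r", encoding="utf-8") as myfile:
        data = myfile.read()
finally:
    os.remove(sample_path)

# ascii() shows the decoded characters as escapes without needing a UTF-8 console.
print(ascii(data))
print(len(raw), "bytes became", len(data), "characters")
```

The decoded string is shorter than the byte string because each \xc2\xa0 pair and each \xe2\x80\x9c / \xe2\x80\x9d triple collapses into a single character.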

Answer 2

These are string escapes. They represent a character by its hexadecimal value. For example, \x24 is 0x24, which is the dollar sign.

>>> '\x24'
'$'
>>> chr(0x24)
'$'

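One such escape (from the ones you provided) is \xc2. Read on its own as Latin-1 it is Â, a capital letter A with a circumflex. In your data, though, it is the first byte of the two-byte UTF-8 sequence \xc2\xa0, which encodes a no-break space.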
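
To see what the escapes in the question stand for, the following Python 3 sketch decodes each byte sequence and prints its Unicode name. The list of sequences is taken from the question's sample; the variable names are just illustrative.

```python
import unicodedata

# Byte sequences that appear in the question's output.
sequences = [b"\xc2\xa0", b"\xe2\x80\x9c", b"\xe2\x80\x9d"]

for seq in sequences:
    char = seq.decode("utf-8")  # each sequence decodes to a single character
    print(seq, "->", f"U+{ord(char):04X}", unicodedata.name(char))

# Read alone as Latin-1, the byte 0xC2 is the letter A with a circumflex ...
lone = b"\xc2".decode("latin-1")
# ... but the pair 0xC2 0xA0 is one UTF-8 character, the no-break space.
pair = b"\xc2\xa0".decode("utf-8")
```

So the same byte can mean different things depending on the encoding used to interpret it, which is why the encoding has to be stated when the file is opened.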

Answer 3

The following code solves the problem (Python 2, where path is a byte string):

path.decode('utf-8','ignore').strip()
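
For Python 3, a comparable sketch operates on a bytes object. The sample bytes, including the deliberately invalid trailing 0xFF, are assumptions added to show what 'ignore' does; this is not the answerer's exact code.

```python
raw = b"\xc2\xa0\xc2\xa0 chapter 1 \xe2\x80\x9chey\xe2\x80\x9d \xff"

# Strict decoding (the default) raises on the invalid byte 0xFF.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops bytes that are not valid UTF-8.
text = raw.decode("utf-8", "ignore")

# str.strip() treats the no-break space (U+00A0) as whitespace,
# so the leading \xa0 characters and the trailing space are removed.
cleaned = text.strip()
print(ascii(cleaned))
```

Note that 'ignore' discards undecodable bytes without warning, so it can hide real data corruption; it is worth using only when losing those bytes is acceptable.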

Answer 4

    def main():
        args = parse_args()
        if args.file:
            # To clean \xc2\xa0 \xc2\xa0… in text data (Python 2 byte string)
            file_to_read = args.file.decode('utf-8', 'ignore').strip()
            # Use a context manager so the file is closed after reading
            with open(file_to_read, "r+") as f:
                text_from_file = f.read()
        else:
            text_from_file = sys.argv[1]
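
If the goal is to replace these characters in the file's contents rather than only decode them, one possible Python 3 approach is sketched below. The names REPLACEMENTS, clean_text and read_clean_text are hypothetical, and the choice of which characters to replace is an assumption to adjust as needed.

```python
# Hypothetical mapping: which characters to replace is an assumption.
REPLACEMENTS = str.maketrans({
    "\u00a0": " ",   # no-break space -> ordinary space
    "\u201c": '"',   # left curly quote -> straight quote
    "\u201d": '"',   # right curly quote -> straight quote
})


def clean_text(text):
    """Replace a few typographic characters with ASCII and trim whitespace."""
    return text.translate(REPLACEMENTS).strip()


def read_clean_text(path):
    """Read a UTF-8 encoded file and return its cleaned contents."""
    with open(path, "r", encoding="utf-8") as f:
        return clean_text(f.read())


print(clean_text("\u00a0\u00a0 \u201chey, jack\u201d"))
```

Keeping the cleaning step in its own function makes it usable both for file contents and for text passed on the command line.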

How to clean \xc2\xa0 \xc2\xa0 ..... in text data
