Question

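Using Python 3, I want to os.walk a directory of files, read them into a binary object (a string?), and do some further processing on them. First step, though: how do I read the files that os.walk returns?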

# NOTE: Execute with python3.2.2

import os
import sys

path = "/home/user/my-files"

count = 0
successcount = 0
errorcount = 0
i = 0

#for directory in dirs
for (root, dirs, files) in os.walk(path):
 # print (path)
 print (dirs)
 #print (files)

 for file in files:

   base, ext = os.path.splitext(file)
   fullpath = os.path.join(root, file)

   # Read the file into binary? --------
   input = open(fullpath, "r")
   content = input.read()
   length = len(content)
   count += 1
   print ("    file: ---->",base," / ",ext," [count:",count,"]",  "[length:",length,"]")
   print ("fullpath: ---->",fullpath)

Error:

Traceback (most recent call last):
  File "myFileReader.py", line 41, in <module>
    content = input.read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

Answer 1

To read a binary file, you must open it in binary mode. Change

input = open(fullpath, "r")

to

input = open(fullpath, "rb")

The result of read() will then be a bytes object.
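As a minimal sketch of how that change fits into the walk from the question, the helper below opens every file in binary mode and closes it again with a with block. The function name read_tree_as_bytes and the variable name handle are made up for this illustration; they are not from the original code.

```python
import os


def read_tree_as_bytes(top):
    """Yield (fullpath, content) for every file under top, with content as bytes."""
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            # "rb" skips text decoding entirely, so any byte sequence is accepted.
            with open(fullpath, "rb") as handle:
                yield fullpath, handle.read()
```

With the path from the question, a loop such as "for fullpath, content in read_tree_as_bytes(path): print(fullpath, len(content))" would print each file name and its size in bytes. Using a name other than input also avoids shadowing the built-in input() function.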

Answer 2

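Because some of your files are binary, they cannot be decoded into the Unicode characters that Python 3 uses for all strings in the interpreter. One of the big changes between Python 2 and Python 3 is that the default string type moved from byte strings to Unicode text, so a character can no longer be treated as simply one byte. (Text in Python 3 can also take more memory than the same data held as bytes: in Python 3.2 each character is stored as 2 or 4 bytes, depending on how the interpreter was built. UTF-8, which uses 1 to 4 bytes per character, is an encoding for data on disk, not the in-memory representation.)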
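A small illustration of the difference between characters and bytes, and of why arbitrary binary data fails to decode. The sample values here are chosen only for the demonstration.

```python
text = "caf\u00e9"            # a str of four characters
data = text.encode("utf-8")   # the same text as bytes; the last character needs two bytes

print(len(text), len(data))   # 4 5
print(data.decode("utf-8") == text)  # True

# Bytes that do not form valid UTF-8 cannot be turned into a str.
try:
    b"\xe2\x28".decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc)
```

The byte 0xe2 announces a multi-byte UTF-8 sequence, so the decoder expects continuation bytes after it. In a binary file the following bytes are usually something else, which is the same kind of failure as in the traceback from the question.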

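So you have a few options, depending on your project:

Ignore binary files by filtering on file extension.
Try to read every file, catch the decoding exception when it occurs, and skip that file; or use one of the methods described in the thread "How can I detect if a file is binary (non-text) in python?"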

In the latter case, you can edit your solution to catch UnicodeDecodeError and skip the file.
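One way the skip-on-error approach could look, as a sketch rather than a definitive implementation. The function name read_text_files and its return shape are assumptions made for this example.

```python
import os


def read_text_files(top, encoding="utf-8"):
    """Return (texts, skipped): decoded contents keyed by path, and paths that failed to decode."""
    texts = {}
    skipped = []
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            try:
                with open(fullpath, "r", encoding=encoding) as handle:
                    texts[fullpath] = handle.read()
            except UnicodeDecodeError:
                # Not valid text in this encoding; treat the file as binary and move on.
                skipped.append(fullpath)
    return texts, skipped
```

Keep in mind that this only detects files that fail to decode. A binary file whose bytes happen to be valid in the chosen encoding would still be returned as text, so filtering by extension first can be a useful complement.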

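Whichever you choose, note that if the files on your system use a wide variety of character encodings, you will need to specify the encoding explicitly. When no encoding is given, Python 3 uses the locale's preferred encoding, which was UTF-8 in the traceback above.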
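A short sketch of passing the encoding explicitly, using the encoding and errors parameters of the built-in open(). The helper names default_text_encoding and read_text are hypothetical, and the statement that open() falls back to the locale's preferred encoding reflects typical behaviour that can vary with interpreter version and settings.

```python
import locale


def default_text_encoding():
    """Encoding that text-mode open() typically uses when none is given (locale dependent)."""
    return locale.getpreferredencoding(False)


def read_text(path, encoding="utf-8", errors="strict"):
    """Read a text file with an explicit encoding and error-handling policy."""
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()
```

For example, read_text(fullpath, encoding="latin-1") reads a file known to be Latin-1, while read_text(fullpath, errors="replace") substitutes U+FFFD for undecodable bytes instead of raising UnicodeDecodeError. The second form loses information, so it suits inspection more than further processing.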

For reference, here is an excellent presentation on Python 3 I/O: http://www.dabeaz.com/python3io/MasteringIO.pdf
