Question

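Hello, I have a tar file containing files named 0_data, 0_index, and so on. What I want to do is open the tar file and read the contents of those files. So far, all I can do is extract all the files; what I cannot do is read the contents of the individual files. I know they are not plain-text files, but if I can't see what is in them, how am I supposed to parse files that hold a bunch of web pages?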

The error I get when I try to open a file is:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 87: character maps to <undefined>

Here is my code:

import os
import tarfile

def is_tarfile(file):
    return tarfile.is_tarfile(file)

def extract_tarfile(file):
    if is_tarfile(file):
        my_tarfile=tarfile.open(file)
        my_tarfile.extractall("c:/untar")
        read_files_nz2("c:/untar/nz2_merged");
        return 1
    return 0

def read_files_nz2(file):
    for subdir, dirs, files in os.walk(file):
        for i in files:
             path = os.path.join(subdir,i)
             print(path)
             content=open(path,'r')
             print (content.read())

extract_tarfile("c:/nz2.tar")

print(path) outputs the file path, but print(content.read()) raises an error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 87: character maps to <undefined>

I hope someone can help me read the data in these files.

Answer 1

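I'm not 100% sure this is your problem, but it is bad practice at the very least, and it may be the root of your problem.

You never close any of the files you open. For example, you have: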

my_tarfile=tarfile.open(file)

But somewhere after that, before you open another file, you should have:

my_tarfile.close()

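Here is a quote from diveintopython:

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.

My thinking is that because you never close my_tarfile, the system cannot properly read the files extracted from it. Even if that is not the problem, it is best to close files as soon as you can.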
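
The advice above can be sketched with `with` blocks, which close each file automatically. This is only an illustration: the function name `extract_and_read` and its parameters are hypothetical, and the extracted files are opened in binary mode as an assumption so that the sketch also works on non-text data.

```python
import os
import tarfile


def extract_and_read(tar_path, dest_dir):
    """Extract tar_path into dest_dir and return {file path: raw bytes}."""
    # The with-block closes the archive as soon as extraction finishes,
    # even if extractall raises.
    with tarfile.open(tar_path) as my_tarfile:
        my_tarfile.extractall(dest_dir)

    contents = {}
    for subdir, _dirs, files in os.walk(dest_dir):
        for name in files:
            path = os.path.join(subdir, name)
            # Binary mode is an assumption so that non-text data can be read.
            with open(path, "rb") as content:
                contents[path] = content.read()
    return contents
```

With the paths from the question, a call might look like `extract_and_read("c:/nz2.tar", "c:/untar")`.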

Answer 2

You need the full file path to access a file, not just its name. Your second function should be:

def read_files_nz2(file):
    for subdir, dirs, files in os.walk(file):
        for i in files:
            path = os.path.join(subdir, i)  # Getting full path to the file
            content = open(path, 'r')
            print(content.read())

Answer 3

You need to do one of two things:

Specify the encoding when you open the file:

# This is probably not the right encoding.
content = open(path, 'r', encoding='utf-8')

For this, you need to know what the file's encoding is.

Open the file in binary mode:
```
content = open(path, 'rb')
```
This makes read return a bytes object instead of a string, but it avoids any attempt to decode or validate the individual bytes.
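
The two options can be combined in one small helper, sketched here under assumptions: `read_file` is a hypothetical name, and falling back to the raw bytes when decoding fails is an illustrative choice, not something this answer prescribes.

```python
def read_file(path, encoding=None):
    """Return decoded text if encoding is given and valid, else raw bytes."""
    # Always read in binary mode, so no decoding happens implicitly.
    with open(path, "rb") as fh:
        raw = fh.read()

    if encoding is None:
        return raw

    try:
        # Decode explicitly with the encoding the caller believes is correct.
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Wrong guess: hand back the untouched bytes instead of failing.
        return raw
```

Checking the type of the result (`str` or `bytes`) tells the caller whether the guessed encoding actually fit the data.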

Answer 4

I'm not sure what the problem is, but the same thing happened to me, and it was solved by using this encoding:

with open(ff_name, 'rb') as source_file:
  with open(target_file_name, 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))

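Another good approach is to rewrite the file as UTF-8, which is what the snippet above does: it decodes the bytes as UTF-16 and re-encodes them as UTF-8.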
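
The re-encoding step in this answer can be wrapped in a function so that it runs as-is. This is a sketch under assumptions: `reencode_to_utf8` and its parameter names are hypothetical, and UTF-16 as the source encoding is simply carried over from the answer's code; it will not fit every file.

```python
def reencode_to_utf8(source_path, target_path, source_encoding="utf-16"):
    """Rewrite source_path as UTF-8 at target_path."""
    # Read the raw bytes so nothing is decoded implicitly.
    with open(source_path, "rb") as source_file:
        contents = source_file.read()

    # Decode with the assumed source encoding; this raises
    # UnicodeDecodeError if the assumption is wrong.
    text = contents.decode(source_encoding)

    with open(target_path, "wb") as dest_file:
        dest_file.write(text.encode("utf-8"))
```

Decoding before the target file is opened means a wrong encoding guess fails without leaving an empty output file behind.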
