Question

I am trying to convert a binary file into a readable format but have not been able to. Please suggest how this can be done.

$ file test.docx
test.docx: Microsoft Word 2007+
$ file -i test.docx
test.docx: application/msword; charset=binary
$

>>> raw = codecs.open('test.docx', encoding='ascii').readlines()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 694, in readlines
    return self.reader.readlines(sizehint)
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 603, in readlines
    data = self.read()
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 18: ordinal not in range(128)

Answer 1

You have to read it in binary mode:

import binascii
with open('test.docx', 'rb') as f: # 'rb' stands for read binary
    hexdata = binascii.hexlify(f.read()) # convert to hex
    print(hexdata)

Answer 2

Try the following code (see Working with Binary Data):

with open("test_file.docx", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)

    # Seek position and read N bytes
    binary_file.seek(0)  # Go to beginning
    couple_bytes = binary_file.read(2)
    print(couple_bytes)
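
For a .docx, those first bytes are worth a look: Office 2007+ files are ZIP containers, and a non-empty ZIP archive starts with the signature PK\x03\x04. Here is a minimal sketch of that check. The helper name looks_like_zip is made up for illustration, and it only inspects the first four bytes, so it is a hint rather than a full validation.

```python
# Signature found at the start of a non-empty ZIP archive (local file header).
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def looks_like_zip(path):
    """Return True if the file at `path` begins with the ZIP signature.

    Hypothetical helper: it checks only the first four bytes.
    """
    with open(path, "rb") as binary_file:
        return binary_file.read(4) == ZIP_LOCAL_HEADER
```

If looks_like_zip('test.docx') returns True, the file can be opened with ZIP tools instead of being decoded as text.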

Answer 3

I don't think the other answers address the question, at least not the part that @ankitpandey clarified in a comment about catdoc returning an error:

"catdoc, and then the error is: This file looks like a ZIP archive or an Office 2007 or later file. Not supported by catdoc"

I just ran into the same problem with catdoc and found a solution that works for me.

The mention of a ZIP archive was the clue. I was able to run the following command:

unzip  -q -c 'test.docx' word/document.xml  | python etree.py

This extracts the text portion of test.docx to stdout.

Put the following Python code in etree.py:

from lxml import etree
import sys

xml = sys.stdin.read().encode('utf-8')
root = etree.fromstring(xml)

bits_of_text = root.xpath('//text()')
# print(bits_of_text)  # Note that some bits are whitespace-only
joined_text = ' '.join(
    bit.strip() for bit in bits_of_text
    if bit.strip() != ''
)
print(joined_text)
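
If unzip or lxml is not available on the server, the same idea can be sketched with only the standard library. This is not the method used above, just an alternative under a few assumptions: Python 3, a document whose body lives in word/document.xml, and text stored in w:t elements inside w:p paragraphs. The function name docx_to_text is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per non-empty paragraph.

    Hypothetical helper: `path` may be a file name or a binary file object.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text; join the runs of a paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Calling print(docx_to_text('test.docx')) should then print the document text. Headers, footers, and footnotes live in other parts of the archive and are not covered by this sketch.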

How to convert a binary file into a readable format on a Linux server
