I am trying to convert a binary file into a readable format but have not been able to. Please suggest how this can be done.
$ file test.docx
test.docx: Microsoft Word 2007+
$ file -i test.docx
test.docx: application/msword; charset=binary
$
>>> raw = codecs.open('test.docx', encoding='ascii').readlines()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 694, in readlines
    return self.reader.readlines(sizehint)
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 603, in readlines
    data = self.read()
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 18: ordinal not in range(128)
Answer 0 (score: 0)
You have to read it in binary mode:
import binascii
with open('test.docx', 'rb') as f:  # 'rb' stands for read binary
    hexdata = binascii.hexlify(f.read())  # convert to hex
    print(hexdata)
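To illustrate why the ascii codec fails where a binary read does not, here is a minimal sketch. The sample bytes and the helper name try_ascii are made up for illustration and are not taken from test.docx.

```python
import binascii

# Hypothetical sample: a ZIP-style header followed by a non-ASCII byte (0x93).
sample = b'PK\x03\x04\x93'

def try_ascii(data):
    """Return the decoded text, or None if the bytes are not pure ASCII."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError:
        return None

ascii_result = try_ascii(sample)        # decoding fails: 0x93 is outside range(128)
hex_result = binascii.hexlify(sample)   # hexlify accepts any byte value
print(ascii_result)
print(hex_result)
```

Note that the hex dump only avoids the decoding error; it does not by itself make the document text readable.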
Answer 1 (score: 0)
Try the following code, from "Working with Binary Data":
with open("test_file.docx", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)
    # Seek position and read N bytes
    binary_file.seek(0)  # Go to beginning
    couple_bytes = binary_file.read(2)
    print(couple_bytes)
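Reading the first few bytes this way is also a quick check of what kind of file you have. A .docx file is a ZIP container, so it should start with the ZIP local file header signature. A rough sketch follows; the helper name looks_like_zip and the throwaway demo file are my own, used so the snippet runs without a real test.docx.

```python
import os
import tempfile
import zipfile

ZIP_MAGIC = b'PK\x03\x04'  # local file header signature of a ZIP archive

def looks_like_zip(path):
    """Return True if the file starts with the ZIP local file header signature."""
    with open(path, 'rb') as binary_file:
        return binary_file.read(4) == ZIP_MAGIC

# Demo on a throwaway archive (hypothetical stand-in for a real .docx).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.docx')
with zipfile.ZipFile(demo_path, 'w') as archive:
    archive.writestr('word/document.xml', '<doc/>')

print(looks_like_zip(demo_path))
```

The standard library's zipfile.is_zipfile(path) performs a more thorough check if you prefer not to compare bytes by hand.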
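Answer 2 (score: 0)
I don't think the other answers address the question, at least not the part that @ankitpandey clarified in a comment about catdoc returning an error:
"catdoc, and then the error is: This file looks like ZIP archive or Office 2007 or later file. Not supported by catdoc"
I just ran into the same problem with catdoc and found a solution that works for me.
The mention of a .zip archive is the clue. I was able to run the following command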
unzip -q -c 'test.docx' word/document.xml | python etree.py
to extract the text portion of test.docx to stdout,
with the following Python code placed in etree.py:
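from lxml import etree
import sys
# Read raw bytes so lxml can honor the encoding declared in the XML itself.
# sys.stdin.buffer exists on Python 3; on Python 2, sys.stdin already yields bytes.
xml = getattr(sys.stdin, 'buffer', sys.stdin).read()
root = etree.fromstring(xml)
bits_of_text = root.xpath('//text()')
# print(bits_of_text)  # Note that some bits are whitespace-only
joined_text = ' '.join(
    bit.strip() for bit in bits_of_text
    if bit.strip() != ''
)
print(joined_text)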
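If you would rather not depend on unzip and lxml, roughly the same idea can be sketched with only the standard library. This is a variation on the approach above, not a tested replacement. The function name docx_text and the hand-written demo XML are my own, and it only collects w:t text runs from word/document.xml, so headers, footnotes and the like are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:p (paragraph) and w:t (text run) elements.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_text(source):
    """Return the body text of a .docx given a path or a file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read('word/document.xml')
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + 'p'):
        runs = [node.text for node in paragraph.iter(W_NS + 't') if node.text]
        if runs:
            paragraphs.append(''.join(runs))
    return '\n'.join(paragraphs)

# Demo on an in-memory archive holding a minimal, hand-written document.xml
# (a hypothetical stand-in for a real test.docx).
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', DEMO_XML)
buffer.seek(0)
print(docx_text(buffer))
```

For a real file you would call docx_text('test.docx'). Joining the runs per paragraph keeps words that Word split across several runs together, which the space-joined version above does not.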