我是python的新手,我们一直试图在程序中使用GIT中的lzw代码。 https://github.com/joeatwork/python-lzw/blob/master/lzw/init.py
如果我们的blob较小,则效果很好,但如果blob大小增加,则不会解压缩blob。因此,我一直在阅读文档,但无法理解以下内容,这可能是完整的Blob无法解压缩的原因。
我还附上了我正在使用的python代码带。
Our control codes are
- CLEAR_CODE (codepoint 256). When this code is encountered, we flush
the codebook and start over.
- END_OF_INFO_CODE (codepoint 257). This code is reserved for
encoder/decoders over the integer codepoint stream (like the
mechanical bit that unpacks bits into codepoints)
When dealing with bytes, codes are emitted as variable
length bit strings packed into the stream of bytes.
codepoints are written with varying length
- initially 9 bits
- at 512 entries 10 bits
- at 1025 entries at 11 bits
- at 2048 entries 12 bits
- with max of 4095 entries in a table (including Clear and EOI)
code points are stored with their MSB in the most significant bit
available in the output character.
def decompress_without_eoi(buf):
# Decompress LZW into a bytes, ignoring End of Information code
def gen():
try:
for byte in lzw.decompress(buf):
yield byte
except ValueError as exc:
#print(repr(exc))
if 'End of information code' in repr(exc):
#print('Ignoring EOI error..\n')
pass
else:
raise
return
try:
#print('Trying a join..\n')
deblob = b''.join(gen())
except Exception as exc2:
#print(repr(exc2))
#print('Trying byte by byte..')
deblob=[]
try:
for byte in gen():
deblob.append(byte)
except Exception as exc3:
#print(repr(exc3))
return b''.join(deblob)
return deblob
#current function to deblob
def deblob3(row):
if pd.notnull(row[0]):
blob = row[0]
h = html2text.HTML2Text()
h.ignore_links=True
h.ignore_images = True #zzzz
if type(blob) != bytes:
blobbytes = blob.read()[:-10]
else:
blobbytes = blob[:-10]
if row[1]==361:
# If compressed, return up to EOI-257 code, which is last non-null code before tag
# print (row[0])
return h.handle(striprtf(decompress_without_eoi(blobbytes)))
elif row[1]==360:
# If uncompressed, return up to tag
return h.handle(striprtf(blobbytes))
已按以下方式调用此功能
nf['IS_BLOB'] = nf[['IS_BLOB','COMPRESSION']].apply(deblob3,axis=1)