Reading a fixed-width file with binary data into Pandas

Time: 2018-09-07 14:29:32

Tags: python pandas

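I am trying to read some fixed-width data from an IBM mainframe into Pandas. The fields are stored as a mix of EBCDIC text, numbers saved in binary (i.e. 255 stored as 0xFF), and binary coded decimal (i.e. 255 stored as 0x02FF). I know the lengths and types of the fields ahead of time.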

Can read_fwf handle this kind of data? Is there a better option?

Example: the file I am trying to read contains an arbitrary number of records with the following structure.

import tempfile

databin = 0xF0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908

#column 1 -- ten bytes, EBCDIC.  Should be 0315139925.
#column 2 -- four bytes, binary number.  Should be 180534149.
#column 3 -- ten characters, EBCDIC.  Should be 04/20/2018.
#column 4 -- twenty six characters, EBCDIC.  Should be 2018-03-23-15.45.54.592918.
#column 5 -- four bytes in this sample, packed binary coded decimal.  Should be 4908.  I know the scale ahead of time.

rawbin = databin.to_bytes((databin.bit_length() + 7) // 8, 'big') or b'\0'

with tempfile.TemporaryFile() as fp:
    fp.write(rawbin)

1 answer:

Answer 0: (score: 1)

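I think you will most likely have to write something that goes through the file record by record; I don't think pandas can do this as-is. The components can be broken down as follows (for the BCD part I shamelessly copied and pasted from How to split a byte string into separate bytes in python):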

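def bcdDigits(chars):
    # chars: iterable of single-byte values, e.g. [B[i:i+1] for i in range(50, 54)]
    for char in chars:
        char = ord(char)
        # split the byte into its high and low nibbles
        for val in (char >> 4, char & 0xF):
            if val == 0xF:
                # a 0xF nibble ends the digit sequence
                return
            yield val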


In [40]: B
Out[40]: b'\xf0\xf3\xf1\xf5\xf1\xf3\xf9\xf9\xf2\xf5\n\xc2\xbb\x85\xf0\xf4a\xf2\xf0a\xf2\xf0\xf1\xf8\xf2\xf0\xf1\xf8`\xf0\xf3`\xf2\xf3`\xf1\xf5K\xf4\xf5K\xf5\xf4K\xf5\xf9\xf2\xf9\xf1\xf8\x00\x00I\x08'

In [41]: import codecs

In [43]: codecs.decode(B[0:10], "cp500")
Out[43]: '0315139925'

In [44]: int.from_bytes(B[10:14], byteorder='big')
Out[44]: 180534149

In [45]: codecs.decode(B[14:24], "cp500")
Out[45]: '04/20/2018'

In [46]: codecs.decode(B[24:50], "cp500")
Out[46]: '2018-03-23-15.45.54.592918'

In [48]: list(bcdDigits([B[i: i+1] for i in range(50, 54)]))
Out[48]: [0, 0, 0, 0, 4, 9, 0, 8]
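
The slices above can also be written as a single struct format string, which may be easier to maintain once there are many fields. This is only a minimal sketch: it assumes the 54-byte layout of the sample record (10 + 4 + 10 + 26 + 4 bytes) and that the binary number is an unsigned big-endian integer. The names RECORD and split_record are made up for illustration.

```python
import struct

# Assumed layout, taken from the sample record in the question:
# 10 bytes EBCDIC, 4-byte big-endian unsigned int (signedness is a guess),
# 10 bytes EBCDIC, 26 bytes EBCDIC, 4 bytes packed BCD (kept as raw bytes here).
RECORD = struct.Struct(">10sI10s26s4s")

def split_record(raw):
    """Split one fixed-width record into fields (hypothetical helper)."""
    ident, number, date, stamp, bcd = RECORD.unpack(raw)
    return (
        ident.decode("cp500"),   # EBCDIC text -> str
        number,                  # already an int after unpack
        date.decode("cp500"),
        stamp.decode("cp500"),
        bcd,                     # decode separately, e.g. with bcdDigits
    )

sample = bytes.fromhex(
    "F0F3F1F5F1F3F9F9F2F5" "0AC2BB85" "F0F461F2F061F2F0F1F8"
    "F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F8" "00004908"
)
fields = split_record(sample)
```

RECORD.size gives the record length in bytes (54 for this layout), which is handy when cutting a file into records.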

Note: for the last piece, if you want an integer in return:

In [63]: import numpy as np

In [64]: (list(bcdDigits([B[i: i+1] for i in range(50, 54)])) * (10 ** np.arange(8)[::-1])).sum()
Out[64]: 4908
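
The same integer can be obtained without numpy by folding the nibbles directly, and the record-by-record loop mentioned at the start can be a small generator. The following is a sketch under assumptions, not a definitive implementation: bcd_to_int and iter_records are hypothetical helper names, and the 0xF stop rule simply mirrors bcdDigits above (other sign nibbles are not handled).

```python
import io

def bcd_to_int(raw):
    """Fold packed-BCD nibbles into an int; like bcdDigits, stop at a 0xF nibble."""
    value = 0
    for byte in raw:  # iterating over bytes yields ints in Python 3
        for nibble in (byte >> 4, byte & 0xF):
            if nibble == 0xF:
                return value
            value = value * 10 + nibble
    return value

def iter_records(stream, size):
    """Yield consecutive fixed-size chunks from a binary stream (hypothetical helper)."""
    while True:
        chunk = stream.read(size)
        if len(chunk) < size:
            break  # end of data; a trailing partial record is dropped
        yield chunk

# Two copies of the 4-byte BCD field stand in for a file of fixed-size records.
stream = io.BytesIO(bytes.fromhex("00004908") * 2)
values = [bcd_to_int(chunk) for chunk in iter_records(stream, 4)]
```

For the real file, the chunk size would be the full record length, and each chunk would be decoded field by field; the resulting list of rows can then be passed to pandas.DataFrame. If the scale is known, decimal.Decimal(value).scaleb(-scale) shifts the decimal point accordingly.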