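I am trying to read some fixed-width data from an IBM mainframe into Pandas. The fields are stored in EBCDIC, with numbers saved as binary (i.e. 255 stored as 0xFF) and as binary-coded decimal (i.e. 255 stored as 0x02FF). I know the lengths and types of the fields.
Can read_fwf handle this kind of data? Is there a better option?
Example: the structure I am trying to read, which can contain an arbitrary number of records.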
import tempfile
databin = 0xF0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908
#column 1 -- ten bytes, EBCDIC. Should be 0315139925.
#column 2 -- four bytes, binary number. Should be 180534149.
#column 3 -- ten characters, EBCDIC. Should be 04/20/2018.
#column 4 -- twenty six characters, EBCDIC. Should be 2018-03-23-15.45.54.592918.
#column 5 -- four bytes, packed binary coded decimal. Should be 4908. I know the scale ahead of time.
rawbin = databin.to_bytes((databin.bit_length() + 7) // 8, 'big') or b'\0'
with tempfile.TemporaryFile() as fp:
    fp.write(rawbin)
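Answer 0 (score: 1)
Most likely you will have to write something that goes through the data record by record; I don't think pandas can read this format as is. The individual components can be decoded as follows (for the BCD part I shamelessly copied and pasted from How to split a byte string into separate bytes in python):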
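def bcdDigits(chars):
    # chars: iterable of single-byte bytes objects, e.g. [b'\x49', b'\x08']
    for char in chars:
        char = ord(char)
        # each byte holds two decimal digits: high nibble first, then low nibble
        for val in (char >> 4, char & 0xF):
            if val == 0xF:
                # an 0xF nibble is treated as the end of the digits
                return
            yield val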
In [40]: B
Out[40]: b'\xf0\xf3\xf1\xf5\xf1\xf3\xf9\xf9\xf2\xf5\n\xc2\xbb\x85\xf0\xf4a\xf2\xf0a\xf2\xf0\xf1\xf8\xf2\xf0\xf1\xf8`\xf0\xf3`\xf2\xf3`\xf1\xf5K\xf4\xf5K\xf5\xf4K\xf5\xf9\xf2\xf9\xf1\xf8\x00\x00I\x08'
In [41]: import codecs
In [43]: codecs.decode(B[0:10], "cp500")
Out[43]: '0315139925'
In [44]: int.from_bytes(B[10:14], byteorder='big')
Out[44]: 180534149
In [45]: codecs.decode(B[14:24], "cp500")
Out[45]: '04/20/2018'
In [46]: codecs.decode(B[24:50], "cp500")
Out[46]: '2018-03-23-15.45.54.592918'
In [48]: list(bcdDigits([B[i: i+1] for i in range(50, 54)]))
Out[48]: [0, 0, 0, 0, 4, 9, 0, 8]
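Since the question says the scale is known ahead of time, the digits can also be folded straight into a scaled value. The following is only a sketch: `bcd_to_decimal` and its `scale` parameter are hypothetical names, and it assumes unsigned BCD with two digits per byte, as in the sample, with an 0xF nibble ending the digits the same way bcdDigits does.

```python
from decimal import Decimal


def bcd_to_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode unsigned BCD bytes (two digits per byte) into a Decimal.

    `scale` is the assumed number of digits after the implied decimal point.
    """
    value = 0
    for byte in raw:
        # high nibble first, then low nibble
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble == 0xF:
                # same convention as bcdDigits: 0xF ends the digits
                return Decimal(value).scaleb(-scale)
            if nibble > 9:
                raise ValueError(f"not a BCD digit: {nibble:#x}")
            value = value * 10 + nibble
    return Decimal(value).scaleb(-scale)


field = bytes.fromhex("00004908")      # the last four bytes of the sample record
print(bcd_to_decimal(field))           # 4908
print(bcd_to_decimal(field, scale=2))  # 49.08
```

Using Decimal rather than float keeps the value exact, which usually matters for mainframe amounts.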
Note: for the last chunk, if you want an integer back:
In [63]: import numpy as np
In [64]: (list(bcdDigits([B[i: i+1] for i in range(50, 54)])) * (10 ** np.arange(8)[::-1])).sum()
Out[64]: 4908
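Putting the pieces together, a record-by-record reader could look roughly like the sketch below. It uses the standard-library struct module to cut each 54-byte record into its fields. The column names `col1` to `col5` and the helper names are made up for illustration, the binary field is assumed to be unsigned and big-endian (as in the int.from_bytes call above), and the BCD field is assumed to be unsigned.

```python
import struct

# Layout from the question: 10 + 4 + 10 + 26 + 4 = 54 bytes per record.
# ">" = big-endian with no padding, "I" = unsigned 4-byte int, "Ns" = N raw bytes.
RECORD = struct.Struct(">10sI10s26s4s")


def bcd_to_int(raw):
    """Fold unsigned BCD bytes into an int; an 0xF nibble ends the digits."""
    value = 0
    for byte in raw:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble == 0xF:
                return value
            value = value * 10 + nibble
    return value


def parse_records(data):
    """Return one dict per fixed-width record in `data`."""
    rows = []
    # iter_unpack raises struct.error if len(data) is not a multiple of 54
    for c1, c2, c3, c4, c5 in RECORD.iter_unpack(data):
        rows.append({
            "col1": c1.decode("cp500"),   # EBCDIC text
            "col2": c2,                   # binary integer, already decoded
            "col3": c3.decode("cp500"),
            "col4": c4.decode("cp500"),
            "col5": bcd_to_int(c5),       # BCD digits as an int
        })
    return rows


sample_hex = (
    "F0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8"
    "F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908"
)
rows = parse_records(bytes.fromhex(sample_hex) * 2)  # two identical records
print(rows[0])
```

The resulting list of dicts can be handed to pandas.DataFrame(rows) to get one row per record.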