Question

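To cut a long story short, I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, then dump into pandas).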

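I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, because compiling the lzma backports for Python 2.7 turned into a nightmare. So I gave in and ran with Python 3, but with no success. The problems are too numerous to list here, and no one reads long questions!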

I have included one of these .bi5 files. If someone could manage to get it into a pandas DataFrame and show me how they did it, that would be ideal.

PS: the file is only a few KB, so it will download in a second. Thanks very much in advance.

(The file) http://www.filedropper.com/13hticks

Answer 1

The code below should do the trick. First it opens the file and decompresses it with lzma, then it uses struct to unpack the binary data.

import lzma
import struct
import pandas as pd


def bi5_to_df(filename, fmt):
    chunk_size = struct.calcsize(fmt)
    data = []
    with lzma.open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break
    df = pd.DataFrame(data)
    return df

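The most important thing is to know the right format. I googled around and tried a few guesses, and '>3i2f' (or '>3I2f') works quite well. (That is big-endian, 3 ints and 2 floats. The format you suggested, 'i4f', does not produce sensible floats, whether read as big-endian or little-endian.) For struct and its format syntax, see the docs.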

df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]: 
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00
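To make the format string concrete, here is a minimal sketch that packs one synthetic record (the values are made up for illustration, not read from a real file) and unpacks it again. It shows that '>3I2f' describes 20 bytes per record and that reading the same bytes with the wrong byte order gives nonsense.

```python
import struct

FMT = ">3I2f"  # big-endian: three unsigned 32-bit ints, two 32-bit floats

# Synthetic record for illustration only (not taken from a real .bi5 file).
record = struct.pack(FMT, 210, 110218, 110216, 1.5, 2.25)

print(struct.calcsize(FMT))        # 20 bytes per record
print(struct.unpack(FMT, record))  # (210, 110218, 110216, 1.5, 2.25)

# Reading the same bytes as little-endian yields meaningless values,
# which is a quick way to spot a wrong byte-order guess.
print(struct.unpack("<3I2f", record))
```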

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from that repository. The first lines of its output are:

time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

Running bi5_to_df on the same input file gives:

bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]: 
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5

So everything seems to be fine (ninety47's code just reorders the columns).

Also, it is probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
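Based on the comparison above, the raw columns can be turned into something closer to the reference output. The following is only a sketch: the helper name label_ticks, the column names, and the point parameter are my own, the column order (millisecond offset, ask, bid, ask volume, bid volume) is inferred from the two outputs rather than from an official specification, and the hour start has to be supplied by the caller because it is not stored in the file contents shown here.

```python
import pandas as pd

def label_ticks(df, hour_start, point=1000):
    """Name the five raw columns and convert them to readable units.

    Assumptions (inferred from the comparison, not from a specification):
    - column order is ms offset, ask, bid, ask volume, bid volume;
    - prices are integers scaled by `point` (1000 matches the sample
      above; other instruments may need a different factor).
    """
    out = df.copy()
    out.columns = ["ms", "ask", "bid", "ask_vol", "bid_vol"]
    # Absolute time = start of the hour + millisecond offset.
    out["time"] = pd.to_timedelta(out["ms"], unit="ms") + pd.Timestamp(hour_start)
    out["ask"] = out["ask"] / point
    out["bid"] = out["bid"] / point
    return out[["time", "bid", "bid_vol", "ask", "ask_vol"]]

# Synthetic frame mimicking the first two rows of the comparison.
raw = pd.DataFrame([(3581, 131966, 131945, 1.5, 1.5),
                    (5142, 131964, 131943, 1.5, 1.5)])
labelled = label_ticks(raw, "2012-12-03 01:00:00")
print(labelled)
```

With a real file, the input would be the frame returned by bi5_to_df, and hour_start would come from the file's path and name.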

Answer 2

import requests
import struct
from lzma import LZMADecompressor, FORMAT_AUTO

# download the compressed EURUSD 2020/06/15/10h_ticks.bi5 file
res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
print(res.headers.get('content-type'))

rawdata = res.content

decomp = LZMADecompressor(FORMAT_AUTO, None, None)
decompresseddata = decomp.decompress(rawdata)

firstrow = struct.unpack('!IIIff', decompresseddata[0: 20])
print("firstrow:", firstrow)
# firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
# time = 2020/06/15/10h + (1 month, since the URL month is zero-based) + 436 milliseconds

secondrow = struct.unpack('!IIIff', decompresseddata[20: 40])
print("secondrow:", secondrow)
# secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)

# time = 2020/06/15/10h + (1 month) + 537 milliseconds
# ask = 114271 / 100000 = 1.14271
# bid = 114267 / 100000 = 1.14267
# askvolume = 4.31
# bidvolume = 2.25

# note that months in the URL are zero-based: 00 is January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for january
# "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for february

# iterate over all 20-byte records
n_records = len(decompresseddata) // 20
print(len(decompresseddata), n_records)
for i in range(n_records):
    print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
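The zero-based month is easy to get wrong, so here is a small sketch of how the URL and the tick time could be derived from an ordinary datetime. The helper names hour_url and tick_time are hypothetical, and the URL pattern is simply the one used in the code above; I have not checked it against any official documentation.

```python
from datetime import datetime, timedelta

BASE = "https://www.dukascopy.com/datafeed"  # URL root used in the code above

def hour_url(symbol, hour_start):
    """Build the file URL for one hour; the month in the path is zero-based."""
    return "{}/{}/{:04d}/{:02d}/{:02d}/{:02d}h_ticks.bi5".format(
        BASE, symbol, hour_start.year, hour_start.month - 1,
        hour_start.day, hour_start.hour)

def tick_time(hour_start, ms_offset):
    """Absolute tick time = start of the hour + millisecond offset."""
    return hour_start + timedelta(milliseconds=ms_offset)

start = datetime(2020, 7, 15, 10)   # 15 July 2020, 10:00
print(hour_url("EURUSD", start))    # path ends in /2020/06/15/10h_ticks.bi5
print(tick_time(start, 436))        # 2020-07-15 10:00:00.436000
```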

Answer 3

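Did you try using numpy to parse the data before passing it to pandas? It may be the long way round, but it lets you manipulate and clean the data before doing the analysis in pandas, and integrating the two is very straightforward.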
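A rough sketch of what that could look like, assuming the 20-byte big-endian record layout ('>3I2f') discussed in the first answer. The function name ticks_from_bytes and the field names are illustrative, and the input is expected to be bytes that have already been decompressed (for example with lzma.open(filename).read()).

```python
import struct

import numpy as np
import pandas as pd

# Assumed record layout, matching the '>3I2f' struct format;
# the field names are illustrative only.
TICK_DTYPE = np.dtype([
    ("ms", ">u4"), ("ask", ">u4"), ("bid", ">u4"),
    ("ask_vol", ">f4"), ("bid_vol", ">f4"),
])

def ticks_from_bytes(decompressed):
    """Parse already-decompressed .bi5 bytes into a DataFrame."""
    # Raises ValueError if the length is not a multiple of 20 bytes.
    arr = np.frombuffer(decompressed, dtype=TICK_DTYPE)
    # Convert to native byte order before handing the data to pandas.
    native = arr.astype(TICK_DTYPE.newbyteorder("="))
    return pd.DataFrame(native)

# Synthetic two-record buffer for illustration (not real tick data).
sample = struct.pack(">3I2f", 436, 114271, 114268, 1.5, 0.75) * 2
ticks = ticks_from_bytes(sample)
print(ticks)
```

Because the whole buffer is parsed in one call, this avoids the Python-level loop over 20-byte chunks, which may matter when processing many thousands of files.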

Decompress and read Dukascopy .bi5 tick files
