Question

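I want to process Level-2 stock data in pandas. For simplicity, suppose each row holds four kinds of data:

millis: the timestamp, int64
last_price: the last traded price, float64
ask_queue: the volume on the ask side, a fixed-size (200) array of int32
bid_queue: the volume on the bid side, a fixed-size (200) array of int32

This is easy to define as a structured dtype in numpy: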

dtype = np.dtype([
   ('millis', 'int64'), 
   ('last_price', 'float64'), 
   ('ask_queue', ('int32', 200)), 
   ('bid_queue', ('int32', 200))
])

With that, I can access ask_queue and bid_queue like this:

In [17]: data = np.random.randint(0, 100, 1616 * 5).view(dtype)

# compute the average of ask_queue level 5 ~ 10
In [18]: data['ask_queue'][:, 5:10].mean(axis=1)  
Out[18]: 
array([33.2, 51. , 54.6, 53.4, 15. , 37.8, 29.6, 58.6, 32.2, 51.6, 34.4,
       43.2, 58.4, 26.8, 54. , 59.4, 58.8, 38.8, 35.2, 71.2])
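The `randint(...).view(dtype)` trick above depends on the platform's default integer width, so the number of records it yields can differ between machines. A minimal sketch that builds the same kind of array field by field, assuming 20 rows and a seeded generator (both chosen here only for illustration):

```python
import numpy as np

dtype = np.dtype([
    ('millis', 'int64'),
    ('last_price', 'float64'),
    ('ask_queue', ('int32', 200)),
    ('bid_queue', ('int32', 200)),
])

# 8 + 8 + 200*4 + 200*4 bytes per record (no alignment padding by default)
record_size = dtype.itemsize

n_rows = 20                       # illustrative row count
rng = np.random.default_rng(0)    # illustrative seed

data = np.zeros(n_rows, dtype=dtype)
data['millis'] = np.arange(n_rows)
data['last_price'] = rng.random(n_rows)
data['ask_queue'] = rng.integers(0, 100, size=(n_rows, 200), dtype=np.int32)
data['bid_queue'] = rng.integers(0, 100, size=(n_rows, 200), dtype=np.int32)

# Each sub-array field is exposed as a regular 2-D view, so slicing works.
level_mean = data['ask_queue'][:, 5:10].mean(axis=1)
```

Filling the fields explicitly keeps the row count independent of whether the default integer is 32 or 64 bits wide.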

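My question is how to define a DataFrame that holds this data.

There are two solutions here:

A. Set ask_queue and bid_queue as two columns whose values are arrays, as follows: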

In [5]: df = pd.DataFrame(data.tolist(), columns=data.dtype.names)

In [6]: df.dtypes
Out[6]: 
millis          int64
last_price    float64
ask_queue      object
bid_queue      object
dtype: object

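However, this solution has at least two problems:

ask_queue and bid_queue lose the dtype of a 2D array and all of its convenient methods;
performance suffers, since each column becomes an array of objects rather than a 2D array.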
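Both problems can be seen in a small sketch. It assumes a 20-row random `ask` block (illustrative data, not from the post) and stores one 1-D array per cell, which is the layout solution A ends up with:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ask = rng.integers(0, 100, size=(20, 200), dtype=np.int32)

# Solution A layout: one object cell per row, each holding a 1-D array.
df = pd.DataFrame({'millis': np.arange(20, dtype=np.int64)})
df['ask_queue'] = pd.Series(list(ask), index=df.index)

column_dtype = df['ask_queue'].dtype   # object, not int32

# Level slicing needs the 2-D block rebuilt first, which copies every row.
ask_2d = np.stack(df['ask_queue'].to_numpy())
level_mean = ask_2d[:, 5:10].mean(axis=1)
```

The `np.stack` call is the extra step that a true 2-D column would not need.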

B. Flatten ask_queue and bid_queue into 2 * 200 columns:

In [8]: ntype = np.dtype([('millis', 'int64'), ('last_price', 'float64')] + 
   ...:                  [(f'{name}{i}', 'int32') for name in ['ask', 'bid'] for i in range(200)])

In [9]: df = pd.DataFrame.from_records(data.view(ntype))

In [10]: df.dtypes
Out[10]: 
millis          int64
last_price    float64
ask0            int32
ask1            int32
ask2            int32
ask3            int32
ask4            int32
ask5            int32
...

This is better than solution A, but 2 * 200 columns looks redundant.
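One way to make the flattened layout easier to work with is to group the 200 columns per side under a two-level column index, so that a whole side can be selected at once. The following is only a sketch under assumptions: the top-level labels `'ask'` and `'bid'`, the empty second-level label for the scalar columns, and the random data are all illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 20
ask = rng.integers(0, 100, size=(n_rows, 200), dtype=np.int32)
bid = rng.integers(0, 100, size=(n_rows, 200), dtype=np.int32)

# Scalar columns get an empty second-level label.
scalars = pd.DataFrame({
    ('millis', ''): np.arange(n_rows, dtype=np.int64),
    ('last_price', ''): rng.random(n_rows),
})
ask_df = pd.DataFrame(ask, columns=pd.MultiIndex.from_product([['ask'], range(200)]))
bid_df = pd.DataFrame(bid, columns=pd.MultiIndex.from_product([['bid'], range(200)]))

df = pd.concat([scalars, ask_df, bid_df], axis=1)

# Selecting the top-level label returns the 200-column block for that side.
ask_block = df['ask']
level_mean = ask_block.iloc[:, 5:10].mean(axis=1)
```

The frame still has 402 physical columns, but `df['ask']` and `df['bid']` behave much like the 2-D fields of the structured array.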

Is there any solution that can take advantage of numpy's structured dtype? I wonder whether `ExtensionArray` or `ExtensionDtype` could solve this.

Answer 1

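Q: Is there any solution that can take advantage of the structured dtype in numpy?

Working with L2-DoM data is many times more complicated than working with ToB (Top-of-Book) price-feed data alone.
a) The native feed is fast. FIX Protocol or other private data feeds deliver records with hundreds or thousands of L2 DoM changes per millisecond (more during major fundamental events). Both processing and storage must be performance-oriented.
b) Because of item a), any kind of offline analysis has to manipulate and process large data sets efficiently.

Storage preferences
Using numpy-alike syntax preferences
Performance preferences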

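Storage preferences: SOLVED

Given that pandas.DataFrame was set as the preferred storage type, let's respect that, even though the syntax and performance preferences may suffer for it.

Going another way is possible, yet it may introduce unknown refactoring and re-engineering costs that the O/P's operational environment need not bear, or is already unwilling to bear.

Having said this, the feature limitations of pandas have to be part of the design considerations, and all the other steps have to live with them unless this preference is revised at some future time.

`numpy`-alike syntax: SOLVED

This request is sound and clear, as numpy tools are fast and smartly crafted for high-performance number crunching. Given the storage preference already set, we will use a pair of numpy tricks to fit the arrays into a 2D pandas DataFrame at reasonable cost in both the .STORE and .RETRIEVE directions: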

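 # on .STORE (.at avoids chained-indexing assignment):
 testDF.at[aRowIDX, 'ask_DoM'] = ask200.dumps()          # type(ask200) <class 'numpy.ndarray'>

 # on .RETRIEVE (np.loads is deprecated since NumPy 1.15; pickle.loads replaces it, needs `import pickle`):
 L2_ASK = pickle.loads( testDF.at[aRowIDX, 'ask_DoM'] )  # type(L2_ASK) <class 'numpy.ndarray'>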
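The two directions can be tried end to end with a small sketch. It assumes a three-row frame and uses `pickle.dumps` / `pickle.loads` directly; `ndarray.dumps()` returns a pickle of the array as well, and `np.loads` has been deprecated since NumPy 1.15, so the standard `pickle` module is used here. The frame layout and row count are illustrative only.

```python
import pickle

import numpy as np
import pandas as pd

ask200 = np.arange(200, dtype=np.int32)

# One object column that will hold the serialized L2 arrays.
testDF = pd.DataFrame({
    'millis': np.arange(3, dtype=np.int64),
    'ask_DoM': pd.Series([None] * 3, dtype=object),
})

# .STORE: serialize each array into a single cell.
for aRowIDX in range(len(testDF)):
    testDF.at[aRowIDX, 'ask_DoM'] = pickle.dumps(ask200 + aRowIDX)

# .RETRIEVE: rebuild the ndarray from one cell.
L2_ASK = pickle.loads(testDF.at[1, 'ask_DoM'])
level_mean = L2_ASK[5:10].mean()
```

Each cell is an opaque bytes object, so any slicing or aggregation happens only after the .RETRIEVE step, once per row.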

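Performance preferences: TESTED

The net add-on costs of the proposed solution, for both the .STORE and .RETRIEVE directions, were measured as follows:

A one-time cost in the .STORE direction of no less than 70 [us] and no more than ~160 [us] per cell for the given scale of L2_DoM arrays (avg: 78 [us], StDev: 9-11 [us]):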

>>> [ f( [testDUMPs() for _ in range(1000)] ) for f in (np.min,np.mean,np.std,np.max) ]
[72, 79.284, 11.004153942943548, 150]
[72, 78.048, 10.546135548152224, 160]
[71, 78.584,  9.887971227708949, 139]
[72, 76.9,    8.827332496286745, 132]

A repeating cost in the .RETRIEVE direction of no less than 46 [us] and no more than ~123 [us] per cell (avg: 50 [us], StDev: 6-10 [us]):

>>> [ f( [testLOADs() for _ in range(1000)] ) for f in (np.min,np.mean,np.std,np.max) ]
[46, 50.337, 9.655194197943405, 104]
[46, 49.649, 9.462272665697178, 123]
[46, 49.513, 9.504293766503643, 123]
[46, 49.77,  8.367165350344164, 114]
[46, 51.355, 6.162434583831296,  89]

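Even higher performance can be expected from two changes. The first is using the better architecture-aligned int64 data type (yes, at double the storage cost, yet the cost of computation will decide whether this move has a performance edge). The second is using memoryview-based manipulations, which can shave the add-on latency down to about 22 [us].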
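The answer does not show its memoryview-based variant. One possible reading, offered purely as an assumption, is to skip pickle and store the raw buffer, then wrap it again with `np.frombuffer`. The dtype and length must be known out of band in that case, which holds here because every queue is a fixed 200 x int32:

```python
import numpy as np

ask200 = np.arange(200, dtype=np.int32)

# .STORE: raw bytes only, no pickle header (200 * 4 bytes).
raw = ask200.tobytes()

# .RETRIEVE: a read-only array that wraps the bytes without copying them.
restored = np.frombuffer(raw, dtype=np.int32)
```

Whether this reaches the quoted ~22 [us] depends on the platform; no timing is claimed for this sketch.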

Test run under py3.5.6, numpy v1.15.2, using:

>>> import numpy as np; ask200 = np.arange( 200, dtype = np.int32 ); s = ask200.dumps()
>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> def testDUMPs():
...     aClk.start()
...     s = ask200.dumps()
...     return aClk.stop()
...
>>> def testLOADs():
...     aClk.start()
...     a = np.loads( s )
...     return aClk.stop()
...
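For readers without `zmq` installed, or on a NumPy release where `np.loads` is no longer available, a comparable harness can be sketched with the standard library only. The helper name `time_us` is hypothetical, and `time.perf_counter_ns` stands in for the `Stopwatch` used in the original measurements, so absolute numbers will not match the ones reported:

```python
import pickle
import time

import numpy as np

ask200 = np.arange(200, dtype=np.int32)
s = pickle.dumps(ask200)


def time_us(fn, repeats=1000):
    """Return (min, mean, std, max) of fn's wall-clock duration in microseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - t0) / 1000.0)
    return tuple(f(samples) for f in (np.min, np.mean, np.std, np.max))


dump_stats = time_us(lambda: pickle.dumps(ask200))   # .STORE direction
load_stats = time_us(lambda: pickle.loads(s))        # .RETRIEVE direction
```

The lambda call adds a small constant overhead to every sample, which matters at this microsecond scale when comparing against the figures above.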

Platform CPU, cache hierarchy and RAM details:

>>> get_numexpr_cpuinfo_details_on_CPU()
'TLB size'______________________________:'1536 4K pages'
'address sizes'_________________________:'48 bits physical, 48 bits virtual'
'apicid'________________________________:'17'
'bogomips'______________________________:'7199.92'
'bugs'__________________________________:'fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2'
'cache size'____________________________:'2048 KB'
'cache_alignment'_______________________:'64'
'clflush size'__________________________:'64'
'core id'_______________________________:'1'
'cpu MHz'_______________________________:'1400.000'
'cpu cores'_____________________________:'2'
'cpu family'____________________________:'21'
'cpuid level'___________________________:'13'
'flags'_________________________________:'fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold'
'fpu'___________________________________:'yes'
'fpu_exception'_________________________:'yes'
'initial apicid'________________________:'1'
'microcode'_____________________________:'0x6000626'
'model'_________________________________:'1'
'model name'____________________________:'AMD FX(tm)-4100 Quad-Core Processor'
'physical id'___________________________:'0'
'power management'______________________:'ts ttp tm 100mhzsteps hwpstate cpb'
'processor'_____________________________:'1'
'siblings'______________________________:'4'
'stepping'______________________________:'2'
'vendor_id'_____________________________:'AuthenticAMD'
'wp'____________________________________:'yes'

Answer 2

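Pandas is designed to handle two-dimensional data (the kind you would put in a spreadsheet). Because "ask_queue" and "bid_queue" are not one-dimensional series but two-dimensional arrays, you cannot (easily) push them into a pandas DataFrame.

In such cases you have to use another library, such as xarray: http://xarray.pydata.org/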

import xarray as xr

# Creating variables, first argument is the name of the dimensions
last_price = xr.Variable("millis", data["last_price"])
ask_queue = xr.Variable(("millis", "levels"), data["ask_queue"])
bid_queue = xr.Variable(("millis", "levels"), data["bid_queue"])

# Putting the variables in a dataset, the multidimensional equivalent of a Pandas
# dataframe
ds = xr.Dataset({"last_price": last_price, "ask_queue": ask_queue,
                 "bid_queue": bid_queue}, coords={"millis": data["millis"]})

# Computing the average of ask_queue level 5~10
ds["ask_queue"][{"levels": slice(5,10)}].mean(axis=1)
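The same selection can also be written with dimension names rather than axis numbers, and a 1-D result converts back to pandas when a Series is needed downstream. A minimal self-contained sketch, assuming random data and 20 rows for illustration:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
millis = np.arange(20, dtype=np.int64)
ask = rng.integers(0, 100, size=(20, 200), dtype=np.int32)

ds = xr.Dataset(
    {"ask_queue": (("millis", "levels"), ask)},
    coords={"millis": millis},
)

# Select levels 5..9 by dimension name and average over that dimension.
level_mean = ds["ask_queue"].isel(levels=slice(5, 10)).mean(dim="levels")

# A 1-D DataArray converts to a pandas Series indexed by millis.
series = level_mean.to_pandas()
```

Naming the dimension in `mean(dim="levels")` keeps the code correct even if the dimension order of the variable changes.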
