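Rewriting a kdb script in DolphinDB

Time: 2019-08-16 17:55:16

Tags: kdb dolphindb

I am trying to rewrite a kdb script in DolphinDB.

Let me first explain what I need to do. If the signal is above a threshold T1, we establish a long position in the security. We don't want to close the position as soon as the signal drops below T1, so we give it a buffer: we close the position only when the signal drops below T10, where T10 < T1.

On the other hand, if the signal is below a threshold T2, we establish a short position. We close it only when the signal moves above T20, where T20 > T2.

T1 > T10 > T20 > T2.

Basically I need the following vector: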

 - if signal > T1, return 1; subsequent elements stay 1 until signal < T10;
 - if signal < T2, return -1; subsequent elements stay -1 until signal > T20;
 - 0 otherwise.
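As a reference point for the rules above, here is a minimal Python sketch of the same hysteresis logic written as an explicit state machine. The function name `positions` and its argument order are illustrative assumptions, not part of the question; it only restates the three rules and is not a translation of the kdb expression.

```python
def positions(signal, t1, t10, t20, t2):
    """Hysteresis position vector: 1 = long, -1 = short, 0 = flat.

    Assumes t1 > t10 > t20 > t2, as stated in the question.
    """
    out = []
    cur = 0  # start flat
    for s in signal:
        if s > t1:
            cur = 1                      # open (or keep) the long position
        elif s < t2:
            cur = -1                     # open (or keep) the short position
        elif cur == 1 and s < t10:
            cur = 0                      # long closes once the signal drops below T10
        elif cur == -1 and s > t20:
            cur = 0                      # short closes once the signal rises above T20
        out.append(cur)
    return out


example = positions([10, 20, 70, 59, 42, 49, 19, 25, 26, 35], 60, 50, 30, 20)
print(example)
```

On the sample data used later in this thread, this gives the same vector as the kdb expression.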

The kdb script for this task is:

0h^fills(-).(0N 1h)[(signal>T1;signal<T2)]^'(0N 0h)[(signal<T10;signal>T20)]

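Can anyone rewrite it in DolphinDB?

4 answers:

Answer 0: (score: 6)

DolphinDB 1.01 introduced a new feature, JIT. Adding an @jit tag in front of a function definition can improve performance considerably. The problem above can also be solved with a plain for loop, which is much easier to write than the vectorized solution.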

@jit
def calculate_with_jit(signal, n, t1, t10, t20, t2) {
  cur = 0
  idx = 0
  output = array(INT, n, n)
  for (s in signal) {
    if(s > t1) {           // (t1, inf)
      cur = 1
    } else if(s >= t10) {  // [t10, t1]
      if(cur == -1) cur = 0
    } else if(s > t20) {   // (t20, t10)
      cur = 0
    } else if(s >= t2) {   // [t2, t20]
      if(cur == 1) cur = 0
    } else {               // (-inf, t2)
      cur = -1
    }
    output[idx] = cur
    idx += 1
  }
  return output
}

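On my machine, the JIT version takes only 170 ms for a signal of length 10 million, while the vectorized version takes 410 ms.

For more details, see the JIT tutorial (https://github.com/dolphindb/Tutorials_EN/blob/master/jit.md).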
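For readers who want to compare the loop with the fill-based expression from the question, the following Python sketch ports both. The names `loop_version` and `fill_version` are hypothetical, and `fill_version` is only an emulation of the kdb semantics (null where neither leg fires, then forward fill, then 0 for leading nulls). Under that reading, the two agree on the question's sample but appear to diverge when the signal jumps from above T1 straight into [T2, T20]: the loop flattens the position, while forward fill carries the 1 along.

```python
def loop_version(signal, t1, t10, t20, t2):
    """Python port of the for-loop logic in the JIT answer."""
    cur, out = 0, []
    for s in signal:
        if s > t1:               # (t1, inf)
            cur = 1
        elif s >= t10:           # [t10, t1]
            if cur == -1:
                cur = 0
        elif s > t20:            # (t20, t10)
            cur = 0
        elif s >= t2:            # [t2, t20]
            if cur == 1:
                cur = 0
        else:                    # (-inf, t2)
            cur = -1
        out.append(cur)
    return out


def fill_version(signal, t1, t10, t20, t2):
    """Emulation of the fill-based expression: None plays the role of a null."""
    out, last = [], 0            # leading nulls end up as 0
    for s in signal:
        long_leg = 0 if s < t10 else (1 if s > t1 else None)
        short_leg = 0 if s > t20 else (1 if s < t2 else None)
        if long_leg is not None and short_leg is not None:
            last = long_leg - short_leg   # a defined value replaces the carried one
        out.append(last)                  # otherwise the previous value is carried
    return out


sample = [10, 20, 70, 59, 42, 49, 19, 25, 26, 35]
jump = [70, 25]                  # from above T1 directly into [T2, T20]
print(loop_version(sample, 60, 50, 30, 20), fill_version(sample, 60, 50, 30, 20))
print(loop_version(jump, 60, 50, 30, 20), fill_version(jump, 60, 50, 30, 20))
```

Whether the difference matters depends on how abruptly the real signal can move between consecutive observations.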

Answer 1: (score: 5)

I did a literal translation in DolphinDB version 0.97.4:

eachPost(-, loop(nullFill, [iif(signal<T10, 0h, 00h), iif(signal>T20, 0h, 00h)], [iif(signal>T1, 1h, 00h), iif(signal<T2, 1h, 00h)]))[0].ffill().nullFill(0h)

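iif(cond, trueResult, falseResult) is an element-wise conditional function. 00h is the null value of the SHORT type. nullFill(X, Y) replaces the nulls in X with the corresponding values in Y. ffill(X) replaces the nulls in X with the preceding non-null value. loop and eachPost are both higher-order functions.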
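To make the data flow of that expression easier to follow, here is a small Python emulation that uses None as the null value. The helpers `iif`, `null_fill`, `ffill`, and `sub` are hypothetical stand-ins written for this sketch; they mimic the behavior described above on plain lists and are not DolphinDB APIs.

```python
def iif(cond, true_val, false_val):
    """Element-wise conditional over a list of booleans."""
    return [true_val if c else false_val for c in cond]


def null_fill(xs, ys):
    """Replace each None in xs with the corresponding element of ys."""
    return [y if x is None else x for x, y in zip(xs, ys)]


def ffill(xs):
    """Replace each None with the closest preceding non-None value."""
    out, last = [], None
    for x in xs:
        if x is not None:
            last = x
        out.append(last)
    return out


def sub(xs, ys):
    """Element-wise difference; the result is None if either side is None."""
    return [None if x is None or y is None else x - y for x, y in zip(xs, ys)]


T1, T10, T20, T2 = 60, 50, 30, 20
signal = [10, 20, 70, 59, 42, 49, 19, 25, 26, 35]

# Long leg: 0 where signal < T10, else 1 where signal > T1, else null.
long_leg = null_fill(iif([s < T10 for s in signal], 0, None),
                     iif([s > T1 for s in signal], 1, None))
# Short leg: 0 where signal > T20, else 1 where signal < T2, else null.
short_leg = null_fill(iif([s > T20 for s in signal], 0, None),
                      iif([s < T2 for s in signal], 1, None))

raw = sub(long_leg, short_leg)                      # null inside the buffer zones
result = null_fill(ffill(raw), [0] * len(signal))   # carry forward, then default to 0
print(raw)
print(result)
```

The intermediate `raw` vector shows where the buffer zones leave a null that the forward fill later resolves.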

Test case in DolphinDB:

T1= 60
T10 = 50
T20 = 30
T2 = 20
signal = 10 20 70 59 42 49 19 25 26  35
eachPost(-, loop(nullFill, [iif(signal<T10, 0h, 00h), iif(signal>T20, 0h, 00h)], [iif(signal>T1, 1h, 00h), iif(signal<T2, 1h, 00h)]))[0].ffill().nullFill(0h)

-1 -1 1 1 0 0 -1 -1 -1 0

Test case in KDB+:

T1:60
T10:50
T20:30
T2:20
signal:10 20 70 59 42 49 19 25 26  35
0h^fills(-).(0N 1h)[(signal>T1;signal<T2)]^'(0N 0h)[(signal<T10;signal>T20)]

-1 -1 1 1 0 0 -1 -1 -1 0

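I also did a quick performance comparison. I generated 10 million random signals and ran the expressions above in DolphinDB and KDB+ respectively. KDB+ took 800 ms, while DolphinDB took 480 ms. The performance test code is below.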

//DolphinDB
T1= 60
T10 = 50
T20 = 30
T2 = 20
signal = 1 + rand(99.0, 10000000)
timer eachPost(-, loop(nullFill, [iif(signal<T10, 0h, 00h), iif(signal>T20, 0h, 00h)], [iif(signal>T1, 1h, 00h), iif(signal<T2, 1h, 00h)]))[0].ffill().nullFill(0h)

//KDB+
T1:60
T10:50
T20:30
T2:20
signal: 1.0 + 10000000 ? 99.0
\t  0h^fills(-).(0N 1h)[(signal>T1;signal<T2)]^'(0N 0h)[(signal<T10;signal>T20)]

Answer 2: (score: 3)

I optimized the DolphinDB test code like this:

t1= 60
t10 = 50
t20 = 30
t2 = 20
signal = rand(100.0, 10000000)
timer direction = (iif(signal >t1, 1h, iif(signal < t10, 0h, 00h)) - iif(signal <t2, 1h, iif(signal > t20, 0h, 00h))).ffill().nullFill(0h)

It took only 330 ms.
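The optimized expression folds each nullFill call into a nested iif and tests the conditions in the opposite order. A quick Python check, with hypothetical helper names and None standing in for null, suggests the two forms agree as long as T1 > T10 and T20 > T2, because the two conditions of each leg can then never both hold for the same element.

```python
import random

T1, T10, T20, T2 = 60, 50, 30, 20


def long_two_step(s):
    # nullFill(iif(s < T10, 0, null), iif(s > T1, 1, null))
    if s < T10:
        return 0
    return 1 if s > T1 else None


def long_fused(s):
    # iif(s > T1, 1, iif(s < T10, 0, null))
    if s > T1:
        return 1
    return 0 if s < T10 else None


def short_two_step(s):
    # nullFill(iif(s > T20, 0, null), iif(s < T2, 1, null))
    if s > T20:
        return 0
    return 1 if s < T2 else None


def short_fused(s):
    # iif(s < T2, 1, iif(s > T20, 0, null))
    if s < T2:
        return 1
    return 0 if s > T20 else None


def diff(a, b):
    """Null-propagating subtraction."""
    return None if a is None or b is None else a - b


random.seed(0)
samples = [random.uniform(0, 100) for _ in range(100000)]
samples += [T1, T10, T20, T2]   # include the exact boundary values
mismatches = sum(
    1 for s in samples
    if diff(long_two_step(s), short_two_step(s)) != diff(long_fused(s), short_fused(s))
)
print(mismatches)
```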


Update: parallel version

DolphinDB provides the function pcall for parallel execution.

def foo(signal){
 t1= 60
 t10 = 50
 t20 = 30
 t2 = 20
 return iif(signal >t1, 1h, iif(signal < t10, 0h, 00h)) - iif(signal <t2, 1h, iif(signal > t20, 0h, 00h))
}
signal = rand(100.0, 250000000)

//with a single thread
timer foo(signal).ffill().nullFill(0h)

//with multiple threads
timer pcall(foo,signal).ffill().nullFill(0h)

For 250 million signals, the computation took 6,412 ms with a single thread, 2,938 ms with two threads, and 2,086 ms with four threads.
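The structure of this parallel version can be sketched in Python: the element-wise encoding has no dependency between elements, so it can be split into chunks, while the forward fill depends on earlier elements and is therefore applied once to the merged result. The names `encode`, `ffill_zero`, and `parallel_direction` are hypothetical, and the thread pool is only meant to show the shape of the computation; it is not a claim about how pcall is implemented, and CPython threads will not reproduce the speedups reported above.

```python
from concurrent.futures import ThreadPoolExecutor

T1, T10, T20, T2 = 60, 50, 30, 20


def encode(chunk):
    """Element-wise step: 1, 0, -1, or None where the signal is in a buffer zone."""
    out = []
    for s in chunk:
        long_leg = 1 if s > T1 else (0 if s < T10 else None)
        short_leg = 1 if s < T2 else (0 if s > T20 else None)
        if long_leg is None or short_leg is None:
            out.append(None)
        else:
            out.append(long_leg - short_leg)
    return out


def ffill_zero(xs):
    """Forward fill, with 0 for any leading None values."""
    out, last = [], 0
    for x in xs:
        if x is not None:
            last = x
        out.append(last)
    return out


def parallel_direction(signal, workers=4):
    size = max(1, -(-len(signal) // workers))          # ceiling division
    chunks = [signal[i:i + size] for i in range(0, len(signal), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(encode, chunks))         # order of chunks is preserved
    merged = [x for part in parts for x in part]
    return ffill_zero(merged)                          # fill across chunk boundaries


signal = [10, 20, 70, 59, 42, 49, 19, 25, 26, 35]
serial = ffill_zero(encode(signal))
parallel = parallel_direction(signal, workers=4)
print(serial == parallel)
```

Filling each chunk separately would lose the value carried across a chunk boundary, which is why the fill stays outside the parallel step here, as it does in the answer's code.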

Answer 3: (score: 1)

Update 9/11/2019

Updated version based on the comments:

WITH 
    rand()/4294967295*100 AS s,
    60 AS t1, 
    50 AS t10, 
    30 AS t20, 
    20 AS t2, 
    if(s < t10, 0, if(s > t1, 1, NULL)) as signal1,
    if(s > t20, 0, if(s < t2, 1, NULL)) as signal2
SELECT
    arrayFill(x -> (x != -2), groupArray(toInt8(ifNull(signal1 - signal2, -2)))) as k
FROM numbers_mt(10000000);

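Updated based on feedback from Summer.H. On my system (Core i7-7820X), the best-of-5 timings are as follows. The timing difference between the two is small.

10 million signals

Random number generation + computation

  • 1 thread: DolphinDB 337 ms, ClickHouse 291 ms
  • 2 threads (1 core): DolphinDB 226 ms, ClickHouse 189 ms

Computation only

  • 1 thread: DolphinDB 302 ms, ClickHouse 233 ms
  • 2 threads (1 core): DolphinDB 179 ms, ClickHouse 165 ms

250 million signals

Random number generation + computation

  • 1 thread: DolphinDB 7.901 s, ClickHouse 7.103 s
  • 2 threads (1 core): DolphinDB 4.786 s, ClickHouse 4.297 s
  • 4 threads (2 cores): ClickHouse 2.965 s

Computation only

  • 1 thread: DolphinDB 7.106 s, ClickHouse 5.564 s
  • 2 threads (1 core): DolphinDB 3.966 s, ClickHouse 3.668 s
  • 4 threads (2 cores): ClickHouse 2.573 s

Original

Note that this doesn't answer the specific question, since that is about DolphinDB, but here is a version using ClickHouse as well.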

WITH 
    60 AS t1, 
    50 AS t10, 
    30 AS t20, 
    20 AS t2, 
    ([if(s < t10, 0, NULL), if(s > t20, 0, NULL)], [if(s > t1, 1, NULL), if(s < t2, 1, NULL)]) AS signal
SELECT arrayFill(x -> (x != -2), groupArray(ifNull(coalesce((signal.1)[1], (signal.2)[1]) - coalesce((signal.1)[2], (signal.2)[2]), -2))) AS k
FROM 
(
    SELECT arrayJoin([10, 20, 70, 59, 42, 49, 19, 25, 26, 35]) AS s
)
FORMAT TSV

[-1,-1,1,1,0,0,-1,-1,-1,0]
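The query above encodes "no new position information" as the sentinel -2 and lets arrayFill carry the previous value forward. A minimal Python sketch of that idea follows; `array_fill` is a hypothetical helper modeled on the documented behavior of ClickHouse's arrayFill (an element is replaced by its predecessor when the lambda returns 0, and the first element is never replaced).

```python
SENTINEL = -2  # stands in for NULL, as in ifNull(..., -2) in the query


def array_fill(keep, arr):
    """Replace arr[i] with the (already filled) arr[i-1] when keep(arr[i]) is false.

    The first element is left untouched.
    """
    out = list(arr)
    for i in range(1, len(out)):
        if not keep(out[i]):
            out[i] = out[i - 1]
    return out


# Difference of the two legs for the sample signal, with NULL mapped to -2.
encoded = [-1, -2, 1, -2, 0, 0, -1, -2, -2, 0]
filled = array_fill(lambda x: x != SENTINEL, encoded)
print(filled)

# A leading sentinel has no predecessor to copy from, so it stays in place.
leading = array_fill(lambda x: x != SENTINEL, [SENTINEL, 1, SENTINEL])
print(leading)
```

One consequence worth checking on real data: if the very first signal falls in a buffer zone, the sentinel would remain in the output instead of the 0 that the kdb expression produces with `0h^`.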

Benchmarked on 10 million random samples. On my system, Summer.H's fastest DolphinDB answer, best of 5 runs:

DolphinDB (4 threads)

./dolphindb
DolphinDB Systems 0.99.0 64 bit Copyright (c) 2011~2019 DolphinDB, Inc. Licensed to Trial User. Expires on 2019.12.31 (Build:2019.10.25)

>timer t1= 60
timer t10 = 50
timer t20 = 30
timer t2 = 20
timer signal = rand(100.0, 10000000)
timer direction = (iif(signal >t1, 1h, iif(signal < t10, 0h, 00h)) - iif(signal <t2, 1h, iif(signal > t20, 0h, 00h))).ffill().nullFill(0h)
;>>>>>>
Time elapsed: 0.01 ms
Time elapsed: 0.001 ms
Time elapsed: 0.001 ms
Time elapsed: 0.001 ms
Time elapsed: 72.675 ms
Time elapsed: 305.442 ms

Total time: 378 ms

ClickHouse (limited to 4 threads, like DolphinDB), best of 5 runs:

CREATE TEMPORARY TABLE dtest2 AS
WITH
    rand()%100 + rand()/4294967295 AS s,
    60 AS t1,
    50 AS t10,
    30 AS t20,
    20 AS t2,
    ([if(s < t10, 0, NULL), if(s > t20, 0, NULL)], [if(s > t1, 1, NULL), if(s < t2, 1, NULL)]) AS signal
SELECT arrayFill(x -> (x != -2), groupArray(ifNull(coalesce((signal.1)[1], (signal.2)[1]) - coalesce((signal.1)[2], (signal.2)[2]), -2))) AS k
FROM numbers_mt(10000000)

Ok.

0 rows in set. Elapsed: 0.300 sec. Processed 10.00 million rows, 80.00 MB (33.37 million rows/s., 266.94 MB/s.)

Total time 300 ms

ClickHouse, default configuration / no thread limit, best of 5 runs:

CREATE TEMPORARY TABLE dtest2 AS
WITH
    rand()%100 + rand()/4294967295 AS s,
    60 AS t1,
    50 AS t10,
    30 AS t20,
    20 AS t2,
    ([if(s < t10, 0, NULL), if(s > t20, 0, NULL)], [if(s > t1, 1, NULL), if(s < t2, 1, NULL)]) AS signal
SELECT arrayFill(x -> (x != -2), groupArray(ifNull(coalesce((signal.1)[1], (signal.2)[1]) - coalesce((signal.1)[2], (signal.2)[2]), -2))) AS k
FROM numbers_mt(10000000)

Ok.

0 rows in set. Elapsed: 0.191 sec. Processed 10.00 million rows, 80.00 MB (52.22 million rows/s., 417.74 MB/s.)

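Final timings for each:

kdb (800 ms), DolphinDB (480 ms, 378 ms, 330 ms?), ClickHouse (191 ms)

I further benchmarked 250 million random signals here: DolphinDB took 9214.99 ms (1420 ms random signal generation + 7794.24 ms computation). ClickHouse took 4272 ms in total for random generation and computation.

I am limited by my DolphinDB license, but ClickHouse managed 1 billion signals in 17.4 s (in-memory table) or 20.1 s (to disk).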