Question

Say my df looks like this:

    price   quantity
0   100     20
1   102     31
2   105     25
3   99      40
4   104     10
5   103     20
6   101     55

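There are no time intervals here. I need to compute the volume-weighted average price for every 50 items. Each row (index) of the output would represent 50 units (rather than, say, a 5-minute interval), and the output column would be the volume-weighted price.

Is there a clean way to do this with pandas, or with numpy? I tried using a loop to split each row into one price per item and then group them like this: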

import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

But it takes forever and I run out of memory. The df has a few million rows.

Edit: the output I expect to see based on the above is:

     vwap
0    101.20
1    102.12
2    103.36
3    101.00

A new average price for every 50 items.

Answer 1

I struck out on my first at-bat with this question. Here is my next plate appearance. Hopefully I can put the ball in play and score a run.

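First, let's address some of the comments about the expected result of this exercise. The OP posted what he thought the result should be for the small sample data he provided. However, @user7138814 and I both came up with the same result, and it differs from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be computed using the OP's example. I'll use this worksheet as an illustration.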

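The first two columns (A and B) are the original values given by the OP. Given these values, the goal is to compute the weighted average of each block of exactly 50 units. Unfortunately, the quantities do not divide evenly into 50. Columns C and D show how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow-shaded areas show how the original quantities are subdivided, and each green-bordered group of cells sums to exactly 50 units. Once the 50-unit blocks are determined, the weighted average can be computed in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the approach.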
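
For readers without the worksheet, the subdivision described above can be sketched in plain Python. This is only an illustration of the bookkeeping, not the optimized solution; the helper name `split_into_blocks` and its `block` parameter are made up for this sketch.

```python
def split_into_blocks(prices, quantities, block=50):
    """Split (price, quantity) rows into consecutive blocks of exactly `block` units.

    Returns a list of blocks; each block is a list of (price, units) pieces.
    A trailing partial block (fewer than `block` units) is discarded.
    """
    blocks, current, room = [], [], block
    for price, qty in zip(prices, quantities):
        while qty > 0:
            take = min(qty, room)  # units of this row that fit in the open block
            current.append((price, take))
            qty -= take
            room -= take
            if room == 0:  # block is full: close it and open a new one
                blocks.append(current)
                current, room = [], block
    return blocks


# The OP's sample data
prices = [100, 102, 105, 99, 104, 103, 101]
quantities = [20, 31, 25, 40, 10, 20, 55]

blocks = split_into_blocks(prices, quantities)
vwaps = [sum(p * q for p, q in b) / 50 for b in blocks]
for b, v in zip(blocks, vwaps):
    print(b, round(v, 2))
```

The first block, for example, is 20 units at 100 plus 30 units at 102, giving (2000 + 3060) / 50 = 101.2. The one unit left over at the end does not fill a block and is dropped.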

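After much trial and error, the final solution is a function that operates on the underlying numpy arrays of the price and quantity series. The function is further optimized with the Numba decorator, which JIT-compiles the Python code to machine code. On my laptop, it processed a 3-million-row array in under a second.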

Here is the function.

import numba
import numpy as np

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)

    qty_cumdif = 50  # cum difference of quantity to track when 50 units are reached
    pq = 0.0  # cumsum of price * quantity
    vwap50 = []  # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]

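        # if current qty reaches or exceeds the units still needed
        # for the open block of 50, divide the units
        # (<= so a block completed exactly by the last row is still emitted)
        if qty_cumdif <= qty: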
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become starting
            # values for next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # process price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)

Results of processing the OP's sample data.

Out[6]: 
   price  quantity
0    100        20
1    102        31
2    105        25
3     99        40
4    104        10
5    103        20
6    101        55

vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101.  ])

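Note that I use the .values attribute to pass the numpy arrays underlying the pandas series. This is one of the requirements of using numba: Numba is numpy-aware and does not work with pandas objects.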
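
As an independent cross-check of those numbers, here is a vectorized sketch that uses only numpy. It is a different technique from the loop above, not part of the original answer: cumulative price * quantity is piecewise linear in cumulative quantity, so `np.interp` can evaluate it at every 50-unit boundary. The function name `vwap_blocks_interp` is hypothetical, and the sketch assumes every quantity is strictly positive, since `np.interp` expects increasing x-coordinates.

```python
import numpy as np


def vwap_blocks_interp(price, quantity, block=50):
    """Cross-check sketch: VWAP per `block` units via cumulative sums.

    Assumes all quantities are strictly positive.
    """
    price = np.asarray(price, dtype=np.float64)
    quantity = np.asarray(quantity, dtype=np.float64)

    # Cumulative units and cumulative price * quantity, both starting at 0.
    cum_qty = np.concatenate(([0.0], np.cumsum(quantity)))
    cum_pq = np.concatenate(([0.0], np.cumsum(price * quantity)))

    # Block boundaries at 0, 50, 100, ... up to the last complete block.
    n_blocks = int(cum_qty[-1] // block)
    edges = np.arange(n_blocks + 1) * float(block)

    # Linear interpolation gives cumulative price * quantity at each boundary
    # (up to floating-point rounding); differences are per-block totals.
    pq_at_edges = np.interp(edges, cum_qty, cum_pq)
    return np.diff(pq_at_edges) / block


price = [100, 102, 105, 99, 104, 103, 101]
quantity = [20, 31, 25, 40, 10, 20, 55]
result = vwap_blocks_interp(price, quantity)
print(np.round(result, 2))
```

On the OP's sample this reproduces the four averages shown above. For very large inputs the cumulative sums grow large, so small floating-point differences from the loop version are to be expected.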

It performs quite well on a 3-million-row array, producing an output array of 2.25 million weighted averages.

df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
                  'quantity': np.random.randint(1, 75, 3000000)})


%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vwap = vwap50_jit(df.price.values, df.quantity.values)

vwap.shape
Out[11]: (2250037,)
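
A quick sanity check on that output size: the number of complete 50-unit blocks is the total quantity floor-divided by 50. Quantities drawn from `np.random.randint(1, 75, ...)` range over 1..74 and average 37.5 units per row, so 3 million rows should yield about 3,000,000 * 37.5 / 50 = 2.25 million blocks. The sketch below regenerates comparable random data; the seed is arbitrary and chosen only for repeatability, so the exact count will differ from the run above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only to make the sketch repeatable
quantity = rng.integers(1, 75, 3_000_000)  # values 1..74, like np.random.randint(1, 75, ...)

# Number of complete 50-unit blocks; leftover units at the end are dropped.
expected_blocks = int(quantity.sum() // 50)
print(expected_blocks)
```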

Resampling non-time-related buckets
