I have a table with several million transactions. The table contains each transaction's timestamp, amount, and a number of other attributes (e.g., an address). For every transaction, I want to compute the count and the sum of amounts of the transactions that occurred within a given time window before it (e.g., 1 month) and have the same address.
Here is an example of the input:
+----+---------------------+----------------+--------+
| id | ts                  | address        | amount |
+----+---------------------+----------------+--------+
| 0  | 2016-10-11 00:34:21 | 123 First St.  |  56.20 |
+----+---------------------+----------------+--------+
| 1  | 2016-10-13 02:53:58 | 456 Second St. |  96.19 |
+----+---------------------+----------------+--------+
| 2  | 2016-10-23 02:28:17 | 123 First St.  |  64.65 |
+----+---------------------+----------------+--------+
| 3  | 2016-10-31 07:14:35 | 456 Second St. |  36.38 |
+----+---------------------+----------------+--------+
| 4  | 2016-11-04 09:25:39 | 123 First St.  |  93.65 |
+----+---------------------+----------------+--------+
| 5  | 2016-11-20 22:30:15 | 123 First St.  |  88.39 |
+----+---------------------+----------------+--------+
| 6  | 2016-11-28 09:39:14 | 123 First St.  |  74.40 |
+----+---------------------+----------------+--------+
| 7  | 2016-12-03 17:09:12 | 123 First St.  |  83.13 |
+----+---------------------+----------------+--------+
This should output:
+----+-------+--------+
| id | count | amount |
+----+-------+--------+
| 0  |     0 |   0.00 |
+----+-------+--------+
| 1  |     0 |   0.00 |
+----+-------+--------+
| 2  |     1 |  56.20 |
+----+-------+--------+
| 3  |     1 |  96.19 |
+----+-------+--------+
| 4  |     2 | 120.85 |
+----+-------+--------+
| 5  |     1 |  64.65 |
+----+-------+--------+
| 6  |     1 |  88.39 |
+----+-------+--------+
| 7  |     2 | 162.79 |
+----+-------+--------+
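For example, row 4 (2016-11-04, 123 First St.) is preceded within one month by two transactions at the same address, rows 0 and 2, so its count is 2 and its amount is 56.20 + 64.65 = 120.85.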
To do this, I sorted the table by timestamp and then essentially used queues and dictionaries, but it seems to run very slowly, so I am wondering whether there is a better way to do it.
Here is my code:
import csv
import Queue
import time

props = [ 'address', ... ]
spans = { '1m': 2629800, ... }

# One count and one amount column per span/property combination.
h = [ 'id' ]
for value in [ 'count', 'amount' ]:
    for span in spans:
        for prop in props:
            h.append(span + '_' + prop + '_' + value)

# tq: timestamps currently inside each span's window,
# kq: the corresponding property values, in the same order,
# vq: per-value queues of amounts, used for the count and the sum.
tq = { }
kq = { }
vq = { }
for span in spans:
    tq[span] = Queue.Queue()
    kq[span] = { }
    vq[span] = { }
    for prop in props:
        kq[span][prop] = Queue.Queue()
        vq[span][prop] = { }

with open('transactions.csv', 'r') as csvin, open('velocities.csv', 'w') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, h)
    writer.writeheader()
    for i in reader:
        o = { 'id': i['id'] }
        ts = time.mktime(time.strptime(i['ts'], '%Y-%m-%d %H:%M:%S'))
        for span in spans:
            # Evict everything that has fallen out of this span's window.
            while not tq[span].empty() and ts > tq[span].queue[0] + spans[span]:
                tq[span].get()
                for prop in props:
                    key = kq[span][prop].get()
                    vq[span][prop][key].get()
                    if vq[span][prop][key].empty():
                        del vq[span][prop][key]
            tq[span].put(ts)
            for prop in props:
                kq[span][prop].put(i[prop])
                if not i[prop] in vq[span][prop]:
                    vq[span][prop][i[prop]] = Queue.Queue()
                o[span + '_' + prop + '_count'] = vq[span][prop][i[prop]].qsize()
                o[span + '_' + prop + '_amount'] = sum(vq[span][prop][i[prop]].queue)
                vq[span][prop][i[prop]].put(float(i['amount']))  # the sample CSV column is 'amount'
        writer.writerow(o)
        csvout.flush()
I also tried replacing vq[span][prop] with an RB tree, but the performance was even worse.
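For reference, here is a minimal sketch of the same sliding-window idea, but with collections.deque and running per-address totals instead of Queue.Queue and a full sum() over the queue on every row; it assumes a single '1m' span, only the address property, and the column names from the sample above:

import csv
import time
from collections import deque, defaultdict

SPAN = 2629800  # '1m' in seconds, the same value as in the code above

window = deque()             # (ts, address, amount) tuples currently inside the window
counts = defaultdict(int)    # running count per address
totals = defaultdict(float)  # running amount total per address

with open('transactions.csv', 'r') as csvin, open('velocities.csv', 'w') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, ['id', '1m_address_count', '1m_address_amount'])
    writer.writeheader()
    for row in reader:
        ts = time.mktime(time.strptime(row['ts'], '%Y-%m-%d %H:%M:%S'))
        # Evict transactions that have fallen out of the window and
        # update the running totals as they leave.
        while window and ts > window[0][0] + SPAN:
            old_ts, old_addr, old_amount = window.popleft()
            counts[old_addr] -= 1
            totals[old_addr] -= old_amount
        addr = row['address']
        # Write the stats for earlier transactions before adding the current one.
        writer.writerow({
            'id': row['id'],
            '1m_address_count': counts[addr],
            '1m_address_amount': round(totals[addr], 2),
        })
        amount = float(row['amount'])
        window.append((ts, addr, amount))
        counts[addr] += 1
        totals[addr] += amount

Because the counts and totals are updated incrementally as transactions enter and leave the window, each row is touched only a constant number of times, so the whole pass stays roughly linear instead of re-summing a queue for every output row.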
Answer 0 (score: 0)
Either I fundamentally misunderstand what you are trying to do, or you are doing it wrong, because your code is far more complicated (not complex, complicated) than it needs to be if you are doing what you say you are doing.
import csv
from collections import namedtuple, defaultdict, Counter
from datetime import datetime

Span = namedtuple('Span', ('start', 'end'))
month_span = Span(start=datetime(2016, 1, 1), end=datetime(2016, 1, 31))

counts = defaultdict(Counter)
amounts = defaultdict(Counter)

with open('transactions.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        timestamp = datetime.strptime(row['ts'], '%Y-%m-%d %H:%M:%S')
        if month_span.start < timestamp < month_span.end:  # or <=
            # You do some checking for properties. If you *will* always
            # have these columns, you *should* just use ``row['count']``
            # and ``row['amount']``
            counts[month_span][row['address']] += int(row.get('count', 0))
            amounts[month_span][row['address']] += float(row.get('amount', 0.00))

print(counts)
print(amounts)
Note that you are still operating on, as you said, "millions of transactions". Whichever way you go about it, it is going to take a while, because you are doing the same thing several million times. If you want to see where your current code spends its time, you can profile it. I have found line profiler easy to use and it works well.
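As a rough sketch of what that profiling step could look like with the standard-library cProfile (process_transactions here is a hypothetical wrapper around the CSV loop above, not a function from the posted code):

import cProfile
import pstats

# Profile one full pass over the CSV and print the 20 most expensive
# calls sorted by cumulative time.
cProfile.run('process_transactions()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)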
Most likely, since you are doing what you do a few million times, you will not be able to speed this up much without dropping down to a lower-level language, e.g. Cython, C or C++. That will speed some things up, but writing the code will definitely be much harder.