I have a table with several million transactions. The table contains each transaction's timestamp, amount, and a number of other attributes (e.g., an address). For every transaction, I want to compute the count and the sum of amounts of the transactions that occurred within a given time window before it (e.g., 1 month) and have the same address.
Here is an example of the input:
+----+---------------------+----------------+--------+
| id | ts                  | address        | amount |
+----+---------------------+----------------+--------+
| 0  | 2016-10-11 00:34:21 | 123 First St.  |  56.20 |
+----+---------------------+----------------+--------+
| 1  | 2016-10-13 02:53:58 | 456 Second St. |  96.19 |
+----+---------------------+----------------+--------+
| 2  | 2016-10-23 02:28:17 | 123 First St.  |  64.65 |
+----+---------------------+----------------+--------+
| 3  | 2016-10-31 07:14:35 | 456 Second St. |  36.38 |
+----+---------------------+----------------+--------+
| 4  | 2016-11-04 09:25:39 | 123 First St.  |  93.65 |
+----+---------------------+----------------+--------+
| 5  | 2016-11-20 22:30:15 | 123 First St.  |  88.39 |
+----+---------------------+----------------+--------+
| 6  | 2016-11-28 09:39:14 | 123 First St.  |  74.40 |
+----+---------------------+----------------+--------+
| 7  | 2016-12-03 17:09:12 | 123 First St.  |  83.13 |
+----+---------------------+----------------+--------+
This should output:
+----+-------+--------+
| id | count | amount |
+----+-------+--------+
| 0  |     0 |   0.00 |
+----+-------+--------+
| 1  |     0 |   0.00 |
+----+-------+--------+
| 2  |     1 |  56.20 |
+----+-------+--------+
| 3  |     1 |  96.19 |
+----+-------+--------+
| 4  |     2 | 120.85 |
+----+-------+--------+
| 5  |     1 |  64.65 |
+----+-------+--------+
| 6  |     1 |  88.39 |
+----+-------+--------+
| 7  |     2 | 162.79 |
+----+-------+--------+
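For example, row 4 (2016-11-04, 123 First St.) is preceded within one month by two transactions at the same address, rows 0 and 2, so its count is 2 and its amount is 56.20 + 64.65 = 120.85.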
To do this, I sorted the table by timestamp and then essentially used queues and dictionaries, but it seems to run very slowly, so I am wondering whether there is a better way to do it.
Here is my code:
import csv
import Queue
import time

props = [ 'address', ... ]
spans = { '1m': 2629800, ... }

# One count and one amount column per span/property combination.
h = [ 'id' ]
for value in [ 'count', 'amount' ]:
    for span in spans:
        for prop in props:
            h.append(span + '_' + prop + '_' + value)

# tq: timestamps currently inside each span's window,
# kq: the corresponding property values, in the same order,
# vq: per-value queues of amounts, used for the count and the sum.
tq = { }
kq = { }
vq = { }
for span in spans:
    tq[span] = Queue.Queue()
    kq[span] = { }
    vq[span] = { }
    for prop in props:
        kq[span][prop] = Queue.Queue()
        vq[span][prop] = { }

with open('transactions.csv', 'r') as csvin, open('velocities.csv', 'w') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, h)
    writer.writeheader()
    for i in reader:
        o = { 'id': i['id'] }
        ts = time.mktime(time.strptime(i['ts'], '%Y-%m-%d %H:%M:%S'))
        for span in spans:
            # Evict everything that has fallen out of this span's window.
            while not tq[span].empty() and ts > tq[span].queue[0] + spans[span]:
                tq[span].get()
                for prop in props:
                    key = kq[span][prop].get()
                    vq[span][prop][key].get()
                    if vq[span][prop][key].empty():
                        del vq[span][prop][key]
            tq[span].put(ts)
            for prop in props:
                kq[span][prop].put(i[prop])
                if not i[prop] in vq[span][prop]:
                    vq[span][prop][i[prop]] = Queue.Queue()
                o[span + '_' + prop + '_count'] = vq[span][prop][i[prop]].qsize()
                o[span + '_' + prop + '_amount'] = sum(vq[span][prop][i[prop]].queue)
                vq[span][prop][i[prop]].put(float(i['amount']))  # the sample CSV column is 'amount'
        writer.writerow(o)
        csvout.flush()
I also tried replacing vq[span][prop] with an RB tree, but the performance was even worse.
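For reference, here is a minimal sketch of the same sliding-window idea, but with collections.deque and running per-address totals instead of Queue.Queue and a full sum() over the queue on every row; it assumes a single '1m' span, only the address property, and the column names from the sample above:

import csv
import time
from collections import deque, defaultdict

SPAN = 2629800  # '1m' in seconds, the same value as in the code above

window = deque()             # (ts, address, amount) tuples currently inside the window
counts = defaultdict(int)    # running count per address
totals = defaultdict(float)  # running amount total per address

with open('transactions.csv', 'r') as csvin, open('velocities.csv', 'w') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, ['id', '1m_address_count', '1m_address_amount'])
    writer.writeheader()
    for row in reader:
        ts = time.mktime(time.strptime(row['ts'], '%Y-%m-%d %H:%M:%S'))
        # Evict transactions that have fallen out of the window and
        # update the running totals as they leave.
        while window and ts > window[0][0] + SPAN:
            old_ts, old_addr, old_amount = window.popleft()
            counts[old_addr] -= 1
            totals[old_addr] -= old_amount
        addr = row['address']
        # Write the stats for earlier transactions before adding the current one.
        writer.writerow({
            'id': row['id'],
            '1m_address_count': counts[addr],
            '1m_address_amount': round(totals[addr], 2),
        })
        amount = float(row['amount'])
        window.append((ts, addr, amount))
        counts[addr] += 1
        totals[addr] += amount

Because the counts and totals are updated incrementally as transactions enter and leave the window, each row is touched only a constant number of times, so the whole pass stays roughly linear instead of re-summing a queue for every output row.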
Answer 0 (score: 0)
Either I fundamentally misunderstand what you are trying to do, or you are doing it wrong, because your code is far more complicated (not complex, complicated) than it needs to be if you are doing what you say you are doing.
import csv
from collections import namedtuple, defaultdict, Counter
from datetime import datetime

Span = namedtuple('Span', ('start', 'end'))
month_span = Span(start=datetime(2016, 1, 1), end=datetime(2016, 1, 31))

counts = defaultdict(Counter)
amounts = defaultdict(Counter)

with open('transactions.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        timestamp = datetime.strptime(row['ts'], '%Y-%m-%d %H:%M:%S')
        if month_span.start < timestamp < month_span.end:  # or <=
            # You do some checking for properties. If you *will* always
            # have these columns, you *should* just use ``row['count']``
            # and ``row['amount']``
            counts[month_span][row['address']] += int(row.get('count', 0))
            amounts[month_span][row['address']] += float(row.get('amount', 0.00))

print(counts)
print(amounts)
Note that you are still operating on, as you said, "millions of transactions". Whichever way you go about it, it is going to take a while, because you are doing the same thing several million times. If you want to see where your current code spends its time, you can profile it. I have found line profiler easy to use and it works well.
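As a rough sketch of what that profiling step could look like with the standard-library cProfile (process_transactions here is a hypothetical wrapper around the CSV loop above, not a function from the posted code):

import cProfile
import pstats

# Profile one full pass over the CSV and print the 20 most expensive
# calls sorted by cumulative time.
cProfile.run('process_transactions()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)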
Most likely, since you are doing what you do a few million times, you will not be able to speed this up much without dropping down to a lower-level language, e.g. Cython, C or C++. That will speed some things up, but writing the code will definitely be much harder.