I came up with these two approaches. Is there a better one?
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
>>> print(df.sum().sum())
42
>>> print(df.values.sum())
42
I just want to make sure I'm not missing something more obvious.
Answer 0 (score: 30)
# Copy each FRAG line to out.txt, prefixed with the most recent IMAGE line; HISTO lines are skipped.
with open('in.txt', 'r') as fin, open('out.txt', 'w') as fout:
    prefix = ''  # no IMAGE line seen yet
    for line in fin:
        if line.startswith('HISTO'):
            continue
        elif line.startswith('IMAGE'):
            prefix = line.strip()
        elif line.startswith('FRAG'):
            fout.write(prefix + ' ' + line)
import subprocess
# Similar idea with awk; note that it reads input.txt and prints every line that is not IMAGE or HISTO.
with open('out.txt', 'w') as fout:
    subprocess.run(["awk", "/^IMAGE/{img=$0;next} /^HISTO/{next} {print img,$0}", "input.txt"], stdout=fout)
Sum the underlying NumPy array directly:
df.to_numpy().sum()
NumPy's sum method is typically faster.
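One thing worth checking before switching is how missing values are handled. The following is a minimal sketch, not part of the original answer: it reuses the frame from the question and adds a hypothetical frame `df_nan` (an assumption, introduced only for illustration) to show that pandas skips NaN by default while a plain NumPy sum propagates it.

```python
import numpy as np
import pandas as pd

# Frame from the question: both approaches give the same total.
df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
total_pandas = df.sum().sum()
total_numpy = df.to_numpy().sum()

# Hypothetical frame with one missing value (illustration only).
df_nan = pd.DataFrame({'A': [5.0, np.nan, 7.0], 'B': [7.0, 8.0, 9.0]})

# DataFrame.sum uses skipna=True by default, so the NaN is ignored.
pandas_skips_nan = df_nan.sum().sum()

# ndarray.sum propagates NaN, so the result is NaN.
numpy_propagates = df_nan.to_numpy().sum()

# np.nansum ignores NaN and matches the pandas result.
numpy_nansum = np.nansum(df_nan.to_numpy())

print(total_pandas, total_numpy)
print(pandas_skips_nan, numpy_propagates, numpy_nansum)
```

If the data may contain NaN, `np.nansum` is probably the closer drop-in replacement for the pandas behavior.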
Answer 1 (score: 0)
Adding some numbers to back this up:
import numpy as np, pandas as pd
import timeit
df = pd.DataFrame(np.arange(int(1e6)).reshape(500000, 2), columns=list("ab"))
def pandas_test():
    return df['a'].sum()
def numpy_test():
    return df['a'].to_numpy().sum()
timeit.timeit(numpy_test, number=1000)  # 0.5032469799989485
timeit.timeit(pandas_test, number=1000)  # 0.6035906639990571
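So on our machine, even just summing a Series is about 20% slower with pandas than with NumPy (roughly 0.60 s versus 0.50 s for 1,000 calls)!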
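The timing above covers a single column. To compare the two whole-frame approaches from the question, a similar benchmark can be sketched as follows. The function names `pandas_total` and `numpy_total` and the repeat count are assumptions chosen for illustration, and the timings will vary by machine and library version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the benchmark above; int64 is requested explicitly so the
# total cannot overflow on platforms where the default integer is 32-bit.
df = pd.DataFrame(
    np.arange(int(1e6), dtype=np.int64).reshape(500000, 2),
    columns=list("ab"),
)

def pandas_total():
    # Column sums first, then the sum of those sums.
    return df.sum().sum()

def numpy_total():
    # Sum over the underlying 2-D array in one call.
    return df.to_numpy().sum()

# Both approaches must agree before their speed is compared.
assert pandas_total() == numpy_total()

pandas_time = timeit.timeit(pandas_total, number=100)
numpy_time = timeit.timeit(numpy_total, number=100)
print(f"pandas: {pandas_time:.4f} s, numpy: {numpy_time:.4f} s")
```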