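Building a pandas DataFrame gets slower after every cycle. How can I keep the speed up?

Time: 2020-02-09 00:59:56

Tags: python pandas performance dataframe

I am trying to use pandas to generate a large DataFrame for my data analysis. The data looks like: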

RNAME Start End Count
Chr1   1     3    1
Chr1   2     5    1
Chr1   4     6    1
Chr1   5     9    2
Chr1   2     5    1
...

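I found that if I raise the row limit to 10^7, the program seems to run forever and never finishes the task. So I inserted timing checks into the code and found that, as the total number of rows grows, the time needed to add the same number of rows increases sharply. Can anyone help me solve this problem? Here is the code: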

import pandas as pd
import time

start1 = time.perf_counter()
step_size = 5
new_df = pd.DataFrame(columns=['RNAME', 'start', 'end', 'central'])
start = 0
windowsize = 100
i = 0
while i <= 200000:
    if i == 0:
        t1 = time.perf_counter()
    if i == 10000:
        t2 = time.perf_counter()
    if i == 20000:
        t3 = time.perf_counter()
    if i == 30000:
        t4 = time.perf_counter()
    if i == 40000:
        t5 = time.perf_counter()
    if i == 50000:
        t6 = time.perf_counter()
    if i == 60000:
        t7 = time.perf_counter()
    if i == 70000:
        t8 = time.perf_counter()
    if i == 80000:
        t9 = time.perf_counter()
    if i == 90000:
        t10 = time.perf_counter()
    if i == 100000:
        t11 = time.perf_counter()
    if i == 110000:
        t12 = time.perf_counter()
    if i == 120000:
        t13 = time.perf_counter()
    if i == 130000:
        t14 = time.perf_counter()
    if i == 140000:
        t15 = time.perf_counter()
    if i == 150000:
        t16 = time.perf_counter()
    if i == 160000:
        t17 = time.perf_counter()
    if i == 170000:
        t18 = time.perf_counter()
    if i == 180000:
        t19 = time.perf_counter()
    if i == 190000:
        t20 = time.perf_counter()
    df = pd.DataFrame([['chr1', start+i, i + start + windowsize - 1, i + start + round(windowsize/2)-1]], columns=['RNAME', 'start', 'end', 'central'])
    i += step_size
    new_df = pd.concat([new_df, df], sort=False)
new_df.reset_index(inplace=True, drop=True)
new_df.to_csv(f'chr1_{start}_window_{windowsize}_step_{step_size}.bed', header=False, index=False, sep='\t')
end1 = time.perf_counter()
print(f'process finished in {round(end1 - start1, 2)} second(s)') 
print(f'the first 10000 lines finished in {round(t2-t1, 2)} secs')
print(f'the second 10000 lines finished in {round(t3-t2, 2)} secs')
print(f'the third 10000 lines finished in {round(t4-t3, 2)} secs')
print(f'the fourth 10000 lines finished in {round(t5-t4, 2)} secs')
print(f'the fifth 10000 lines finished in {round(t6-t5, 2)} secs')
print(f'the sixth 10000 lines finished in {round(t7-t6, 2)} secs')
print(f'the seventh 10000 lines finished in {round(t8-t7, 2)} secs')
print(f'the eighth 10000 lines finished in {round(t9-t8, 2)} secs')
print(f'the ninth 10000 lines finished in {round(t10-t9, 2)} secs')
print(f'the tenth 10000 lines finished in {round(t11-t10, 2)} secs')
print(f'the eleventh 10000 lines finished in {round(t12-t11, 2)} secs')
print(f'the twelfth 10000 lines finished in {round(t13-t12, 2)} secs')
print(f'the thirteenth 10000 lines finished in {round(t14-t13, 2)} secs')
print(f'the fourteenth 10000 lines finished in {round(t15-t14, 2)} secs')
print(f'the fifteenth 10000 lines finished in {round(t16-t15, 2)} secs')
print(f'the sixteenth 10000 lines finished in {round(t17-t16, 2)} secs')
print(f'the seventeenth 10000 lines finished in {round(t18-t17, 2)} secs')
print(f'the eighteenth 10000 lines finished in {round(t19-t18, 2)} secs')
print(f'the nineteenth 10000 lines finished in {round(t20-t19, 2)} secs')

The results are as follows:

process finished in 204.99 second(s)
the first 10000 lines finished in 2.6 secs
the second 10000 lines finished in 3.33 secs
the third 10000 lines finished in 4.06 secs
the fourth 10000 lines finished in 4.86 secs
the fifth 10000 lines finished in 5.64 secs
the sixth 10000 lines finished in 6.5 secs
the seventh 10000 lines finished in 7.28 secs
the eighth 10000 lines finished in 8.14 secs
the ninth 10000 lines finished in 8.81 secs
the tenth 10000 lines finished in 9.71 secs
the eleventh 10000 lines finished in 10.4 secs
the twelfth 10000 lines finished in 11.56 secs
the thirteenth 10000 lines finished in 12.25 secs
the fourteenth 10000 lines finished in 13.18 secs
the fifteenth 10000 lines finished in 13.96 secs
the sixteenth 10000 lines finished in 14.76 secs
the seventeenth 10000 lines finished in 15.83 secs
the eighteenth 10000 lines finished in 16.66 secs
the nineteenth 10000 lines finished in 17.24 secs
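
Note: each batch in the timings above takes roughly 0.8 seconds longer than the one before it. That pattern is what one would expect when every pd.concat call copies the whole accumulated frame into a new object, so the cost of one iteration grows with the number of rows already stored. Below is a minimal sketch of two common ways around this, not a definitive implementation. It assumes the same columns and arithmetic as the code in the question; the function names and the `last_i` parameter are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd


def build_windows_from_rows(start=0, windowsize=100, step_size=5, last_i=200000):
    """Collect plain Python rows in a list and build the DataFrame once."""
    half = round(windowsize / 2)
    rows = []
    i = 0
    while i <= last_i:
        # Appending to a list does not copy the rows that are already stored.
        rows.append(('chr1', start + i, i + start + windowsize - 1, i + start + half - 1))
        i += step_size
    # A single DataFrame construction replaces one pd.concat per row.
    return pd.DataFrame(rows, columns=['RNAME', 'start', 'end', 'central'])


def build_windows_vectorized(start=0, windowsize=100, step_size=5, last_i=200000):
    """Same table, computed column-wise with NumPy instead of a Python loop."""
    half = round(windowsize / 2)
    # Offsets i = 0, step_size, 2*step_size, ... up to and including last_i.
    offsets = np.arange(0, last_i + 1, step_size)
    return pd.DataFrame({
        'RNAME': 'chr1',  # the scalar is broadcast to every row
        'start': start + offsets,
        'end': start + offsets + windowsize - 1,
        'central': start + offsets + half - 1,
    })
```

Either result can be written out with the same to_csv call as in the question (header=False, index=False, sep='\t'). The list-based version stays close to the original loop and also works when each row needs arbitrary per-row logic; the NumPy version only applies when every column can be expressed as array arithmetic, as is the case here.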

0 answers:

No answers