Question

This is my code:

import pandas as pd
import time
from progressbar import ProgressBar
pbar = ProgressBar()
df = pd.read_excel('tt.xlsx', header=None)
text=df.values.T.tolist()
text = [[k.lower()] for l in text for k in l]
dj_count = {}
start = time.perf_counter() 
dj_count.update({''.join(i) : text.count(i) for i in pbar(text)})
time.sleep(0.01)
print ("time taken for script--", round(time.clock()-start , 2), "seconds")   
df = pd.DataFrame(list(zip(dj_count, dj_count.values())),columns=['Phrase', 'Count']).sort_values(['Count'],ascending=False)
df.head(12)

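I read the Excel file, convert it to a list, lowercase the strings found in the columns, start a timer, count how often each phrase is repeated (with a progress bar and the elapsed time), and then print the 12 most common phrases with their counts.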

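When I run this on 20k rows it finishes within a few seconds, but at 50k rows it takes 2 minutes, and at 100k rows it takes 10 minutes. The time grows a lot with every additional 10k-20k rows in the Excel file.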
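A rough cost model that seems consistent with these timings: `text.count(i)` scans the whole list, and it is called once for every row, so the work grows roughly with the square of the row count. The sketch below only illustrates that arithmetic; the helper name `comparisons` is made up for this example, and the model ignores constant factors such as string length.

```python
def comparisons(n_rows: int) -> int:
    """Rough cost model: list.count scans all n rows and is called once per row."""
    return n_rows * n_rows


# Predicted slowdown relative to the 20k-row run under this model.
baseline = comparisons(20_000)
for n in (20_000, 50_000, 100_000):
    print(n, comparisons(n) / baseline)
```

Under this model, going from 50k to 100k rows should cost about four times as much, which is in the same range as the jump from 2 minutes to 10 minutes.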

How can I speed this up? PS: I have 8 GB of RAM and an i5-4590.
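One direction that might help, offered as a minimal sketch rather than a tested fix for this exact file: count every phrase in a single pass with `collections.Counter` instead of calling `list.count` once per row. The function name `count_phrases` and the sample `rows` are invented for illustration, and the `str(cell)` conversion is an assumption (the original code calls `.lower()` directly, which would fail on non-string cells).

```python
from collections import Counter


def count_phrases(rows):
    """Count lower-cased phrases in one pass over a list of rows."""
    counts = Counter()
    for row in rows:
        for cell in row:
            # Assumption: convert to str so numeric or empty cells do not raise.
            counts[str(cell).lower()] += 1
    return counts


# Hypothetical stand-in for df.values.T.tolist().
rows = [["Bla bla", "gg"], ["bla bla", "GG"], ["ok", "bla bla"]]
top = count_phrases(rows).most_common(2)
print(top)
```

With the real data, `rows` would presumably be `df.values.T.tolist()`, and `most_common(12)` would give the 12 most frequent phrases. Each cell is visited once, so the running time should grow roughly in proportion to the number of rows.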

The output for 50k rows looks like this:

100% (51819 of 51819) |##################| Elapsed Time: 0:01:56 Time:  0:01:56
time taken for script-- 116.87 seconds
__main__:12: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
Out[2]: 
     Phrase            Count
31   bla bla bla ...   2340
214  lo yolo yolo...   1645
0    gg gg gg lol...   1615
21   bla lol gggg...   1004
6    busy busy  ...    800
68   your your y...    620
552  hi hihi hi ...    360
236  okokokokokok...   355
382  thank you ty...   325
58   djdjdjdjdj ...    305
961  gdgdgdgdgdg...    300
400  tyggtyggtyggtyg   285

Speeding up Excel processing, or loops in Python

0 answers: