Here is my code:
import pandas as pd
import time
from progressbar import ProgressBar
pbar = ProgressBar()
df = pd.read_excel('tt.xlsx', header=None)
text=df.values.T.tolist()
text = [[k.lower()] for l in text for k in l]
dj_count = {}
start = time.perf_counter()
dj_count.update({''.join(i) : text.count(i) for i in pbar(text)})
time.sleep(0.01)
print ("time taken for script--", round(time.clock()-start , 2), "seconds")
df = pd.DataFrame(list(zip(dj_count, dj_count.values())),columns=['Phrase', 'Count']).sort_values(['Count'],ascending=False)
df.head(12)
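I read the Excel file, convert it to a list, lowercase the text found in the columns, start a timer, count how often each phrase is repeated across the whole data set (with a progress bar and the elapsed time), and then print the 12 most common phrases with their counts.
When I run this on 20k rows it finishes in a few seconds, but at 50k rows it takes 2 minutes, and 100k rows take 10 minutes. The time grows a lot with every extra 10k-20k rows in the Excel file.
How can I speed this up? PS: I have 8 GB of RAM and an i5-4590.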
The output for 50k rows looks like this:
100% (51819 of 51819) |##################| Elapsed Time: 0:01:56 Time: 0:01:56
time taken for script-- 116.87 seconds
__main__:12: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
Out[2]:
Phrase Count
31 bla bla bla ... 2340
214 lo yolo yolo... 1645
0 gg gg gg lol... 1615
21 bla lol gggg... 1004
6 busy busy ... 800
68 your your y... 620
552 hi hihi hi ... 360
236 okokokokokok... 355
382 thank you ty... 325
58 djdjdjdjdj ... 305
961 gdgdgdgdgdg... 300
400 tyggtyggtyggtyg 285
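One direction that might help, offered as a sketch rather than a tested fix: `text.count(i)` rescans the whole list once for every row, so the work grows roughly with the square of the row count, which would be consistent with the timings above. Counting everything in a single pass with `collections.Counter` avoids the repeated scans. The sample list below is made up and stands in for the flattened cells of `tt.xlsx`, and `count_phrases` is a hypothetical helper name, not something from the original script.

```python
from collections import Counter


def count_phrases(phrases, top=12):
    """Lowercase each phrase, count all of them in one pass,
    and return the `top` most common as (phrase, count) tuples."""
    counts = Counter(p.lower() for p in phrases)
    return counts.most_common(top)


# Made-up sample standing in for the flattened spreadsheet cells.
sample = ['Bla bla', 'gg lol', 'bla bla', 'GG LOL', 'bla bla', 'ty']

result = count_phrases(sample)
for phrase, count in result:
    print(phrase, count)
```

The list of tuples can be handed to `pd.DataFrame(result, columns=['Phrase', 'Count'])` if a table like the one above is wanted. It is already sorted by count in descending order, so no extra `sort_values` call should be needed.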