Question

I have made a compression code, and have tested it on 10 KB text files, which took no less than 3 minutes. However, I've tested it with a 1 MB file, which is the assessment assigned by my teacher, and it takes longer than half an hour. Compared to my classmates, mine is irregularly long. It might be my computer or my code, but I have no idea. Does anyone know any tips or shortcuts into making the speed of my code shorter? My compression code is below, if there are any quicker ways of doing loops, etc. please send me an answer (:

(by the way my code DOES work, so I'm not asking for corrections, just enhancements, or tips, thanks!)

import re #used to enable functions(loops, etc.) to find patterns in text file
import os #used for anything referring to directories(files)
from collections import Counter #used to keep track on how many times values are added

size1 = os.path.getsize('file.txt') #find the size(in bytes) of your file,    INCLUDING SPACES
print('The size of your file is ', size1,)

words = re.findall('\w+', open('file.txt').read()) 
wordcounts = Counter(words) #turns all words into array, even capitals 
common100 = [x for x, it in Counter(words).most_common(100)] #identifies the 200 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)

characters =['$','#','@','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
         'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
         'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
         'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
         'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
         'wwv','xxw','yyx''zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
         'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
         'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
        'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
         'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
         'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
         'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
         'WWV','XXW','YYX''ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
         'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #makes the array literally contain 0 values
        char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common200 are ACTUALLY in words
                place = words.index(i)#find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1


string = words
stringMade = ' '.join(map(str, string))#makes the list into a string so you can put it into a text file
file = open("compression.txt", "w")
file.write(stringMade)#imports everything in the variable 'words' into the new file
file.close()

size2 = os.path.getsize('compression.txt')

no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2 ) ,'percent.'
  'The size of your file now is ', size2)

Answer 1

我认为第一件对性能不利的事情是：

for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...

您正在做的是查看单词w是否在您的common100单词列表中。但是，这个检查可以在O（1）时间内通过使用一个集合完成，而不是循环遍历每个单词的所有前100个单词。

common_words = set(common100)
for w in words:
    if w in common_words:
        ...

Answer 2

使用类似

的内容

word_substitutes = dict(zip(common100, characters))

会给你一个把常用词映射到相应符号的词典。

然后你可以简单地迭代这些词：

# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know its in the
    # 'common' words, and can be replaced by the symbol
    if word in word_substitutes:
        # Replaces the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol

这将提供更好的性能，因为用于公共字符号映射的字典查找在时间上是对数而不是线性的。因此整体复杂性将类似于O（N log（N））而不是O（N ^ 3），您可以从2个嵌套循环中获得.index()调用。

Answer 3

通常，您会执行以下操作：

衡量程序的每个“部分”需要多长时间。您可以使用分析器（例如标准库中的this one），或者只是在代码中添加一些times.append(time.time.now)并计算差异。然后你知道代码的哪一部分很慢。
看看你是否可以改进慢速部分的算法。 gnicholas回答显示了加快速度的一种可能性。 while no<=100似乎很可疑，也许可以改进。此步骤需要了解您使用的算法。请谨慎为您的用例选择最佳数据结构。
如果你不能使用更好的算法（因为你总是使用最好的方法来计算某些东西），你需要自己加速计算。数字填充从numpy中受益，cython基本上可以将python代码编译为C，而numba使用LLVM进行编译。

How do I optimize the speed of my python compression code?

3 个答案: