Question

I'm doing some text mining on a corpus of words, and my text file output has 3000 lines like this:

dns 11 11 [2,355,706,1063,3139,3219,3471,3472,3473,4384,4444]

xhtml 8 11 [1651,2208,2815,3487,3517,4480,4481,4504]

javascript 18 18 [49,50,175,176,355,706,1063,1502,1651,2208,2280,2815,3297,4068,4236,4480,4481,4504]

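Each line holds a word, the number of lines in which it occurs, its total number of occurrences, and the numbers of those lines.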

I'm trying to calculate the chi-squared value, and that text file is the input for my code:

import nltk

measure = nltk.collocations.BigramAssocMeasures()

dicto = {} 
for i in lines :
    tokens = nltk.wordpunct_tokenize(i)
    m = tokens[0]       # m is the word
    list_i = tokens[4:]
    list_i.pop()        # drop the closing bracket
    # keep only the line numbers, dropping the comma tokens
    list_i = [x for x in list_i if x != ',']
    dicto[m] = list_i   # for each word, store its list of line numbers

#for each word I calculate the Chi-squared with every other word 
#and my problem is starting right here i think
#The "for" loop and the z = .....


for word1 in dicto :
    x=dicto[word1]
    vector = []

    for word2 in dicto :    
        y=dicto[word2]
        z=[val for val in x if val in y]

        #Contingency Matrix
        m11 = cpt-(len(x)+len(y)-len(z))
        m12 = len(x)-len(z)
        m21 = len(y)-len(z)
        m22 = len(z)

        n_ii =m11
        n_ix =m11+m21
        n_xi =m11+m12
        n_xx =m11+m12+m21+m22 

        Chi_squared = measure.chi_sq(n_ii, (n_ix, n_xi), n_xx)

        #I compare with the minimum value to check independence between words
        if Chi_squared > 3.841 :
            vector.append([word1, word2, round(Chi_squared, 3)])

    #The correlations calculated
    #I sort my vector in a descending way
    final=sorted(vector, key=lambda vector: vector[2],reverse = True)

    print word1
    #I take the 4 best scores
    for i in final[:4]:
        print i,

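My problem is that the calculation takes a lot of time (I'm talking about hours!!). Is there anything I can change? What can I do to improve my code? Any other Python structures? Any ideas?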

Answer 1

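There are some opportunities for speedup, but my first concern is vector. Where is it initialized? In the posted code, it gets n^2 entries and is sorted n times! That seems unintentional. Should it be cleared? Should final be outside the loop?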

final = sorted(vector, key=lambda vector: vector[2], reverse=True)

is functional, but the lambda parameter shadows the outer vector, which makes for ugly scoping. Better is:

final = sorted(vector, key=lambda entry: entry[2], reverse=True)

In general, to tackle timing problems, consider using a profiler.
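
As a rough sketch of what that might look like, the standard-library cProfile and pstats modules can report where the time goes. The function count_overlaps and the toy data below are hypothetical stand-ins for the nested loop in the question, not the asker's actual code, and the snippet assumes Python 3 rather than the Python 2 of the question.

```python
import cProfile
import io
import pstats


def count_overlaps(dicto):
    """Hypothetical stand-in for the question's nested loop over all word pairs."""
    total = 0
    for x in dicto.values():
        for y in dicto.values():
            # List-based intersection, as in the question: O(len(x) * len(y)).
            total += len([val for val in x if val in y])
    return total


# Toy data (assumed): 50 words, each with 100 line numbers.
toy = {"w%d" % i: list(range(i, i + 100)) for i in range(50)}

profiler = cProfile.Profile()
profiler.enable()
result = count_overlaps(toy)
profiler.disable()

# Collect the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative on the script file gives a similar report without touching the code.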

Answer 2

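First, if the line numbers for each word are unique, use sets instead of lists: finding the intersection of two sets is much faster than finding the intersection of two lists (especially if the lists are not sorted).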
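
A minimal sketch of the difference, using the xhtml and javascript line numbers from the sample output in the question. The helper names overlap_list and overlap_set are made up for illustration, and the snippet assumes Python 3.

```python
def overlap_list(x, y):
    """List-based overlap, as in the question: scans y once per element of x."""
    return len([val for val in x if val in y])


def overlap_set(x, y):
    """Set-based overlap: membership tests are hash lookups, not linear scans."""
    return len(x & y)


# Line numbers taken from the sample output in the question.
xhtml = [1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504]
javascript = [49, 50, 175, 176, 355, 706, 1063, 1502, 1651, 2208,
              2280, 2815, 3297, 4068, 4236, 4480, 4481, 4504]

shared_list = overlap_list(xhtml, javascript)
shared_set = overlap_set(set(xhtml), set(javascript))
print(shared_list, shared_set)
```

Both calls count the same shared lines. The conversion to sets should happen once, when dicto is built, not inside the nested loop.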

Second, precompute the list lengths - right now you compute each of them twice for every step of the inner loop.
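
One way this could look, as a sketch only: the function contingency_tables is hypothetical, it assumes the line numbers are already stored as sets, and it assumes that cpt in the question is the total number of lines (passed here as total_lines). It targets Python 3.

```python
def contingency_tables(line_sets, total_lines):
    """Yield (word1, word2, m11, m12, m21, m22) for every ordered pair of words.

    line_sets maps each word to a set of line numbers; total_lines plays the
    role of `cpt` in the question (assumed to be the total number of lines).
    """
    # Compute each length once, outside the nested loops.
    sizes = {word: len(lines) for word, lines in line_sets.items()}
    for word1, x in line_sets.items():
        len_x = sizes[word1]
        for word2, y in line_sets.items():
            len_y = sizes[word2]
            len_z = len(x & y)  # size of the intersection, computed once
            m11 = total_lines - (len_x + len_y - len_z)
            m12 = len_x - len_z
            m21 = len_y - len_z
            m22 = len_z
            yield word1, word2, m11, m12, m21, m22


# Toy data (assumed), with 10 lines in total.
toy = {"a": {1, 2, 3}, "b": {2, 3, 4, 5}}
tables = {(w1, w2): cells for w1, w2, *cells in contingency_tables(toy, 10)}
print(tables[("a", "b")])
```

The four cells can then be fed to measure.chi_sq exactly as in the question's code.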

Third - use [...] for this kind of calculation.

Python: how to optimize calculations?
