How can I speed up this word-pair lookup algorithm?

Asked: 2019-05-25 11:38:16

Tags: python performance nlp processing-efficiency

I'm trying to create a simple model that predicts the next word in a sentence. I have a large .txt file containing sentences separated by '\n'. I also have a vocabulary file that lists every unique word in the .txt file together with a unique ID, and I used it to convert the words in the corpus into their corresponding IDs. Now I want to build a simple model that reads the IDs from the txt file and finds the word pairs and the number of times each pair was seen in the corpus. I managed to write the following code:

tuples = [[]] #array for word tuples to be stored in
data = []   #array for tuple frequencies to be stored in

data.append(0) #tuples starts with an empty element because it was initialized as [[]].
            # Adding a zero to the beginning of the frequency array keeps the indexes of the two arrays aligned

with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
    lineIndex = 0
    for line in contentData:
        tmpArray = line.split() #split line to array of words
        tupleIndex = 0
        tmpArrayIndex = 0
        for tmpArrayIndex in range(len(tmpArray) - 1): #do this for every word except the last one since the last word has no word after it.
            if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples: #if the word pair was seen before,
                data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1 #increment the frequency of said pair
            else:
                tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]]) #if the word pair was never seen before,
                data.append(1)                                                        #add the pair to the list and set its frequency to 1

        #print every 1000th line to check the progress
        lineIndex += 1
        if ((lineIndex % 1000) == 0):
            print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    #write tuples to txt file
    for pair in tuples:
        if (len(pair) > 0): # skip the empty element left over from the [[]] initialization
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")

    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    #blank lines between the two data blocks

    #write frequencies of the tuples to txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")

    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")

This code seems to run fine for the first few thousand lines. Then things start to slow down, because the tuples list keeps growing and I have to search the whole list to check whether the next word pair has been seen before or not. I managed to process 50k lines in 30 minutes, but my corpus is much larger, with millions of lines. Is there a way to store and search the word pairs more efficiently? A matrix would probably be much faster to look up, but my unique word count is around 300,000. That means I would have to create a 300k * 300k matrix with an integer data type; 300,000² 32-bit counters alone is on the order of 360 GB, so even after exploiting a symmetric matrix it would still need far more memory than I have.

I tried using memmap from numpy to store the matrix on disk instead of in memory, but it required roughly 500 GB of free disk space.

Then I looked into sparse matrices and found that I can store just the non-zero values together with their corresponding row and column numbers. That's what I did in my code.
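A minimal sketch of that sparse idea, assuming the corpus file already contains space-separated integer IDs and that scipy is available (a plain dict keyed by (row, column) pairs would work the same way):

from scipy.sparse import dok_matrix

VOCAB_SIZE = 300_000  # assumed vocabulary size from above

# dictionary-of-keys sparse matrix: only non-zero counts take memory
counts = dok_matrix((VOCAB_SIZE, VOCAB_SIZE), dtype=int)

with open("markovData.txt") as f:
    for line in f:
        ids = [int(token) for token in line.split()]
        for first, second in zip(ids, ids[1:]):
            counts[first, second] += 1  # row = first word ID, column = next word ID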

Right now the model works, but it is not very good at guessing the next word correctly (around 8% success rate). I need to train on a bigger corpus to get better results. How can I make this word-pair lookup code more efficient?

Thanks.


EDIT: Thank you all for the answers; I can now process a corpus of about 500k lines in 15 seconds. I'm adding the final version of the code below for people with a similar problem:

import numpy as np
import time

start = time.time()
myDict = {} # empty dict

with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
    lineIndex = 0
    for line in contentData:
        tmpArray = line.split() #split line to array of words
        tmpArrayIndex = 0
        for tmpArrayIndex in range(len(tmpArray) - 1): #do this for every word except the last one since the last word has no word after it.
            if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict: #if the word pair was seen before,
                myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  #increment the frequency of said pair
            else:
                myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1 #if the word pair was never seen before,
                                                                                 #add it to the dict and set its frequency to 1

        #print every 1000th line to check the progress
        lineIndex += 1
        if ((lineIndex % 1000) == 0):
            print(lineIndex)


end = time.time()
print(end - start)

keyText = ""
valueText = ""

for key1,key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1,key2]) + " ")


with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)

with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)

2 Answers:

Answer 0 (score: 1)

As far as I understand, you are trying to build a hidden Markov model from the frequencies of n-grams (word tuples of length n). Maybe just try a more efficiently searchable data structure, for example a nested dictionary. The format could be

{ID_word1:{ID_word1:x1,... ID_wordk:y1}, ...ID_wordk:{ID_word1:xn, ...ID_wordk:yn}}.

This means you have at most k ** 2 dictionary entries for 2-word tuples (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should improve your performance, because you no longer have to search a constantly growing list of tuples. x and y are the occurrence counts, which you should increment whenever you encounter a tuple. (Never use the built-in function count()!)
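A minimal sketch of that nested-dictionary layout, assuming the corpus file from the question with one sentence of space-separated word IDs per line:

from collections import defaultdict

# outer key: first word ID, inner key: following word ID, value: count
pair_counts = defaultdict(dict)

with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):
            inner = pair_counts[first]
            inner[second] = inner.get(second, 0) + 1

# lookup example: how often was word 7 seen right after word 12?
# pair_counts["12"].get("7", 0)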

Answer 1 (score: 1)

I would also look into collections.Counter, a data structure made for your task. A Counter object behaves like a dictionary, but counts the occurrences of its key entries. You can use it by simply incrementing a word pair whenever you encounter it:

from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    # iterate over word pairs
    word_counts[(word1, word2)] += 1

Alternatively, you can build your list of tuples the way you already do and then simply pass it to Counter as an object to compute the frequencies at the end:

word_counts = Counter(word_tuple_list)
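
As a rough sketch of how the counts could then be used for the actual prediction (word_tuple_list and predict_next are hypothetical names, and the pairs shown are made-up IDs):

from collections import Counter

# hypothetical list of (word, next_word) pairs built from the corpus
word_tuple_list = [("12", "7"), ("12", "7"), ("12", "30"), ("5", "12")]
word_counts = Counter(word_tuple_list)

def predict_next(word):
    # keep only pairs starting with `word` and return the most frequent continuation
    candidates = Counter({pair: n for pair, n in word_counts.items() if pair[0] == word})
    if not candidates:
        return None
    return candidates.most_common(1)[0][0][1]

print(predict_next("12"))  # -> "7"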