I am trying to create a simple model that predicts the next word in a sentence. I have a large .txt file that contains sentences separated by '\n'. I also have a vocabulary file that lists every unique word in the .txt file together with a unique ID. I used the vocabulary file to convert the words in the corpus into their corresponding IDs. Now I want to build a simple model that reads the IDs from the txt file and finds the word pairs and how many times each pair was seen in the corpus. I managed to write the code below:
tuples = [[]]   # array the word pairs are stored in
data = []       # array the pair frequencies are stored in
data.append(0)  # the tuples array starts with an empty element at the beginning for some reason;
                # adding a zero to the beginning of the frequency array keeps the indexes of the two arrays aligned
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line into an array of words
    tupleIndex = 0
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples:  # if the word pair was seen before,
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1  # increment the frequency of said pair
        else:
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])  # if the word pair was never seen before,
            data.append(1)                                                         # add the pair to the list and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write the pairs to the txt file
    for pair in tuples:
        if (len(pair) > 0):  # if the pair is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    # blank lines separate the two data sets
    # write the frequencies of the pairs to the txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to run fine for the first few thousand lines. Then things start to slow down, because the tuples list keeps growing and I have to search the entire list to check whether the next word pair has been seen before or not. I managed to process 50k lines in 30 minutes, but my corpus is much larger, with millions of lines. Is there a more efficient way to store and search the word pairs? A matrix would probably be much faster, but my unique word count is about 300,000. That means I would have to create a 300k * 300k matrix with integers as the data type, and even after taking advantage of a symmetric matrix it would still need far more memory than I have.
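(For scale: a dense 300,000 × 300,000 matrix has 9 × 10^10 cells, which works out to roughly 360 GB with 4-byte integers and about twice that with 8-byte integers.)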
I tried using memmap from numpy to store the matrix on disk instead of in memory, but it required about 500 GB of free disk space.
Then I looked into sparse matrices and found that I could store only the non-zero values together with their corresponding row and column numbers, which is what I did in my code.
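For illustration, a rough sketch of what such a sparse pair counter could look like with scipy (this assumes scipy is installed, the IDs in markovData.txt are zero-based integers, and a vocabulary of about 300,000 words; it is not my exact code):

from scipy.sparse import dok_matrix

VOCAB_SIZE = 300000  # assumed vocabulary size, taken from the numbers above

# counts[i, j] = how many times word ID j was seen right after word ID i;
# the DOK (dictionary-of-keys) format only stores the non-zero entries
counts = dok_matrix((VOCAB_SIZE, VOCAB_SIZE), dtype=int)

with open("markovData.txt") as f:
    for line in f:
        ids = [int(token) for token in line.split()]
        for first, second in zip(ids, ids[1:]):
            counts[first, second] += 1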
Right now the model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train on a bigger corpus to get better results. What can I do to make this word-pair lookup code more efficient?
Thanks.
Edit: Thanks for all the answers; I can now process my corpus of about 500k lines in roughly 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {}  # empty dict
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line into an array of words
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict:  # if the word pair was seen before,
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  # increment the frequency of said pair
        else:
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1   # if the word pair was never seen before,
                                                                               # add the pair to the dict and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)
end = time.time()
print(end - start)

keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
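As a usage note, one simple way to turn these pair counts into a next-word guess is to pick the most frequently observed follower of the current word. A minimal sketch, assuming myDict has been filled as above (predict_next and the example ID "42" are just illustrative):

def predict_next(word, pair_counts):
    # gather every pair whose first element is the given word
    followers = {second: count for (first, second), count in pair_counts.items() if first == word}
    if not followers:
        return None  # the word was never seen as the first element of a pair
    return max(followers, key=followers.get)  # most frequently observed follower

print(predict_next("42", myDict))  # e.g. guess which word ID follows "42"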
Answer 0 (score: 1)
As far as I understand, you are trying to build a hidden Markov model using the frequencies of n-grams (word tuples of length n). Maybe just try a more efficiently searchable data structure, for example a nested dictionary. It could have the form
{ID_word1: {ID_word1: x1, ..., ID_wordk: y1}, ..., ID_wordk: {ID_word1: xn, ..., ID_wordk: yn}}
This means you only have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This will improve your performance, since you no longer have to search a constantly growing list of tuples. x and y represent the occurrence counts, which should be incremented whenever a tuple is encountered. (Never use the built-in function count()!)
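For example, a minimal sketch of that nested-dictionary idea (collections.defaultdict is used here purely for convenience; the file name comes from the question and the example IDs are made up):

from collections import defaultdict

# outer key: first word ID, inner key: second word ID, value: pair count
pair_counts = defaultdict(lambda: defaultdict(int))

with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):
            pair_counts[first][second] += 1

# a lookup is now two hash probes instead of a scan over an ever-growing list
print(pair_counts["12"]["7"])  # count of the made-up pair 12 -> 7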
Answer 1 (score: 1)
I would also look into collections.Counter, a data structure made for exactly your task. A Counter object behaves like a dictionary, but it counts the occurrences of each key. You can use it by simply incrementing a word pair whenever you encounter it:
from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over consecutive word pairs in the line
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can build your list of tuples the way you already do and then simply pass it to a Counter at the end to compute the frequencies:
word_counts = Counter(word_tuple_list)
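For instance (word_tuple_list is just a small made-up sample of pairs here):

from collections import Counter

word_tuple_list = [("1", "5"), ("5", "9"), ("1", "5")]  # made-up sample pairs
word_counts = Counter(word_tuple_list)
print(word_counts.most_common(1))  # [(('1', '5'), 2)]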