I am trying to create a simple model that predicts the next word in a sentence. I have a large .txt file that contains sentences separated by '\n'. I also have a vocabulary file that lists every unique word in the .txt file together with a unique ID. I used the vocabulary file to convert the words in the corpus into their corresponding IDs. Now I want to build a simple model that reads the IDs from the txt file and finds the word pairs and how many times each pair was seen in the corpus. I managed to write the code below:
tuples = [[]]   # array the word pairs are stored in
data = []       # array the pair frequencies are stored in
data.append(0)  # the tuples array starts with an empty element at the beginning for some reason;
                # adding a zero to the beginning of the frequency array keeps the indexes of the two arrays aligned
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line into an array of words
    tupleIndex = 0
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples:  # if the word pair was seen before,
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1  # increment the frequency of said pair
        else:
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])  # if the word pair was never seen before,
            data.append(1)                                                         # add the pair to the list and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write the pairs to the txt file
    for pair in tuples:
        if (len(pair) > 0):  # if the pair is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    # blank lines separate the two data sets
    # write the frequencies of the pairs to the txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to run fine for the first few thousand lines. Then things start to slow down, because the tuples list keeps growing and I have to search the entire list to check whether the next word pair has been seen before or not. I managed to process 50k lines in 30 minutes, but my corpus is much larger, with millions of lines. Is there a more efficient way to store and search the word pairs? A matrix would probably be much faster, but my unique word count is about 300,000. That means I would have to create a 300k * 300k matrix with integers as the data type, and even after taking advantage of a symmetric matrix it would still need far more memory than I have.
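(For scale: a dense 300,000 × 300,000 matrix has 9 × 10^10 cells, which works out to roughly 360 GB with 4-byte integers and about twice that with 8-byte integers.)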
I tried using memmap from numpy to store the matrix on disk instead of in memory, but it required about 500 GB of free disk space.
Then I looked into sparse matrices and found that I could store only the non-zero values together with their corresponding row and column numbers, which is what I did in my code.
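For illustration, a rough sketch of what such a sparse pair counter could look like with scipy (this assumes scipy is installed, the IDs in markovData.txt are zero-based integers, and a vocabulary of about 300,000 words; it is not my exact code):

from scipy.sparse import dok_matrix

VOCAB_SIZE = 300000  # assumed vocabulary size, taken from the numbers above

# counts[i, j] = how many times word ID j was seen right after word ID i;
# the DOK (dictionary-of-keys) format only stores the non-zero entries
counts = dok_matrix((VOCAB_SIZE, VOCAB_SIZE), dtype=int)

with open("markovData.txt") as f:
    for line in f:
        ids = [int(token) for token in line.split()]
        for first, second in zip(ids, ids[1:]):
            counts[first, second] += 1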
Right now the model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train on a bigger corpus to get better results. What can I do to make this word-pair lookup code more efficient?
Thanks.
Edit: Thanks for all the answers; I can now process my corpus of about 500k lines in roughly 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {}  # empty dict
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line into an array of words
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict:  # if the word pair was seen before,
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  # increment the frequency of said pair
        else:
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1   # if the word pair was never seen before,
                                                                               # add the pair to the dict and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)
end = time.time()
print(end - start)

keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
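As a usage note, one simple way to turn these pair counts into a next-word guess is to pick the most frequently observed follower of the current word. A minimal sketch, assuming myDict has been filled as above (predict_next and the example ID "42" are just illustrative):

def predict_next(word, pair_counts):
    # gather every pair whose first element is the given word
    followers = {second: count for (first, second), count in pair_counts.items() if first == word}
    if not followers:
        return None  # the word was never seen as the first element of a pair
    return max(followers, key=followers.get)  # most frequently observed follower

print(predict_next("42", myDict))  # e.g. guess which word ID follows "42"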
Answer 0 (score: 1)
As far as I understand, you are trying to build a hidden Markov model using the frequencies of n-grams (word tuples of length n). Maybe just try a more efficiently searchable data structure, for example a nested dictionary. It could have the form
{ID_word1: {ID_word1: x1, ..., ID_wordk: y1}, ..., ID_wordk: {ID_word1: xn, ..., ID_wordk: yn}}
This means you only have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This will improve your performance, since you no longer have to search a constantly growing list of tuples. x and y represent the occurrence counts, which should be incremented whenever a tuple is encountered. (Never use the built-in function count()!)
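For example, a minimal sketch of that nested-dictionary idea (collections.defaultdict is used here purely for convenience; the file name comes from the question and the example IDs are made up):

from collections import defaultdict

# outer key: first word ID, inner key: second word ID, value: pair count
pair_counts = defaultdict(lambda: defaultdict(int))

with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):
            pair_counts[first][second] += 1

# a lookup is now two hash probes instead of a scan over an ever-growing list
print(pair_counts["12"]["7"])  # count of the made-up pair 12 -> 7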
Answer 1 (score: 1)
I would also look into collections.Counter, a data structure made for exactly your task. A Counter object behaves like a dictionary, but it counts the occurrences of each key. You can use it by simply incrementing a word pair whenever you encounter it:
from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over consecutive word pairs in the line
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can build your list of tuples the way you already do and then simply pass it to a Counter at the end to compute the frequencies:
word_counts = Counter(word_tuple_list)
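For instance (word_tuple_list is just a small made-up sample of pairs here):

from collections import Counter

word_tuple_list = [("1", "5"), ("5", "9"), ("1", "5")]  # made-up sample pairs
word_counts = Counter(word_tuple_list)
print(word_counts.most_common(1))  # [(('1', '5'), 2)]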