Question

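I have a large number of sentences from a specific domain. I am looking for an open-source code/package that I can feed this data to and that will produce a good, reliable language model (meaning that, given a context, it knows the probability of each word).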

Is there such a code/project?

I saw this GitHub repository: https://github.com/rafaljozefowicz/lm, but it did not work for me.

Answer 1

I suggest writing your own basic implementation. First, we need some sentences:

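import nltk
from nltk.corpus import brown
# nltk.download("brown")  # uncomment on first use to fetch the corpus
words = brown.words()
total_words = len(words)  # raw token count, punctuation included
sentences = list(brown.sents())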

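sentences is now a list of lists. Each sublist represents one sentence, with each word as an element. Now you need to decide whether you want to include punctuation in your model. If you want to remove it, try the following: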

punctuation = {",", ".", ":", ";", "!", "?"}
for i, sentence in enumerate(sentences):
    # Keep only the tokens that are not punctuation marks
    sentences[i] = [word for word in sentence if word not in punctuation]

Next, you need to decide whether you care about capitalization. If you do not, you can remove it like this:

for i, sentence in enumerate(sentences):
    # Lowercase every word in the sentence
    sentences[i] = [word.lower() for word in sentence]

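Next, we need special start and end words so the model can capture which words are valid at the beginning and at the end of a sentence. You should choose start and end words that do not occur in your training data.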

start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
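
The three preprocessing steps above can also be folded into one helper. The following is only a minimal sketch: the function name `preprocess` and its two keyword flags are hypothetical, not part of NLTK, and it runs on a hand-written token list rather than on the Brown corpus.

```python
# Hypothetical helper combining punctuation removal, lowercasing and padding.
PUNCTUATION = {",", ".", ":", ";", "!", "?"}
START, END = "<<START>>", "<<END>>"

def preprocess(sentence, keep_punctuation=False, keep_case=False):
    """Return a new token list: optionally filtered, lowercased, then padded."""
    tokens = list(sentence)
    if not keep_punctuation:
        tokens = [t for t in tokens if t not in PUNCTUATION]
    if not keep_case:
        tokens = [t.lower() for t in tokens]
    # Pad with the start and end markers so sentence boundaries are visible
    return [START] + tokens + [END]

example = preprocess(["I", "am", "the", "Walrus", "."])
print(example)
```

Keeping the two decisions as flags makes it easy to build the model both ways and compare the results.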

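Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is merely a frequency distribution of each word in the corpus: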

new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)

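Now it is time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".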

bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
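
To see concretely what the pairing step produces, here is a plain-Python sketch that pairs each token with its successor, which is what nltk.bigrams yields for a single sentence. The helper name `bigrams_of` is made up for illustration.

```python
def bigrams_of(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

sentence = ["<<START>>", "i", "am", "the", "walrus", "<<END>>"]
pairs = bigrams_of(sentence)
for pair in pairs:
    print(pair)
```

A sentence of n tokens (padding included) always gives n - 1 bigrams.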

Now we can create the frequency distribution:

bigram_fdist = nltk.ConditionalFreqDist(bigrams)

Finally, we want to know the probability of each word in the model:

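def getUnigramProbability(word):
    if word in unigram_fdist:
        # Divide by the number of tokens actually counted (after preprocessing),
        # not by the raw corpus size, so that the probabilities sum to 1
        return unigram_fdist[word] / unigram_fdist.N()
    else:
        return -1 # You should figure out how you want to handle out-of-vocabulary words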

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1 # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
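
The functions above leave out-of-vocabulary handling open. One common option, which is not what the code above does, is add-one (Laplace) smoothing. The sketch below shows it on a two-sentence toy corpus, using collections.Counter instead of NLTK so that it is self-contained. The corpus and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

START, END = "<<START>>", "<<END>>"
# Toy corpus, already preprocessed and padded (an assumption for this sketch)
corpus = [
    [START, "i", "am", "the", "walrus", END],
    [START, "i", "am", "the", "eggman", END],
]

bigram_counts = Counter(pair for s in corpus for pair in zip(s, s[1:]))
# How often each word appears as the first word of a bigram
context_counts = Counter(first for s in corpus for first in s[:-1])
vocabulary = {word for s in corpus for word in s}

def smoothed_bigram_probability(word1, word2):
    """P(word2 | word1) with add-one smoothing over the known vocabulary."""
    # Counter returns 0 for unseen keys, so unseen pairs get a small nonzero mass
    return (bigram_counts[(word1, word2)] + 1) / (context_counts[word1] + len(vocabulary))

def sentence_log_probability(tokens):
    """Sum of log bigram probabilities (chain rule under a bigram assumption)."""
    return sum(math.log(smoothed_bigram_probability(a, b))
               for a, b in zip(tokens, tokens[1:]))

seen = sentence_log_probability(corpus[0])
shuffled = sentence_log_probability([START, "walrus", "the", "am", "i", END])
print(seen > shuffled)
```

Working in log space avoids numerical underflow on long sentences. Add-one smoothing is crude on real corpora, so treat this as a starting point rather than a recommendation.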

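While this is not a framework/library that just builds the model for you, I hope that seeing this code has demystified what goes on in a language model.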

Answer 2

You could try word_language_model from the PyTorch examples. There may be a problem if your corpus is large, because it loads all of the data into memory.
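
If memory is the concern, note that the count-based approach from Answer 1 does not need the whole corpus in memory. The sketch below is unrelated to how word_language_model works internally. It assumes a hypothetical plain-text file with one whitespace-tokenized sentence per line, and it counts bigrams one line at a time.

```python
import os
import tempfile
from collections import Counter

def count_bigrams_streaming(path):
    """Count bigrams one line at a time; only the counts stay in memory."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # the file object yields lines lazily
            tokens = ["<<START>>"] + line.lower().split() + ["<<END>>"]
            counts.update(zip(tokens, tokens[1:]))
    return counts

# Tiny demonstration file standing in for a large corpus
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("i am the walrus\ni am the eggman\n")
    demo_path = tmp.name

demo_counts = count_bigrams_streaming(demo_path)
os.remove(demo_path)
print(demo_counts[("i", "am")])
```

The table of counts still grows with the number of distinct bigrams, so this helps with corpus size but not with an unbounded vocabulary.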
