Computing a windowed co-occurrence matrix for a fixed vocabulary over a corpus of sentences

Time: 2019-08-12 09:30:32

Tags: python-3.x machine-learning sparse-matrix countvectorizer

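I want to compute a co-occurrence matrix from a fixed vocabulary of 2000 words and a corpus of sentences, using a window size of 5. I can't work out the logic for finding the vocab words in a sentence and then maintaining a counter while respecting the window size.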

The corpus contains close to 81,000 sentences, and the vocab word list has 2000 entries.

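I looked at the solution to a similar question, but it does not compare the words in the corpus against the words in my vocabulary before incrementing the count: Co occurance matrix for tfidf vectorizer for top 2000 words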
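The comparison step described above could be sketched as follows. This is only an illustration on toy data: `word_to_idx` and `vocab_positions` are hypothetical names, and the three-word `vocab` stands in for `top_2k_words`. A dictionary lookup gives each vocab word a fixed row/column index and filters out everything else.

```python
# Toy stand-ins (assumptions): the real vocab would be top_2k_words.
vocab = ["cat", "dog", "fish"]

# Map each vocab word to its row/column index in the matrix.
word_to_idx = {word: k for k, word in enumerate(vocab)}

def vocab_positions(tokens, word_to_idx):
    """Return (position, vocab index) pairs for the tokens found in the vocab."""
    return [
        (pos, word_to_idx[tok])
        for pos, tok in enumerate(tokens)
        if tok in word_to_idx
    ]

tokens = "the cat chased the dog".split()
hits = vocab_positions(tokens, word_to_idx)
print(hits)  # [(1, 0), (4, 1)]
```

Keeping the original token positions matters here, because the window is measured in the full sentence rather than among vocab words only.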

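import numpy as np

window = 5
vocab = top_2k_words  # list of the top 2000 words (defined elsewhere)
length = len(vocab)   # 2000
# Map each vocab word to its row/column index in the matrix
word_to_idx = {w: k for k, w in enumerate(vocab)}
# m[a, b] counts how often vocab[a] and vocab[b] occur within the window
m = np.zeros((length, length))

def cal_occ(sentence, m):
    # sentence is assumed to be a list of tokens
    for i, word in enumerate(sentence):
        if word not in word_to_idx:
            continue  # skip words that are not in the vocab
        # look at up to `window` tokens on each side of position i
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i and sentence[j] in word_to_idx:
                m[word_to_idx[word], word_to_idx[sentence[j]]] += 1

for sentence in corpus:  # corpus is assumed to hold tokenized sentences
    cal_occ(sentence, m)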

The expected result is a sparse matrix of size 2000x2000.
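Since the expected result is sparse, one option is to store only the non-zero counts. The sketch below is not a definitive implementation: `cooccurrence_counts` is a hypothetical helper, the toy sentences and three-word vocab are made up, sentences are assumed to be tokenized already, and the window is taken as "at most `window` positions apart". Counts are accumulated in a `collections.Counter` keyed by `(row, col)`.

```python
from collections import Counter

def cooccurrence_counts(sentences, vocab, window=5):
    """Count vocab-word pairs that occur at most `window` positions apart."""
    word_to_idx = {word: k for k, word in enumerate(vocab)}
    counts = Counter()  # (row, col) -> count; absent keys mean zero
    for tokens in sentences:
        # Keep only vocab words, remembering their position in the sentence.
        hits = [(pos, word_to_idx[tok])
                for pos, tok in enumerate(tokens) if tok in word_to_idx]
        for a, (pos_a, idx_a) in enumerate(hits):
            for pos_b, idx_b in hits[a + 1:]:
                if pos_b - pos_a > window:
                    break  # hits are ordered by position, so stop early
                # Record the pair in both directions to keep the matrix
                # symmetric (a repeated word adds 2 to its diagonal entry).
                counts[idx_a, idx_b] += 1
                counts[idx_b, idx_a] += 1
    return counts

# Toy data (assumption): real inputs would be the corpus and top_2k_words.
sentences = [
    "the cat chased the dog".split(),
    "a fish and a cat".split(),
]
counts = cooccurrence_counts(sentences, ["cat", "dog", "fish"], window=5)
print(sorted(counts.items()))
```

If SciPy is available, the keys and values can be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(2000, 2000))` to obtain an actual sparse matrix of the expected size.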

0 answers:

No answers yet