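I want to compute a co-occurrence matrix from a corpus of sentences, using a fixed vocabulary of 2000 words and a window size of 5. I can't work out the logic for finding the vocab words in each sentence and then maintaining a counter while respecting that window size.
The corpus contains close to 81,000 sentences, and the vocab word list has 2000 entries.
I looked at the solution to a similar question, "Co occurance matrix for tfidf vectorizer for top 2000 words", but it does not match or compare the words in the corpus against the words in my vocabulary before incrementing the count. Here is what I have so far: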
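import numpy as np

window = 5
vocab = top_2k_words  # list of the top 2000 words
length = len(vocab)  # 2000
word_to_idx = {w: k for k, w in enumerate(vocab)}  # word -> row/column index
m = np.zeros((length, length))  # m[a, b] counts co-occurrences of vocab[a] and vocab[b]

def cal_occ(sentence, m):
    # Assumption: each sentence is a whitespace-separated string.
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        if word not in word_to_idx:
            continue  # skip words that are not in the vocabulary
        # Neighbours up to `window` positions on either side, clipped to the sentence.
        for j in range(max(i - window, 0), min(i + window + 1, len(tokens))):
            if j != i and tokens[j] in word_to_idx:
                m[word_to_idx[word], word_to_idx[tokens[j]]] += 1

for sentence in corpus:
    cal_occ(sentence, m)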
The expected result is a sparse matrix of size 2000x2000.
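Since most of the 2000x2000 cells will stay at zero, one option is to store only the nonzero counts. The following is a minimal sketch of that idea using only the standard library, not a definitive implementation. The function name `cooccurrence_counts` and the toy data are hypothetical, and it assumes each sentence is already a list of tokens and that the window covers up to 5 positions on each side.

```python
from collections import Counter

def cooccurrence_counts(sentences, vocab, window=5):
    """Count vocab-word pairs that appear within `window` tokens of each other."""
    word_to_idx = {w: k for k, w in enumerate(vocab)}
    counts = Counter()  # (row, col) -> count; only nonzero cells are stored
    for tokens in sentences:
        # Map tokens to vocab indices once; None marks out-of-vocabulary words,
        # which keep their position so the window is measured in original tokens.
        ids = [word_to_idx.get(t) for t in tokens]
        for i, a in enumerate(ids):
            if a is None:
                continue
            # Look only to the right, then record both directions (symmetric matrix).
            for b in ids[i + 1 : i + 1 + window]:
                if b is not None:
                    counts[(a, b)] += 1
                    counts[(b, a)] += 1
    return counts

# Toy data standing in for the real corpus and top_2k_words.
toy_vocab = ["cat", "dog", "fish"]
toy_corpus = [["the", "cat", "saw", "a", "dog"], ["fish", "and", "cat"]]
counts = cooccurrence_counts(toy_corpus, toy_vocab, window=5)

# Coordinate-format triples: row indices, column indices, values.
rows, cols, vals = zip(*((r, c, v) for (r, c), v in counts.items()))
```

If SciPy is available, the triples should be usable as `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(len(vocab), len(vocab)))` to obtain the sparse matrix. Scanning only to the right of each position and recording both `(a, b)` and `(b, a)` halves the inner loop while giving the same symmetric counts as scanning both sides.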