Co-occurrence matrix for the top 2000 words of a TfidfVectorizer

Time: 2018-11-01 16:45:44

Tags: python machine-learning nlp similarity tfidfvectorizer

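I fitted a TfidfVectorizer on my text data with max_features = 2000, and the resulting matrix has shape (100000, 2000).

When I try to compute the co-occurrence matrix with the following code: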

length = 2000
m = np.zeros([length,length]) # n is the count of all words
def cal_occ(sentence,m):
    for i,word in enumerate(sentence):
        print(i)
        print(word)
        for j in range(max(i-window,0),min(i+window,length)):
            print(j)
            print(sentence[j])
            m[word,sentence[j]]+=1
for sentence in tf_vec:
    cal_occ(sentence, m)

I get the following error.

0
(0, 1210)   0.20426932204609685
(0, 191)    0.23516811545499153
(0, 592)    0.2537746177804585
(0, 1927)   0.2896119458034052
(0, 1200)   0.1624114163299802
(0, 1856)   0.24376566018277918
(0, 1325)   0.2789314085220367
(0, 756)    0.15365704375851477
(0, 1130)   0.293489555928974
(0, 346)    0.21231046306681553
(0, 557)    0.2036759579760878
(0, 1036)   0.29666992324872365
(0, 264)    0.36435609585838674
(0, 1701)   0.242619998334931
(0, 1939)   0.33934107208095693
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-96-ad505b6df734> in <module>()
 11             m[word,sentence[j]]+=1
 12 for sentence in tf_vec:
 ---> 13     cal_occ(sentence, m)

 <ipython-input-96-ad505b6df734> in cal_occ(sentence, m)
  9             print(j)
 10             print(sentence[j])
 ---> 11             m[word,sentence[j]]+=1
 12 for sentence in tf_vec:
 13     cal_occ(sentence, m)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
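
For context, NumPy raises this message when an index is something other than an integer, a slice, or an integer/boolean array. The value of `word` printed above is a sparse row of tf-idf weights rather than an integer. The sketch below reproduces the same error type with plain NumPy. The float index is only a stand-in for any non-integer object and is an assumption for illustration, not the data from the question.

```python
import numpy as np

m = np.zeros((4, 4))

# Integer indices work as expected.
m[1, 2] += 1

# A non-integer index (a float here, as a stand-in) is rejected.
try:
    m[1.0, 2] += 1
    raised = False
    message = ""
except IndexError as exc:
    raised = True
    message = str(exc)

print(raised)
print(message)
```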

1 answer:

Answer 0: (score: 0)

The problem is most likely here:

for j in range(max(i-window,0),min(i+window,length)):

The min function returns length when i + window goes out of range. Could you try using this instead of the line above?

for j in range(max(i-window,0),min(i+window,length-1)):
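
One way to compare the two lines is to list the indices each one visits. The sketch below uses made-up values for `length`, `window` and `i` (all assumptions, chosen so that `i + window` goes past the end). Python's `range` excludes its stop value, so in this toy run the first line stops at `length - 1` and the second at `length - 2`; neither reaches index `length`.

```python
length = 10   # hypothetical sequence length
window = 3    # hypothetical window size
i = 9         # last valid position, so i + window exceeds length

# Indices visited by the line from the question.
original = list(range(max(i - window, 0), min(i + window, length)))

# Indices visited by the line suggested in this answer.
suggested = list(range(max(i - window, 0), min(i + window, length - 1)))

print(original)
print(suggested)
```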

Hope this helps,

Cheers
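
As a follow-up sketch: a tf-idf matrix records which terms occur in each document but not their order, so one option is to count co-occurrence at the document level instead of within a sliding window. The code below is not the asker's method. It assumes a small dense toy array `X` standing in for `tf_vec` (the values are made up), and indexes `m` with integer column indices taken from `np.nonzero`. The same idea should carry over to a SciPy sparse matrix, though that is not shown here.

```python
import numpy as np

# Toy stand-in for tf_vec: 3 documents x 4 terms (values are made up).
X = np.array([
    [0.5, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.4, 0.0],
    [0.1, 0.0, 0.6, 0.7],
])

# Binary presence matrix: 1 where a term occurs in a document.
B = (X > 0).astype(int)

# Entry [a, b] counts the documents that contain both term a and term b.
C = B.T @ B

# Equivalent explicit loop, using integer column indices for indexing.
m = np.zeros((X.shape[1], X.shape[1]), dtype=int)
for row in X:
    cols = np.nonzero(row)[0]  # integer term indices present in this document
    for a in cols:
        for b in cols:
            m[a, b] += 1

print(C)
```

The diagonal of `C` holds the number of documents containing each term; it can be zeroed out if only pairs of distinct terms are of interest.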