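I am trying to implement a bag-of-words (BoW) model, for which I need BoW feature vectors. I have several training folders, each containing several text files. I have built a vocabulary from the entire training data, and I am now passing the whole vocabulary together with the documents, one at a time. But I don't know what to do next. I implemented this code following one source: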
# map each vocabulary term to a column index
corpus = dict((term, index) for index, term in enumerate(sorted(total_vocab)))
# one zero-initialised count vector per document
fvs = [[0] * num_words for _ in input]
print(fvs)
for i, doc_terms in enumerate(input):
    fv = fvs[i]
    for term in doc_terms:
        fv[corpus[term]] += 1
It gives me some vectors of size 90,000 that are all zeros.
What should I do to get the feature vectors?
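One possible reason for the all-zero output is that `print(fvs)` runs before the counting loop, so it shows the freshly initialised vectors rather than the counts. The following is only a minimal sketch of the counting step, assuming each document has already been tokenized into a list of words; `bow_vectors`, `docs`, and `vocab` are hypothetical names that are not part of the code above.

```python
def bow_vectors(docs, total_vocab):
    """Return one count vector per tokenized document (hypothetical helper)."""
    # Map each vocabulary term to a fixed column index.
    term_index = {term: i for i, term in enumerate(sorted(total_vocab))}
    vectors = [[0] * len(term_index) for _ in docs]
    for row, doc_terms in zip(vectors, docs):
        for term in doc_terms:
            col = term_index.get(term)
            if col is not None:  # skip words that are not in the vocabulary
                row[col] += 1
    return vectors


# Toy data, assumed for illustration only.
docs = [["the", "cat", "sat"], ["the", "dog", "the"]]
vocab = {"cat", "dog", "sat", "the"}
vectors = bow_vectors(docs, vocab)
print(vectors)  # inspect the vectors only after the counts are filled in
```

With this toy data the sorted vocabulary is `cat, dog, sat, the`, so the script prints `[[1, 0, 1, 1], [0, 1, 0, 2]]`. Note that if a document were passed as a raw string instead of a list of words, the inner loop would iterate over single characters, so tokenizing first matters.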
Here is a snippet of the class:
import numpy as np


class FeatureVector(object):
    def __init__(self, vocabsize, numdata):
        self.vocabsize = vocabsize
        # np.int was removed in NumPy 1.24; the builtin int works as the dtype here
        self.X = np.zeros((numdata, self.vocabsize), dtype=int)
        self.Y = np.zeros((numdata,), dtype=int)

    def make_featurevector(self, input, classid, total_vocab):
        """
        Takes the input documents and outputs the feature vectors as X and class ids as Y.
        """
        # map each vocabulary term to a column index
        corpus = dict((term, index) for index, term in enumerate(sorted(total_vocab)))
        # fvs = [[0]*num_words for _ in input]
        # print(fvs)
        # for i, doc_terms in enumerate(input):
        #     fv = fvs[i]
        #     for term in doc_terms:
        #         fv[corpus[term]] += 1
        # var = np.zeros((len(total_vocab), 1), dtype=np.int)
        # for raw_word in input:
        #     for eachkey in total_vocab.keys():
        #         if eachkey in raw_word:
        #             var[raw_word] += 1
        # print fvs
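For comparison, here is a rough sketch of how a class like this could fill `X` and `Y` one document at a time. It is not the original design: `FeatureVectorSketch`, `add_document`, `term_index`, and `next_row` are hypothetical names, and it assumes that each call receives one tokenized document (a list of words) together with an integer class id.

```python
import numpy as np


class FeatureVectorSketch:
    """Hypothetical variant of FeatureVector that fills X and Y row by row."""

    def __init__(self, total_vocab, numdata):
        # Fixed term -> column mapping, built once instead of on every call.
        self.term_index = {t: i for i, t in enumerate(sorted(total_vocab))}
        self.X = np.zeros((numdata, len(self.term_index)), dtype=np.int64)
        self.Y = np.zeros((numdata,), dtype=np.int64)
        self.next_row = 0  # assumed bookkeeping: next unused row of X

    def add_document(self, doc_terms, classid):
        """Count the terms of one tokenized document into the next row."""
        row = self.next_row
        for term in doc_terms:
            col = self.term_index.get(term)
            if col is not None:  # ignore out-of-vocabulary words
                self.X[row, col] += 1
        self.Y[row] = classid
        self.next_row += 1


# Toy data, assumed for illustration only.
fv = FeatureVectorSketch({"cat", "dog", "sat", "the"}, numdata=2)
fv.add_document(["the", "cat", "sat"], classid=0)
fv.add_document(["the", "dog", "the"], classid=1)
print(fv.X)
print(fv.Y)
```

Building the term-to-column mapping once in the constructor avoids re-sorting a 90,000-word vocabulary for every document. For a vocabulary of that size most entries of `X` stay zero, so a sparse matrix might be worth considering, but that is outside this sketch.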