In Python

Time: 2017-06-09 04:05:11

Tags: python gensim word2vec

I am building a doc2vec file by training a gensim model on my data, and I get an error when I process it. I am running the following code:

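import sys

import numpy
from gensim.models import Doc2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

model = Doc2Vec.load('sentiment140.d2v')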

if len(sys.argv) < 4:
    print ("Please input train_pos_count, train_neg_count and classifier!")
    sys.exit()

train_pos_count = int(sys.argv[1])
train_neg_count = int(sys.argv[2])
test_pos_count = 144
test_neg_count = 144

print (train_pos_count)
print (train_neg_count)

vec_dim = 100

print ("Build training data set...")
train_arrays = numpy.zeros((train_pos_count + train_neg_count, vec_dim))
train_labels = numpy.zeros(train_pos_count + train_neg_count)

for i in range(train_pos_count):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    train_arrays[i] = model.docvecs[prefix_train_pos]
    train_labels[i] = 1

for i in range(train_neg_count):
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[train_pos_count + i] = model.docvecs[prefix_train_neg]
    train_labels[train_pos_count + i] = 0


print ("Build testing data set...")
test_arrays = numpy.zeros((test_pos_count + test_neg_count, vec_dim))
test_labels = numpy.zeros(test_pos_count + test_neg_count)

for i in range(test_pos_count):
    prefix_test_pos = 'TEST_POS_' + str(i)
    test_arrays[i] = model.docvecs[prefix_test_pos]
    test_labels[i] = 1

for i in range(test_neg_count):
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[test_pos_count + i] = model.docvecs[prefix_test_neg]
    test_labels[test_pos_count + i] = 0


print ("Begin classification...")
classifier = None
if sys.argv[3] == '-lr':
    print ("Logistic Regressions is used...")
    classifier = LogisticRegression()
elif sys.argv[3] == '-svm':
    print ("Support Vector Machine is used...")
    classifier = SVC()
elif sys.argv[3] == '-knn':
    print ("K-Nearest Neighbors is used...")
    classifier = KNeighborsClassifier(n_neighbors=10)
elif sys.argv[3] == '-rf':
    print ("Random Forest is used...")
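    classifier = RandomForestClassifier()
else:
    # Without this branch, classifier stays None and fit() raises AttributeError.
    print ("Unknown classifier option:", sys.argv[3])
    sys.exit()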

classifier.fit(train_arrays, train_labels)

print ("Accuracy:", classifier.score(test_arrays, test_labels))

I get a KeyError: 'TEST_POS_72'.

I would like to know what I am doing wrong.

1 Answer:

Answer 0: (score: 0)

The error means, quite literally, that no doc-vector with the key ('tag') TEST_POS_72 is part of the model. No document with that tag can have been presented during training.

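You can see a list of all the document tags known to the model in model.docvecs.offset2doctag. If TEST_POS_72 is not there, you cannot access a doc-vector via model.docvecs['TEST_POS_72']. (If that list is empty, the doc-vectors were trained to be accessed by plain int keys, and model.docvecs[72] would be a more appropriate way to access a doc-vector.)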
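The check described above can be sketched without loading the model. The helper name find_missing_tags and the stand-in tag list are hypothetical and used only for illustration; in the asker's script the first argument would presumably be model.docvecs.offset2doctag.

```python
def find_missing_tags(known_tags, prefix, count):
    """Return the tags prefix + '0' .. prefix + str(count - 1) absent from known_tags."""
    known = set(known_tags)
    expected = [prefix + str(i) for i in range(count)]
    return [tag for tag in expected if tag not in known]


# Hypothetical stand-in for model.docvecs.offset2doctag:
# a model that only ever saw 72 positive test documents.
known_tags = ['TEST_POS_' + str(i) for i in range(72)]

# The script expects 144 positive test documents.
missing = find_missing_tags(known_tags, 'TEST_POS_', 144)
print(len(missing), "tags missing, first one:", missing[0])
```

Running a check like this for each of the four prefixes before filling the arrays would show up front whether the hard-coded counts (144 here) match what the model was actually trained on, instead of failing partway through a loop.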

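(Separately, Doc2Vec does not work well on small corpora of only a few hundred documents, and the warning in your screenshot, "Slow version of gensim.models.doc2vec is being used", means that gensim's optimized C-compiled routines are not part of your installation, so training will be 100x or more slower.)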
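One way to check for the slow code path is sketched below. It assumes the gensim.models.doc2vec module exposes a FAST_VERSION attribute that is -1 when the compiled routines are unavailable, as gensim releases from around the time of this question did; the helper name fast_version_status is hypothetical, and the lookup is guarded in case the attribute or gensim itself is missing.

```python
def fast_version_status():
    """Return gensim's doc2vec FAST_VERSION value, or None if it cannot be read."""
    try:
        from gensim.models import doc2vec
    except ImportError:
        # gensim is not installed in this environment.
        return None
    # Assumption: FAST_VERSION is -1 for the pure-Python path, >= 0 when compiled.
    return getattr(doc2vec, "FAST_VERSION", None)


status = fast_version_status()
if status is None:
    print("Could not determine whether the optimized routines are available.")
elif status < 0:
    print("Slow pure-Python routines are in use.")
else:
    print("Optimized compiled routines are in use.")
```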