I have extracted a large set of words from a Wikipedia corpus. An example word vector:
[ 0.56694877 0.79432029 -0.00573941 0.37489545 -0.11976419 ... 0.76672393]
What my data looks like:
Word Frequency    Word Vector    Known (label)
165515            vector1        0
626252            vector2        1
....              ....           ...
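I have a sample of one person's vocabulary (Turkish). Some of the words are known to that person and some are unknown. Based on this data, I am trying to predict whether an arbitrary word is known or unknown.
The vectors were created with Word2Vec (gensim):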
from sklearn.svm import SVC
import csv
import numpy as np
from gensim.models import word2vec
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('text.model.bin', binary=True)
with open('1000.csv', newline='') as csvfile:
    listwords = csv.reader(csvfile)
    features = []
    labels = []
    n = 0
    for row in listwords:
        if n>=199:
            break
        try:
            line = [int(row[2]),np.array(model[row[0]])]
            features.append(line)
            labels.append(row[1])
            n+=1
        except KeyError:
            pass
            features.append([])
            labels.append([])
            n+=1

clf = SVC()
clf = clf.fit(features, labels)
vocab_obj = model.vocab['anne']
print (clf.predict([vocab_obj.count,model['anne']]))
I am using scikit-learn for this. The problem is that I can't figure out how to use these word vectors as features.
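For reference, scikit-learn estimators expect a 2-D numeric array with one flat row per sample. A minimal sketch of that layout follows, assuming the frequency is simply placed in front of the vector components. The random vectors, frequencies, and labels are stand-ins invented for illustration, not the real Word2Vec data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for the real data (assumptions): 20 words, 5-dimensional vectors.
n_words, dim = 20, 5
vectors = rng.normal(size=(n_words, dim))           # would come from model[word]
frequencies = rng.integers(1, 1_000_000, n_words)   # would come from the CSV
labels = np.array([0, 1] * (n_words // 2))          # 1 = known, 0 = unknown

# One flat row per word: [frequency, v_1, ..., v_dim].
X = np.hstack([frequencies.reshape(-1, 1), vectors])

clf = SVC()
clf.fit(X, labels)

# predict also expects a 2-D array, even for a single word.
new_row = np.hstack([[123456], rng.normal(size=dim)]).reshape(1, -1)
prediction = clf.predict(new_row)
```

Raw frequencies are several orders of magnitude larger than the vector components, so scaling the columns (for example with `sklearn.preprocessing.StandardScaler`) may be worth trying before fitting an SVM.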
This is the error raised by my script (classifier.py):
Traceback (most recent call last):
  File "classifier.py", line 28, in <module>
    clf = clf.fit(features, labels)
  File "/home/mica/.local/lib/python3.5/site-packages/sklearn/svm/base.py", line 151, in fit
    X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
  File "/home/mica/.local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/mica/.local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
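The message most likely comes from the row layout: each row is `[count, vector]`, a scalar next to a sequence, which NumPy cannot convert into a rectangular float matrix. The empty lists appended in the `except` branch would cause the same kind of shape mismatch, and the `predict` call passes the same nested layout. A minimal sketch that reproduces the failure with NumPy alone, using made-up three-dimensional vectors as stand-ins:

```python
import numpy as np

vec_a = np.array([0.1, 0.2, 0.3])   # stand-in for model['some_word']
vec_b = np.array([0.4, 0.5, 0.6])   # stand-in for another word's vector

# Nested layout: each row holds a scalar and a whole vector, so the rows
# cannot be converted to a rectangular float64 matrix.
nested = [[165515, vec_a], [626252, vec_b]]
try:
    np.array(nested, dtype=np.float64)
    nested_ok = True
except ValueError:
    nested_ok = False

# Flat layout: count and vector components side by side in one 1-D row.
flat = [np.concatenate(([165515], vec_a)),
        np.concatenate(([626252], vec_b))]
X = np.array(flat, dtype=np.float64)
```

With the flat layout `X` has one row per word and one column per feature, which is the shape `fit` and `predict` check for.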