我在尝试将doc2vec
模型应用于某些文本时遇到错误。我关注的教程是here。但是,我似乎无法在某些新的文本信息上“复制”结果。
我已经阅读了有关此问题及其原因的其他SO帖子,因为我有一个空列表,但我不知道为什么会有这个空列表。
代码:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import csv
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
list_id = list(df["id"])
list_def = list(df["text"])
tagged_data = [TaggedDocument(words=word_tokenize(term_def.lower()), tags=[list_id[i]]) for i, term_def in enumerate(list_def)]
max_epochs = 500
vec_size = 100
alpha = 0.025
model = Doc2Vec(vector_size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
if epoch % 100 == 0:
print('iteration {0}'.format(epoch))
model.train(tagged_data,
total_examples=model.corpus_count,
epochs=model.epochs)
model.alpha -= 0.0002
model.min_alpha = model.alpha
doc_tags = list(model.docvecs.doctags.keys())
X = model[doc_tags]
我遇到的错误与代码的最后两行有关。
ValueError: need at least one array to concatenate
数据:
d = {'text': ["Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football.[1][2] These different variations of football are known as football codes.", "Rugby union, commonly known in most of the world simply as rugby,[3] is a contact team sport which originated in England in the first half of the 19th century.[4] One of the two codes of rugby football, it is based on running with the ball in hand. In its most common form, a game is between two teams of 15 players using an oval-shaped ball on a rectangular field with H-shaped goalposts at each end.", "Tennis is a racket sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to maneuver the ball in such a way that the opponent is not able to play a valid return. The player who is unable to return the ball will not gain a point, while the opposite player will.", "Formula One (also Formula 1 or F1) is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. The word formula in the name refers to the set of rules to which all participants' cars must conform.[1] A Formula One season consists of a series of races, known as Grands Prix (French for 'grand prizes' or 'great prizes'), which take place worldwide on purpose-built circuits and on public roads."], 'id': [123, 1234, 12345, 123456]}
df = pd.DataFrame(data=d)
编辑:
iteration 0
iteration 100
iteration 200
iteration 300
iteration 400
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-45c9e3dc04ad> in <module>()
36
37 doc_tags = list(model.docvecs.doctags.keys())
---> 38 X = model[doc_tags]
~\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in __getitem__(self, tag)
961 return self.docvecs[tag]
962 return self.wv[tag]
--> 963 return vstack([self[i] for i in tag])
964
965 def __str__(self):
~\Anaconda3\lib\site-packages\numpy\core\shape_base.py in vstack(tup)
232
233 """
--> 234 return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
235
236 def hstack(tup):
ValueError: need at least one array to concatenate