I am using TfidfVectorizer to vectorize my data. Here is my code:
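from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage
import matplotlib.pyplot as plt

# read the text file line by line, stripping surrounding whitespace
corpus = []
with open(address, 'r') as f:
    for line in f:
        corpus.append(line.strip())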
# TF-IDF
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
X = tfidf_matrix.toarray()
# generate the linkage matrix
Z = linkage(X, 'average')
# set cut-off to 1
max_d = 1 # max_d as in max_distance
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=50,
    leaf_rotation=90,
    leaf_font_size=10,
    show_contracted=True,
    annotate_above=1,  # useful in small plots so annotations don't overlap
    max_d=max_d,
)
plt.show()
But I get:
Traceback (most recent call last):
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 108, in <module>
    tf_idf(source_file_addr)
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 80, in tf_idf
    Z = linkage(X, 'average')
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
    y = distance.pdist(y, metric)
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
    dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
MemoryError
If I remove toarray(), I get:
Traceback (most recent call last):
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 107, in <module>
    tf_idf(source_file_addr)
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 79, in tf_idf
    Z = linkage(X, 'average')
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\cluster\hierarchy.py", line 694, in linkage
    y = _convert_to_double(np.asarray(y, order='c'))
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\cluster\hierarchy.py", line 1216, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
I also tried using:
HashingVectorizer(n_features=200, norm=None)
and I got this error:
ValueError: setting an array element with a sequence.
I only have 8 GB of RAM, and my text file contains 100,000 lines.
Can anyone help me solve this problem? Many thanks.
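As a rough sanity check on the MemoryError, here is a minimal sketch of the arithmetic. It assumes the failing allocation is the `np.empty((m * (m - 1)) // 2, dtype=np.double)` call shown in the traceback, and that `m` equals the number of lines in my file. `condensed_distance_bytes` is a hypothetical helper name used only for illustration; it is not part of SciPy.

```python
def condensed_distance_bytes(m: int, itemsize: int = 8) -> int:
    """Bytes needed for a condensed pairwise-distance vector over m rows.

    Mirrors the size expression (m * (m - 1)) // 2 from the traceback,
    with itemsize=8 for np.double.
    """
    return (m * (m - 1)) // 2 * itemsize


n_docs = 100_000  # assumption: one row per line of the text file
needed = condensed_distance_bytes(n_docs)
print(f"{needed / 1e9:.1f} GB")  # prints "40.0 GB"
```

If that estimate is right, the distance vector alone needs about 40 GB, which is far more than my 8 GB, regardless of how many features the vectorizer produces.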