I am trying to use word2vec in a text classification algorithm. I want to build a vectorizer with word2vec, and I am using the script below. But instead of getting one row per document, I get a matrix of different dimensions for each document. For example, the first document's matrix is 31x100, the second 163x100, the third 73x100, and so on. What I actually need is dimensions of 1x100 for each document, so that I can use them as input features for training a model.
Can anyone help me?
import os
import pandas as pd
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords # Import the stop word list
import gensim
import numpy as np
train = pd.read_csv("Data.csv",encoding='cp1252')
wordnet_lemmatizer = WordNetLemmatizer()
def Description_to_words(raw_Description):
    Description_text = BeautifulSoup(raw_Description).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    words = word_tokenize(letters_only.lower())
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return " ".join(wordnet_lemmatizer.lemmatize(w) for w in meaningful_words)
num_Descriptions = train["Summary"].size
clean_train_Descriptions = []
print("Cleaning and parsing the training set ticket Descriptions...\n")
for i in range(0, num_Descriptions):
    if (i+1) % 1000 == 0:
        print("Description %d of %d\n" % (i+1, num_Descriptions))
    clean_train_Descriptions.append(Description_to_words(train["Summary"][i]))
model = gensim.models.Word2Vec(clean_train_Descriptions, size=100)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        #self.dim = len(word2vec.itervalues().next())
        self.dim = 100

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
a = MeanEmbeddingVectorizer(w2v)
clean_train_Descriptions[1]
a.transform(clean_train_Descriptions[1])
train_Descriptions = []
for i in range(0, num_Descriptions):
    if (i+1) % 1000 == 0:
        print("Description %d of %d\n" % (i+1, num_Descriptions))
    train_Descriptions.append(a.transform(" ".join(clean_train_Descriptions[i])))
Answer 0 (score: 1)
There are 2 problems in your code causing this, and both are easy to fix.

First, Word2Vec requires the sentences to actually be lists of words, not whole sentences as single strings. So return a list from your Description_to_words; don't join:

return [wordnet_lemmatizer.lemmatize(w) for w in meaningful_words]

Since word2vec iterates over each sentence to get the words, it was previously iterating over a string, so you were actually getting character-level embeddings from wv.
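To see why joining is the problem, here is a minimal illustration with toy data (not the real corpus): iterating over a string yields characters, while iterating over a list of tokens yields words, which is why Word2Vec must receive token lists, not joined strings.

```python
joined = "network outage reported"          # what the original code returned
tokens = ["network", "outage", "reported"]  # what Word2Vec actually expects

# Iterating a string walks characters, so the "vocabulary" becomes letters:
print([w for w in joined][:3])   # ['n', 'e', 't']

# Iterating a token list walks whole words, as intended:
print([w for w in tokens])       # ['network', 'outage', 'reported']
```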
Second, there is a similar problem with how you call transform: X should be a list of documents, not a single document. When you do for words in X on a single string, you are actually creating a list of characters and then iterating over that to build the embedding, so your output was one single-character embedding per character in the sentence. Just change it to transform all documents at once:

train_Descriptions = a.transform(clean_train_Descriptions)

(To do them one at a time, wrap the document in a list ([clean_train_Descriptions[1]]) or use a range selector (clean_train_Descriptions[1:2]) to select one.)

With these two changes, you should get 1 row back per input sentence.
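A runnable sketch of the corrected mean-embedding step, using a hypothetical toy 4-dimensional word-vector dict in place of the trained gensim model (the real dim would be 100). Note the input is a list of token lists, so one row comes back per document, and out-of-vocabulary documents fall back to a zero vector:

```python
import numpy as np

# Toy stand-in for the trained w2v dict; keys and dims are made up.
w2v = {
    "server": np.array([1.0, 0.0, 0.0, 0.0]),
    "down":   np.array([0.0, 1.0, 0.0, 0.0]),
    "email":  np.array([0.0, 0.0, 1.0, 0.0]),
}
dim = 4

def mean_embed(docs):
    # docs is a LIST of token lists -- one averaged row per document
    return np.array([
        np.mean([w2v[w] for w in words if w in w2v] or [np.zeros(dim)], axis=0)
        for words in docs
    ])

docs = [["server", "down"], ["email"], ["unknownword"]]
vectors = mean_embed(docs)
print(vectors.shape)  # (3, 4): one 1-by-dim row per document
```

Transforming a single document still works if you wrap it in a list, e.g. mean_embed([docs[0]]) gives a (1, 4) array, which mirrors the answer's [clean_train_Descriptions[1]] suggestion.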