I am trying to represent my data as vectors with a vectorizer.
x_train.shape = (23931,)
When I fit and transform the data, I get:
(0, 17032)	0.4519992833718229
(0, 12962)	0.6307521889900021
(0, 4736)	0.6307521889900021
(1, 11281)	0.4672884777844598
(1, 27612)	0.5073391887405501
(1, 5332)	0.600334059729709
(1, 7620)	0.404780734257955
(2, 11281)	0.4642233618674704
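For comparison, here is a minimal sketch (with a made-up two-document corpus, not my real data) showing that fit_transform returns a scipy sparse matrix whose repr lists each nonzero weight next to its (row, column) coordinate, just like the output above:

```python
# Hypothetical tiny corpus; TfidfVectorizer.fit_transform returns a
# scipy.sparse matrix, and printing it shows "(row, col)  weight" lines
# for the nonzero entries only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["red apple", "green apple"]
vec = TfidfVectorizer()
m = vec.fit_transform(docs)
print(type(m))  # a scipy sparse matrix type
print(m)        # (row, col)  weight  lines for each nonzero entry
```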
I have already tried all the steps in the code on a small sample, x_train = my_data.head():
array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
[0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]])
It worked. Here is the full code:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_excel('./data_file')

# Raw strings so the backslashes in the character classes
# are not treated as escape sequences.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-zа-яё #+_]')

def preprocess_text(text):
    # Replace separator-like punctuation with spaces, drop everything else.
    tokens = REPLACE_BY_SPACE_RE.sub(' ', text)
    tokens = BAD_SYMBOLS_RE.sub('', tokens)
    tokens = re.sub(r'[?!\'"#]', '', tokens)
    tokens = re.sub(r'[.,()\\/]', ' ', tokens)
    return tokens

x_train = df['data train name'].apply(preprocess_text)

vectorizer = TfidfVectorizer(ngram_range=(2, 2))
x = vectorizer.fit_transform(x_train)
x.toarray()
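A sketch of what I mean, using a small stand-in corpus rather than my data: each row of a wide tf-idf matrix has only a handful of nonzero columns, and I checked the nonzero count with .nnz against the dense view from toarray():

```python
# Stand-in corpus (hypothetical). With bigrams, each document contributes
# only a few nonzero columns, so .nnz counts the stored weights and the
# dense view from toarray() should agree with it.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["one two three", "two three four"]
m = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(corpus)
dense = m.toarray()
print(m.nnz)               # number of stored nonzero weights
print((dense != 0).sum())  # same count, computed from the dense view
```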
But the result of the toarray method on the full data Series is:
(None, array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]))
I do not understand why the output is a matrix of zeros.