Why do I get a zero matrix when I call the .toarray() method?

Asked: 2019-07-15 08:55:41

Tags: python deep-learning nlp tf-idf n-gram

I am trying to represent my data as vectors with a vectorizer.

x_train.shape = (23931, 0)

When I fit and transform the data, I get:

(0, 17032)    0.4519992833718229
(0, 12962)    0.6307521889900021
(0, 4736)     0.6307521889900021
(1, 11281)    0.4672884777844598
(1, 27612)    0.5073391887405501
(1, 5332)     0.600334059729709
(1, 7620)     0.404780734257955
(2, 11281)    0.4642233618674704
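As I understand it, these "(row, col)  value" lines are just the sparse (coordinate-style) print of the matrix that fit_transform returns. A tiny sketch of how such triples map onto the dense array from .toarray() (toy values built directly with scipy, not my real data):

from scipy.sparse import csr_matrix

# Toy values, not the real matrix: three stored entries in a 2 x 6 matrix
rows = [0, 0, 1]
cols = [2, 4, 1]
vals = [0.45, 0.63, 0.47]
m = csr_matrix((vals, (rows, cols)), shape=(2, 6))

print(m)            # prints "(0, 2)  0.45"-style triples, like the output above
print(m.toarray())  # dense 2 x 6 array; every position not listed is 0.0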

I have already tried all the steps in the code using x_train = my_data.head():

array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
       [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]])

It worked. This is the code:

import pandas as pd
import re
import numpy as np
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

df = pd.read_excel('./data_file')

# Symbols replaced by a space vs. symbols removed entirely
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-zа-яё #+_]')

def preprocess_text(text):
    # Normalize punctuation before vectorizing
    tokens = REPLACE_BY_SPACE_RE.sub(' ', text)
    tokens = BAD_SYMBOLS_RE.sub('', tokens)
    tokens = re.sub(r'[?|!|\'|"|#]', r'', tokens)
    tokens = re.sub(r'[.|,|)|(|\|/]', r' ', tokens)
    return tokens

x_train = df['data train name'].apply(preprocess_text)

# Bigram TF-IDF: ngram_range=(2, 2) keeps only two-word features
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
x = vectorizer.fit_transform(x_train)
x.toarray()

But when I call the toarray method on the full data series, the result is:

(None, array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))

I do not understand why the output is a zero matrix.
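To make the question concrete, this is the kind of check I would run to tell a genuinely all-zero matrix from one that only shows zeros in the printed corners (a sketch; x and vectorizer refer to the snippet above):

print(x.shape)                      # (number of documents, number of bigram features)
print(x.nnz)                        # how many non-zero values are actually stored
print(len(vectorizer.vocabulary_))  # how many bigram features were learned

row0 = x[0].toarray()               # one full row as a dense array
print(row0[row0 != 0])              # the non-zero TF-IDF weights of document 0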

0 Answers:

There are no answers yet.