Numpy concatenation dimension mismatch

Date: 2013-11-09 18:55:17

Tags: numpy, concatenation

I keep running into a problem with numpy concatenation that I cannot figure out, and I am hoping someone has hit and solved the same issue. I am trying to concatenate two arrays created by scikit-learn's TfidfVectorizer and LabelBinarizer, but I get the error "arrays must have same number of dimensions", even though the inputs are a (77946, 12157) array and a (77946, 1000) array respectively. (A reproducible example is at the bottom, as requested in the comments.)

TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
tagvect2=preprocessing.LabelBinarizer()

tagvect2.fit(DS['location2'].tolist())
TV.fit(DS['tweet'])
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
A=tagvect2.transform(DS['location2'])
B=TV.transform(DS['tweet'])
print A.shape  # prints (77946, 12157)
print B.shape  # prints (77946, 1000)
pdb.set_trace()
Xtrain=np.concatenate([A,B.todense()],axis=1)

I initially thought that the fact that B is encoded as a sparse matrix might be causing the problem, but converting it to a dense matrix did not fix it. I ran into the same problem using hstack.
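
To make the failure concrete, here is a small self-contained sketch of the variants I have been trying (toy arrays rather than my real features; the scipy.sparse.hstack call at the end is only something I am considering, not something I have verified on my data):

import numpy as np
from scipy import sparse

A = np.array([[1, 0], [0, 1], [1, 1]])        # dense ndarray, shape (3, 2)
B = sparse.csr_matrix(np.eye(3))              # sparse matrix, shape (3, 3)

# Passing the sparse matrix straight to np.concatenate reproduces the error for me:
#   np.concatenate([A, B], axis=1)  ->  ValueError: arrays must have same number of dimensions

# What I expected to work once B is densified:
X1 = np.concatenate([A, B.todense()], axis=1)  # shape (3, 5) on this toy data
X2 = np.hstack([A, B.todense()])               # same idea via hstack

# Possibly the right tool for mixing sparse blocks (just an assumption on my part):
X3 = sparse.hstack([sparse.csr_matrix(A), B]).todense()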

What is even stranger is that adding a third matrix, from a second LabelBinarizer, does not cause the error:

TV.fit(DS['tweet'])
tagvect.fit(DS['state'].tolist())
tagvect2.fit(DS['location'].tolist())
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
Xtrain=pd.DataFrame(np.concatenate([tagvect.transform(DS['state']),tagvect2.transform(DS['location']),TV.transform(DS['tweet'])],axis=1))

Here is the error message:

Traceback (most recent call last):
  File "smallerdimensions.py", line 49, in <module>
    Xtrain=pd.DataFrame(np.concatenate((A,B.todense()),axis=1))
ValueError: arrays must have same number of dimensions

Thanks for any help you can offer. Here is a reproducible example:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
import numpy as np


tweets=["Jazz for a Rainy Afternoon","RT: @mention: I love rainy days.", "Good Morning Chicago!"]
location=["Oklahoma", "Oklahoma","Illinois"]

DS=pd.DataFrame({"tweet":tweets,"location":location})



TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()

DS['location']=DS['location'].fillna("none")

tagvect.fit(DS['location'].tolist())
TV.fit(DS['tweet'])
print "before problem"
print DS['tweet']
print DS['location']
print tagvect.transform(DS['location'])
print tagvect.transform(DS['location']).shape
print TV.transform(DS['tweet']).shape
print TV.transform(DS['tweet'])
print TV.transform(DS['tweet']).todense()
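# the next line is the one that raises "ValueError: arrays must have same number of dimensions"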
print np.concatenate([tagvect.transform(DS['location']),TV.transform(DS['tweet'])],axis=1)

Numpy is v1.6.1, pandas is v0.12.0, scikit-learn is 0.14.1.

0 Answers:

There are no answers to this question yet.