I keep running into a numpy concatenation problem that I can't make sense of, and I'm hoping someone has hit and solved the same issue. I'm trying to concatenate two arrays produced by scikit-learn's TfidfVectorizer and LabelBinarizer, but I get the error "arrays must have same number of dimensions", even though the inputs are a (77946, 12157) array and a (77946, 1000) array respectively. (As requested in the comments, a reproducible example is at the bottom.)
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
tagvect2=preprocessing.LabelBinarizer()
tagvect2.fit(DS['location2'].tolist())
TV.fit(DS['tweet'])
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
A=tagvect2.transform(DS['location2'])
B=TV.transform(DS['tweet'])
print A.shape
print B.shape
pdb.set_trace()
Xtrain=np.concatenate([A,B.todense()],axis=1)
I initially thought that B being encoded as a sparse matrix might be causing the problem, but converting it to a dense matrix does not fix it. I ran into the same problem using hstack.
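For reference, here is a minimal self-contained check on toy data (the shapes are made up). My guess, and it is only a guess, is that np.concatenate coerces each input with np.asarray, and that a scipy sparse matrix does not come out of that coercion as a 2-D array:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.ones((3, 2))               # plain 2-D ndarray, like LabelBinarizer output
sparse = csr_matrix(np.ones((3, 4)))  # sparse, like TfidfVectorizer output

# A plain ndarray keeps its dimensionality under np.asarray...
print(np.asarray(dense).ndim)   # 2
# ...but a sparse matrix does not become a 2-D array here,
# which would explain the "same number of dimensions" complaint.
print(np.asarray(sparse).ndim)

# Mixing the two in np.concatenate fails on toy data too:
try:
    np.concatenate([dense, sparse], axis=1)
except ValueError as e:
    print("concatenate failed:", e)
```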
Stranger still, adding a third LabelBinarizer matrix does not trigger the error:
TV.fit(DS['tweet'])
tagvect.fit(DS['state'].tolist())
tagvect2.fit(DS['location'].tolist())
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
Xtrain=pd.DataFrame(np.concatenate([tagvect.transform(DS['state']),tagvect2.transform(DS['location']),TV.transform(DS['tweet'])],axis=1))
Here is the error message:
Traceback (most recent call last):
File "smallerdimensions.py", line 49, in <module>
Xtrain=pd.DataFrame(np.concatenate((A,B.todense()),axis=1))
ValueError: arrays must have same number of dimensions
Thanks for any help you can offer. Here is a reproducible example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
import numpy as np
tweets=["Jazz for a Rainy Afternoon","RT: @mention: I love rainy days.", "Good Morning Chicago!"]
location=["Oklahoma", "Oklahoma","Illinois"]
DS=pd.DataFrame({"tweet":tweets,"location":location})
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
DS['location']=DS['location'].fillna("none")
tagvect.fit(DS['location'].tolist())
TV.fit(DS['tweet'])
print "before problem"
print DS['tweet']
print DS['location']
print tagvect.transform(DS['location'])
print tagvect.transform(DS['location']).shape
print TV.transform(DS['tweet']).shape
print TV.transform(DS['tweet'])
print TV.transform(DS['tweet']).todense()
print np.concatenate([tagvect.transform(DS['location']),TV.transform(DS['tweet'])],axis=1)
Numpy is v1.6.1, pandas is v0.12.0, scikit-learn is 0.14.1.