I keep running into a numpy concatenation problem that I can't make sense of, and I'm hoping someone has hit and solved the same issue. I'm trying to concatenate two arrays produced by scikit-learn's TfidfVectorizer and LabelBinarizer, but I get the error "arrays must have same number of dimensions", even though the inputs are a (77946, 12157) array and a (77946, 1000) array respectively. (As requested in the comments, a reproducible example is at the bottom.)
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
tagvect2=preprocessing.LabelBinarizer()
tagvect2.fit(DS['location2'].tolist())
TV.fit(DS['tweet'])
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
A=tagvect2.transform(DS['location2'])
B=TV.transform(DS['tweet'])
print A.shape
print B.shape
pdb.set_trace()
Xtrain=np.concatenate([A,B.todense()],axis=1)
I initially thought that B being encoded as a sparse matrix might be causing the problem, but converting it to a dense matrix does not fix it. I ran into the same problem using hstack.
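For reference, here is a minimal self-contained check on toy data (the shapes are made up). My guess, and it is only a guess, is that np.concatenate coerces each input with np.asarray, and that a scipy sparse matrix does not come out of that coercion as a 2-D array:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.ones((3, 2))               # plain 2-D ndarray, like LabelBinarizer output
sparse = csr_matrix(np.ones((3, 4)))  # sparse, like TfidfVectorizer output

# A plain ndarray keeps its dimensionality under np.asarray...
print(np.asarray(dense).ndim)   # 2
# ...but a sparse matrix does not become a 2-D array here,
# which would explain the "same number of dimensions" complaint.
print(np.asarray(sparse).ndim)

# Mixing the two in np.concatenate fails on toy data too:
try:
    np.concatenate([dense, sparse], axis=1)
except ValueError as e:
    print("concatenate failed:", e)
```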
Stranger still, adding a third LabelBinarizer matrix does not trigger the error:
TV.fit(DS['tweet'])
tagvect.fit(DS['state'].tolist())
tagvect2.fit(DS['location'].tolist())
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
Xtrain=pd.DataFrame(np.concatenate([tagvect.transform(DS['state']),tagvect2.transform(DS['location']),TV.transform(DS['tweet'])],axis=1))
Here is the error message:
Traceback (most recent call last):
File "smallerdimensions.py", line 49, in <module>
Xtrain=pd.DataFrame(np.concatenate((A,B.todense()),axis=1))
ValueError: arrays must have same number of dimensions
Thanks for any help you can offer. Here is a reproducible example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
import numpy as np
tweets=["Jazz for a Rainy Afternoon","RT: @mention: I love rainy days.", "Good Morning Chicago!"]
location=["Oklahoma", "Oklahoma","Illinois"]
DS=pd.DataFrame({"tweet":tweets,"location":location})
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
DS['location']=DS['location'].fillna("none")
tagvect.fit(DS['location'].tolist())
TV.fit(DS['tweet'])
print "before problem"
print DS['tweet']
print DS['location']
print tagvect.transform(DS['location'])
print tagvect.transform(DS['location']).shape
print TV.transform(DS['tweet']).shape
print TV.transform(DS['tweet'])
print TV.transform(DS['tweet']).todense()
print np.concatenate([tagvect.transform(DS['location']),TV.transform(DS['tweet'])],axis=1)
Numpy is v1.6.1, pandas is v0.12.0, scikit-learn is 0.14.1.