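If I have both text and numeric values, I want to convert the text to numbers (with CountVectorizer as a general example) and end up with a single X matrix to pass to an estimator.
When dealing with huge sparse matrices, how can I merge a sparse matrix and a numpy array into a single X while respecting memory constraints?
Here is a sample DataFrame: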
import pandas as pd

df = pd.DataFrame({
    'Term': ['johns company', 'johns company home', 'home repair',
             'home remodeling', 'johns company home repair system',
             'home repair systems', 'home systems', 'repair a home',
             'home remodeling ideas', 'home repair system'],
    'Metric1': [319434, 21644, 113185, 73210, 8907, 23016, 36789, 48025, 29624,
                6944],
    'Metric2': [13270, 5015, 4301, 3722, 2502, 2190, 1934, 2468, 2706, 904],
    'Metric3': [24170.83, 11034.36, 24137.57, 16548.53, 4777.27, 9565.45,
                8014.29, 9041.97, 7612.31, 4045.37],
    'Metric4': [1.0, 1.1, 2.9, 2.7, 1.1, 2.0, 3.0, 1.9, 1.6, 1.5],
    'y': [712, 406, 297, 215, 190, 0, 125, 100, 94, 93]
}, columns=['Term', 'Metric1', 'Metric2', 'Metric3', 'Metric4', 'y'])
## df looks like this
                               Term  Metric1  Metric2   Metric3  Metric4    y
0                     johns company   319434    13270  24170.83      1.0  712
1                johns company home    21644     5015  11034.36      1.1  406
2                       home repair   113185     4301  24137.57      2.9  297
3                   home remodeling    73210     3722  16548.53      2.7  215
4  johns company home repair system     8907     2502   4777.27      1.1  190
5               home repair systems    23016     2190   9565.45      2.0    0
6                      home systems    36789     1934   8014.29      3.0  125
7                     repair a home    48025     2468   9041.97      1.9  100
8             home remodeling ideas    29624     2706   7612.31      1.6   94
9                home repair system     6944      904   4045.37      1.5   93
My aim is to convert the text to numbers:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
text_features = cv.fit_transform(df['Term'])
text_features
<10x8 sparse matrix of type '<class 'numpy.int64'>'
with 27 stored elements in Compressed Sparse Row format>
My aim is to normalize the numeric X values:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
num_features = ss.fit_transform(df[['Metric1', 'Metric2', 'Metric3', 'Metric4']])
num_features
array([[ 2.81861161, 2.81931317, 1.76781103, -1.22081006],
[-0.52069075, 0.3351711 , -0.12390699, -1.08208165],
[ 0.50581477, 0.12031011, 1.76302143, 1.41502985],
[ 0.05755051, -0.05392589, 0.67016134, 1.13757301],
[-0.66351856, -0.42105531, -1.02495954, -1.08208165],
[-0.50530567, -0.51494414, -0.33543744, 0.1664741 ],
[-0.35086055, -0.59198114, -0.55881232, 1.55375826],
[-0.22486438, -0.43128678, -0.41082121, 0.02774568],
[-0.4312061 , -0.35966646, -0.61669947, -0.38843957],
[-0.68553089, -0.90193466, -1.13035684, -0.52716798]])
My aim is to join text_features and num_features into a single X to pass to an estimator:
from sklearn.pipeline import FeatureUnion
fu = FeatureUnion([('text', text_features), ('num', num_features)])
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(fu, df['y'])
Traceback (most recent call last):
File "<pyshell#230>", line 1, in <module>
lr.fit(fu, df['y'])
File "C:\Python34\lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
y_numeric=True, multi_output=True)
File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 510, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 393, in check_array
array = array.astype(np.float64)
TypeError: float() argument must be a string or a number, not 'FeatureUnion'
Should I be using FeatureUnion to merge the text and numeric data into a single X matrix?
Answer 0 (score: 1)
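I think you have misunderstood how FeatureUnion works. FeatureUnion applies multiple feature extractors/preprocessors (transformer objects) to the input and combines the resulting features into one matrix. Since you do not have multiple preprocessors but multiple already-computed matrices, you should use hstack instead. numpy.hstack() works when both matrices are dense. If the result needs to stay sparse, use scipy.sparse.hstack() instead.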
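To make the hstack suggestion concrete, here is a minimal sketch rather than a definitive recipe. It assumes a cut-down stand-in for the question's df (four rows, two metric columns), and the names X and model are illustrative only. The dense scaled array is wrapped in csr_matrix before stacking so the combined matrix stays sparse and the text features are never densified.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Assumption: a small stand-in for the question's df (first four rows only).
df = pd.DataFrame({
    'Term': ['johns company', 'johns company home',
             'home repair', 'home remodeling'],
    'Metric1': [319434, 21644, 113185, 73210],
    'Metric2': [13270, 5015, 4301, 3722],
    'y': [712, 406, 297, 215],
})

# Text -> sparse token-count matrix (CSR format).
text_features = CountVectorizer().fit_transform(df['Term'])

# Numeric columns -> dense, standardized numpy array.
num_features = StandardScaler().fit_transform(df[['Metric1', 'Metric2']])

# Stack column-wise. Converting the dense block to CSR first keeps
# every block sparse, so the result is sparse as well.
X = sparse.hstack(
    [text_features, sparse.csr_matrix(num_features)],
    format='csr',
)

# LinearRegression accepts sparse input directly.
model = LinearRegression()
model.fit(X, df['y'])

print(X.shape)
print(sparse.issparse(X))
```

If everything comfortably fits in memory as dense arrays, numpy.hstack([text_features.toarray(), num_features]) gives the same columns in dense form, but that defeats the purpose for a large vocabulary.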