Question

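If I have both text and numeric values, I want to:

1. Convert the text into numeric features (I'm using CountVectorizer as a general example)
2. Transform the numeric data onto the same scale
3. Merge 1 and 2 into a single X matrix to pass to estimators

How do I merge a sparse matrix and a numpy array into a single X while taking memory limitations into account when dealing with a huge sparse matrix?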

Here is a sample data frame:

import pandas as pd

df = pd.DataFrame({
    'Term': [ 'johns company', 'johns company home', 'home repair',
            'home remodeling', 'johns company home repair system',
            'home repair systems', 'home systems', 'repair a home',
            'home remodeling ideas', 'home repair system'],
    'Metric1': [ 319434, 21644, 113185, 73210, 8907, 23016, 36789, 48025, 29624,
               6944],
    'Metric2': [13270, 5015, 4301, 3722, 2502, 2190, 1934, 2468, 2706, 904],
    'Metric3': [ 24170.83, 11034.36, 24137.57, 16548.53, 4777.27, 9565.45,
               8014.29, 9041.97, 7612.31, 4045.37],
    'Metric4': [1.0, 1.1, 2.9, 2.7, 1.1, 2.0, 3.0, 1.9, 1.6, 1.5],
    'y': [712, 406, 297, 215, 190, 0, 125, 100, 94, 93]
    }, columns=['Term', 'Metric1', 'Metric2', 'Metric3', 'Metric4', 'y'])

## df looks like this
                               Term  Metric1  Metric2   Metric3  Metric4    y
0                     johns company   319434    13270  24170.83      1.0  712
1                johns company home    21644     5015  11034.36      1.1  406
2                       home repair   113185     4301  24137.57      2.9  297
3                   home remodeling    73210     3722  16548.53      2.7  215
4  johns company home repair system     8907     2502   4777.27      1.1  190
5               home repair systems    23016     2190   9565.45      2.0    0
6                      home systems    36789     1934   8014.29      3.0  125
7                     repair a home    48025     2468   9041.97      1.9  100
8             home remodeling ideas    29624     2706   7612.31      1.6   94
9                home repair system     6944      904   4045.37      1.5   93

My aim is to convert the text to numeric features.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
text_features = cv.fit_transform(df['Term'])
text_features
<10x8 sparse matrix of type '<class 'numpy.int64'>'
    with 27 stored elements in Compressed Sparse Row format>

My aim is to normalize the numeric X values.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
num_features = ss.fit_transform(df[['Metric1', 'Metric2', 'Metric3', 'Metric4']])
num_features
array([[ 2.81861161,  2.81931317,  1.76781103, -1.22081006],
       [-0.52069075,  0.3351711 , -0.12390699, -1.08208165],
       [ 0.50581477,  0.12031011,  1.76302143,  1.41502985],
       [ 0.05755051, -0.05392589,  0.67016134,  1.13757301],
       [-0.66351856, -0.42105531, -1.02495954, -1.08208165],
       [-0.50530567, -0.51494414, -0.33543744,  0.1664741 ],
       [-0.35086055, -0.59198114, -0.55881232,  1.55375826],
       [-0.22486438, -0.43128678, -0.41082121,  0.02774568],
       [-0.4312061 , -0.35966646, -0.61669947, -0.38843957],
       [-0.68553089, -0.90193466, -1.13035684, -0.52716798]])

My aim is to join text_features and num_features in an effort to make a single X to pass to estimators.

Should I be trying to use FeatureUnion to merge the text and numeric data into a single X matrix?

from sklearn.pipeline import FeatureUnion
fu = FeatureUnion([('text', text_features), ('num', num_features)])
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(fu, df['y'])
Traceback (most recent call last):
  File "<pyshell#230>", line 1, in <module>
    lr.fit(fu, df['y'])
  File "C:\Python34\lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
    y_numeric=True, multi_output=True)
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 510, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 393, in check_array
    array = array.astype(np.float64)
TypeError: float() argument must be a string or a number, not 'FeatureUnion'

Answer 1

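I think you misunderstand how FeatureUnion works. FeatureUnion applies multiple feature extractors/preprocessors to the data and combines the resulting features into a single matrix. It expects transformer objects, not matrices that have already been transformed, and the FeatureUnion object itself is not data, which is why passing it to fit() raises the TypeError. Since you don't have multiple preprocessors but multiple matrices, you should use hstack. numpy.hstack() needs both matrices to be dense. If you need the result to stay sparse, use scipy.sparse.hstack() instead.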
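To make the hstack suggestion concrete, here is a minimal sketch, not a definitive implementation. It assumes a shortened copy of the question's data frame (first four rows, two of the metric columns) and converts the scaled numeric block to a sparse matrix explicitly before stacking; that conversion is my own choice to keep the call unambiguous, not something the question or answer prescribes.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Shortened copy of the question's data (assumption: 4 rows, 2 metrics).
df = pd.DataFrame({
    'Term': ['johns company', 'johns company home',
             'home repair', 'home remodeling'],
    'Metric1': [319434, 21644, 113185, 73210],
    'Metric4': [1.0, 1.1, 2.9, 2.7],
    'y': [712, 406, 297, 215],
})

# Step 1: text -> sparse count matrix (one column per distinct token).
text_features = CountVectorizer().fit_transform(df['Term'])

# Step 2: numeric columns -> standardized dense array.
num_features = StandardScaler().fit_transform(df[['Metric1', 'Metric4']])

# Step 3: stack column-wise. The text block is never densified, so memory
# use stays proportional to its stored elements plus the numeric columns.
X = sparse.hstack([text_features, sparse.csr_matrix(num_features)],
                  format='csr')

lr = LinearRegression()
lr.fit(X, df['y'])
```

LinearRegression accepts sparse input, so X can be passed to fit() directly. With many numeric columns, note that standardizing with mean centering makes that block fully dense, so the memory saving comes only from the text part.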

sklearn: convert a text Series to a sparse matrix, then scale the numeric columns, then combine them into a single X.
