sklearn: convert a text series to a sparse matrix, then scale the numeric columns, then combine into a single X

Time: 2016-01-05 06:10:05

Tags: python pandas scikit-learn

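If I have both text and numeric values, I would like to:

  1. Convert the text to numbers (I'm using CountVectorizer as a general example)
  2. Convert the numeric data to the same scale
  3. Combine 1 and 2 into a single X matrix to pass to an estimator
  4. When dealing with enormous sparse matrices, how do I combine a sparse matrix and a numpy array into a single X while staying mindful of memory constraints?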

Here is an example dataframe:

    df = pd.DataFrame({
        'Term': [ 'johns company', 'johns company home', 'home repair',
                'home remodeling', 'johns company home repair system',
                'home repair systems', 'home systems', 'repair a home',
                'home remodeling ideas', 'home repair system'],
        'Metric1': [ 319434, 21644, 113185, 73210, 8907, 23016, 36789, 48025, 29624,
                   6944],
        'Metric2': [13270, 5015, 4301, 3722, 2502, 2190, 1934, 2468, 2706, 904],
        'Metric3': [ 24170.83, 11034.36, 24137.57, 16548.53, 4777.27, 9565.45,
                   8014.29, 9041.97, 7612.31, 4045.37],
        'Metric4': [1.0, 1.1, 2.9, 2.7, 1.1, 2.0, 3.0, 1.9, 1.6, 1.5],
        'y': [712, 406, 297, 215, 190, 0, 125, 100, 94, 93]
        }, columns=['Term', 'Metric1', 'Metric2', 'Metric3', 'Metric4', 'y'])
    
    ## df looks like this
                                   Term  Metric1  Metric2   Metric3  Metric4    y
    0                     johns company   319434    13270  24170.83      1.0  712
    1                johns company home    21644     5015  11034.36      1.1  406
    2                       home repair   113185     4301  24137.57      2.9  297
    3                   home remodeling    73210     3722  16548.53      2.7  215
    4  johns company home repair system     8907     2502   4777.27      1.1  190
    5               home repair systems    23016     2190   9565.45      2.0    0
    6                      home systems    36789     1934   8014.29      3.0  125
    7                     repair a home    48025     2468   9041.97      1.9  100
    8             home remodeling ideas    29624     2706   7612.31      1.6   94
    9                home repair system     6944      904   4045.37      1.5   93
    

Here is my attempt at converting the text to numbers:

    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer()
    text_features = cv.fit_transform(df['Term'])
    text_features
    <10x8 sparse matrix of type '<class 'numpy.int64'>'
        with 27 stored elements in Compressed Sparse Row format>
    

Here is my attempt at normalizing the numeric X values:

    from sklearn.preprocessing import StandardScaler
    ss = StandardScaler()
    num_features = ss.fit_transform(df[['Metric1', 'Metric2', 'Metric3', 'Metric4']])
    num_features
    array([[ 2.81861161,  2.81931317,  1.76781103, -1.22081006],
           [-0.52069075,  0.3351711 , -0.12390699, -1.08208165],
           [ 0.50581477,  0.12031011,  1.76302143,  1.41502985],
           [ 0.05755051, -0.05392589,  0.67016134,  1.13757301],
           [-0.66351856, -0.42105531, -1.02495954, -1.08208165],
           [-0.50530567, -0.51494414, -0.33543744,  0.1664741 ],
           [-0.35086055, -0.59198114, -0.55881232,  1.55375826],
           [-0.22486438, -0.43128678, -0.41082121,  0.02774568],
           [-0.4312061 , -0.35966646, -0.61669947, -0.38843957],
           [-0.68553089, -0.90193466, -1.13035684, -0.52716798]])

Here is my attempt at joining text_features and num_features into a single X to pass to an estimator:

    from sklearn.pipeline import FeatureUnion
    fu = FeatureUnion([('text', text_features), ('num', num_features)])
    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(fu, df['y'])
    Traceback (most recent call last):
      File "<pyshell#230>", line 1, in <module>
        lr.fit(fu, df['y'])
      File "C:\Python34\lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
        y_numeric=True, multi_output=True)
      File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 510, in check_X_y
        ensure_min_features, warn_on_dtype, estimator)
      File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 393, in check_array
        array = array.astype(np.float64)
    TypeError: float() argument must be a string or a number, not 'FeatureUnion'

Should I be trying to use FeatureUnion to combine the text and numeric data into a single X matrix?

1 answer:

Answer 0 (score: 1)

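I think you have misunderstood how FeatureUnion works. FeatureUnion applies multiple feature extractors/preprocessors to the input and combines the resulting features into a single matrix. It expects transformer objects, not matrices that have already been transformed, which is why passing it to fit() fails. Since you don't have multiple preprocessors but rather multiple matrices, you should use hstack. numpy.hstack() requires both matrices to be dense. If you need the result to stay sparse, use scipy.sparse.hstack() instead.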
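As a minimal sketch of the sparse route, assuming a cut-down version of the question's dataframe (the first four rows and only two of the metric columns, chosen here purely for brevity), the two feature blocks can be stacked column-wise with scipy.sparse.hstack(). Wrapping the scaled array in csr_matrix and asking for `format='csr'` are illustrative choices, not something the answer prescribes.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Assumption: a shortened stand-in for the question's dataframe.
df = pd.DataFrame({
    'Term': ['johns company', 'johns company home',
             'home repair', 'home remodeling'],
    'Metric1': [319434, 21644, 113185, 73210],
    'Metric4': [1.0, 1.1, 2.9, 2.7],
    'y': [712, 406, 297, 215],
})

# Text block: sparse matrix of token counts.
text_features = CountVectorizer().fit_transform(df['Term'])

# Numeric block: dense array of standardized columns.
num_features = StandardScaler().fit_transform(df[['Metric1', 'Metric4']])

# Stack the blocks side by side; the text block is never densified.
X = sparse.hstack(
    [text_features, sparse.csr_matrix(num_features)],
    format='csr',
)

# The stacked sparse matrix can be passed to the estimator directly.
lr = LinearRegression()
lr.fit(X, df['y'])
```

Here X has one row per term and one column per vocabulary token plus one per scaled metric. Note that the scaled numeric block contains almost no zeros, so storing it as sparse saves no memory for that block; the saving comes from never converting the text block to a dense array.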