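Feature matrix from a pipeline

Time: 2018-07-10 06:25:51

Tags: python-3.x scikit-learn feature-extraction

Take the example from the scikit-learn examples that uses a FeatureUnion inside the pipeline shown below. After the pipeline has been run, how can I get the dimensions of the full feature matrix?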

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# SubjectBodyExtractor, ItemSelector and TextStats are custom transformers
# defined in the scikit-learn example itself; they are not importable from sklearn.

pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

1 answer:

Answer 0 (score: 0)

FeatureUnion only changes the columns of the data, so the number of rows stays the same.
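To illustrate that point, here is a minimal sketch using toy numeric data rather than the text pipeline from the question. The helper functions `keep_all` and `first_two` are made up for this illustration; they simply stand in for transformers that produce 3 and 2 columns.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def keep_all(a):
    # Hypothetical stand-in transformer: passes all 3 columns through
    return a


def first_two(a):
    # Hypothetical stand-in transformer: keeps only the first 2 columns
    return a[:, :2]


X = np.arange(12, dtype=float).reshape(4, 3)  # 4 rows, 3 columns

union = FeatureUnion([
    ("keep_all", FunctionTransformer(keep_all)),
    ("first_two", FunctionTransformer(first_two)),
])

# The outputs of the transformers are stacked side by side
combined = union.fit_transform(X)
print(combined.shape)  # (4, 5): same 4 rows, 3 + 2 columns
```

The row count matches the input, and the column count is the sum of the columns produced by each transformer in the union.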

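There are several ways to get the number of columns after the pipeline has run:

1) Your current pipeline uses SVC as its final estimator. It does not change the shape of the data; it is only fitted on it. So you can use its attributes to get the number of features that the previous step passed to it.

According to the documentation, you can use:

support_vectors_ : array-like, shape = [n_SV, n_features]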

The second dimension is the n_features that was passed to the SVC. You can access it with:

pipeline.named_steps['svc'].support_vectors_.shape

2) (Easier) You can make a copy of your pipeline that leaves out the last step (svc) and then call fit_transform() on it.
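Both approaches can be sketched on a small stand-in pipeline. The documents, labels, and step names other than 'svc' below are invented for this illustration, and the custom transformers from the question are replaced by stock vectorizers, so treat this as a sketch under those assumptions rather than the exact setup from the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Made-up toy data, only for demonstration
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the dog barked loudly",
    "cats and dogs play",
    "a bird sang",
    "birds fly high",
]
labels = [0, 0, 1, 1, 0, 1]

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("svd", Pipeline([
            ("counts", CountVectorizer()),
            ("best", TruncatedSVD(n_components=2)),
        ])),
    ])),
    ("svc", SVC(kernel="linear")),
])
pipeline.fit(docs, labels)

# Approach 1: read the feature count from the fitted SVC
n_features_svc = pipeline.named_steps["svc"].support_vectors_.shape[1]

# Approach 2: build a pipeline from every step except the last one.
# Note: the steps are shared with the original pipeline, not cloned,
# so fit_transform refits those same transformer objects.
feature_pipeline = Pipeline(pipeline.steps[:-1])
features = feature_pipeline.fit_transform(docs, labels)

print(n_features_svc)
print(features.shape)
```

The second value printed should be a (n_samples, n_features) pair whose column count matches the number read from the SVC. Approach 2 has the advantage of not depending on the final estimator exposing a suitable attribute.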