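Take the example from the scikit-learn examples that uses a FeatureUnion inside the pipeline shown below. After the pipeline has run, how can I get the dimensions of the whole feature matrix?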
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# SubjectBodyExtractor, ItemSelector and TextStats are custom transformers
# defined in the scikit-learn example itself; they are not part of the library.
pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),
        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])
Answer 0 (score: 0)
FeatureUnion only changes the columns of the data, so the number of rows stays the same.
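To make the row/column behaviour concrete, here is a minimal sketch on made-up data. The two `FunctionTransformer` steps and their names (`first_two`, `row_sum`) are hypothetical stand-ins for the text pipelines in the question, not part of the original example.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy data, invented for illustration: 6 samples, 4 columns.
X = np.arange(24, dtype=float).reshape(6, 4)

# Two stand-in transformers that produce different numbers of columns.
first_two = FunctionTransformer(lambda A: A[:, :2])                     # 2 columns
row_sum = FunctionTransformer(lambda A: A.sum(axis=1, keepdims=True))   # 1 column

union = FeatureUnion([("first_two", first_two), ("row_sum", row_sum)])
combined = union.fit_transform(X)

print(X.shape)         # (6, 4)
print(combined.shape)  # (6, 3): the 2 + 1 columns are stacked side by side
```

The outputs of the individual transformers are concatenated column-wise, so the union's width is the sum of their widths while the row count is untouched.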
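To get the number of columns after the pipeline has run, there are several ways:
1) Your current pipeline uses SVC as its last estimator. It does not change the shape of the data; it only fits on it. So you can use its attributes to get the number of features that the previous step fed into it.
According to the documentation, you can use:
support_vectors_ : array-like, shape = [n_SV, n_features]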
The second dimension is n_features, the number of features that were fed into the SVC. On the fitted pipeline you can access it like this:
pipeline.named_steps['svc'].support_vectors_.shape
2) (Easier) Copy the pipeline, leaving out the last step (svc), and call fit_transform() on the copy. The second dimension of the returned matrix is the number of columns.
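Both approaches can be sketched on a small stand-in pipeline. The corpus, labels, step names (`counts`, `svd`) and the variable names `toy_pipeline` and `features_only` are all invented for illustration; the question's pipeline depends on custom transformers that are not available here, so this only assumes the same overall shape (a FeatureUnion followed by a linear SVC).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus with two classes.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock prices rose sharply today",
    "the market fell after the report",
    "my dog sleeps on the sofa",
    "investors sold shares this morning",
]
labels = [0, 0, 1, 1, 0, 1]

toy_pipeline = Pipeline([
    ("union", FeatureUnion([
        ("counts", CountVectorizer()),
        ("svd", Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("best", TruncatedSVD(n_components=2, random_state=0)),
        ])),
    ])),
    ("svc", SVC(kernel="linear")),
])
toy_pipeline.fit(docs, labels)

# Approach 1: read the feature count from the fitted SVC.
n_features_from_svc = toy_pipeline.named_steps["svc"].support_vectors_.shape[1]

# Approach 2: rebuild the pipeline without the final step and transform the data.
# Note that this reuses (and refits) the same transformer objects.
features_only = Pipeline(toy_pipeline.steps[:-1])
matrix = features_only.fit_transform(docs, labels)

print(n_features_from_svc, matrix.shape)
```

The two numbers should agree, since `matrix` is exactly what the SVC receives as input. Approach 2 also gives you the row count and the matrix itself, which is why it tends to be the more convenient option.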