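Forming a sparse feature-matrix DataFrame in Pandas

Time: 2015-12-02 21:01:53

Tags: python pandas

I want to expand the "features" column of this DataFrame to create a new DataFrame in which those features become the column names.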

For example, from this:

Raw matrix

to this:

Features matrix

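My solution works, but I don't think it is very good because it uses a lot of for loops. Is there a better way that takes advantage of the pandas.DataFrame class?

The code that generates the feature matrix is as follows: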

import pandas as pd

def feature_data_frame_by_exploding_column(input_df, col_name):

    # Create data frame with same columns minus the column you want to explode
    df = input_df.copy()
    del df[col_name]

    # The items that you want to become new features
    all_new_features = []
    new_feature_list = input_df[col_name].values
    for ingred_list in new_feature_list:
        all_new_features.extend(ingred_list) # Extend vs append!

    # Add new features as columns of zeros
    for feature in all_new_features:
        df[feature] = 0

    # For each row in data frame set values that need to be 1
    for index, ingreds_arr in zip(df.index, new_feature_list):
        # Pair each index label with its own list, so a non-default index also works
        df.loc[index, ingreds_arr] = 1

    return df

df = pd.DataFrame(columns = ["id", "features"])
df['id'] = [0,1]
df['features'] = [["A", "B"], ["C", "D"]]
df

feature_data_frame_by_exploding_column(df,"features")

1 answer:

Answer 0 (score: 1)

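scikit-learn's MultiLabelBinarizer creates a binary matrix from labels. You can extract the features column from your pandas DataFrame (called feature below) and apply it:

from sklearn.preprocessing import MultiLabelBinarizer

feature = df["features"]
mlb = MultiLabelBinarizer()
new_array = mlb.fit_transform(feature)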
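To get back to a DataFrame whose columns are the feature names, as the question asks, the resulting array can be wrapped using the fitted binarizer's classes_ attribute. This is a minimal sketch, assuming the example df from the question; the names features_df and result are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Example data from the question
df = pd.DataFrame({"id": [0, 1], "features": [["A", "B"], ["C", "D"]]})

mlb = MultiLabelBinarizer()
new_array = mlb.fit_transform(df["features"])

# classes_ holds the sorted distinct labels, one per output column
features_df = pd.DataFrame(new_array, columns=mlb.classes_, index=df.index)

# Put the indicator columns next to the remaining original columns
result = df.drop(columns="features").join(features_df)
print(result)
```

Reusing df.index for features_df keeps the join aligned even when the input does not have a default 0..n-1 index.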

Also, by specifying MultiLabelBinarizer(sparse_output=True), you get truly sparse output (very useful if the number of distinct features is large).
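The sparse variant can also be carried into pandas without densifying it. The sketch below assumes the question's example df and a pandas version that provides DataFrame.sparse.from_spmatrix (0.25 or later); sparse_matrix and sparse_df are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Example data from the question
df = pd.DataFrame({"id": [0, 1], "features": [["A", "B"], ["C", "D"]]})

mlb = MultiLabelBinarizer(sparse_output=True)
sparse_matrix = mlb.fit_transform(df["features"])  # SciPy sparse matrix

# Build a DataFrame backed by sparse columns, one per distinct feature
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix, index=df.index, columns=mlb.classes_
)
print(sparse_df.sparse.density)
```

Only the nonzero entries are stored, so memory use grows with the number of labels actually present rather than with rows times distinct features.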