Splitting a column whose elements are dictionaries into multiple columns

Time: 2014-11-05 09:54:18

Tags: python pandas

I have a pandas DataFrame with a single column whose elements are dictionaries. It is the result of the following code:

dg # is a pandas dataframe with columns ID and VALUE. Many rows contain the same ID

def seriesFeatures(series):
    """This function receives a series of VALUE for the same ID and extracts
    tens of complex features from the series, storing them into a dictionary"""
    dico = dict()
    dico['feature1'] = calculateFeature1(series)  # placeholder helper, not defined here
    dico['feature2'] = calculateFeature2(series)
    # Many more features
    dico['feature50'] = calculateFeature50(series)
    return dico

grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: seriesFeatures(s) } )
dh.reset_index()
# Here I get a dh DataFrame of a single column 'all_features' and
# dictionaries stored on its values. The keys are the feature's names

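I need to split this 'all_features' column into as many columns as there are features, and to do so efficiently (I have a large number of rows and columns, and I cannot change the seriesFeatures function). The output should be a DataFrame with the columns ID, FEATURE1, FEATURE2, FEATURE3, ..., FEATURE50. What is the best way to do this?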

Edit

A concrete, simple example:

dg = pd.DataFrame( [ [1,10] , [1,15] , [1,13] , [2,14] , [2,16] ] , columns=['ID','VALUE'] )

def seriesFeatures(series):
    dico = dict()
    dico['feature1'] = len(series)
    dico['feature2'] = series.sum()
    return dico

grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( { 'all_features' : lambda s: seriesFeatures(s) } )
dh.reset_index()

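However, when I try to wrap the dictionary in a pd.Series or pd.DataFrame, I get an error saying that an index must be provided if the data are scalar values. If I provide index=['feature1','feature2'], I get strange results, for example with: dh = grouped['VALUE'].agg( { 'all_features' : lambda s: pd.DataFrame( seriesFeatures(s) , index=['feature1','feature2'] ) } )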
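For reference, the behaviour described here can be reproduced outside of groupby. The following is a minimal sketch, assuming the dictionary holds plain scalar values as in the simple example; the variable names (features, broadcast, as_series) are illustrative only. In this sketch the error comes from the DataFrame constructor, and the "strange results" are most likely each scalar being repeated once per index label.

```python
import pandas as pd

# Illustrative dictionary of scalar features (values taken from the ID == 1 group).
features = {'feature1': 3, 'feature2': 38}

# A dict of scalars carries no row labels, so the DataFrame constructor rejects it.
try:
    pd.DataFrame(features)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With an explicit index, every scalar is repeated once per index label,
# which yields a 2x2 frame rather than one value per feature.
broadcast = pd.DataFrame(features, index=['feature1', 'feature2'])

# pd.Series uses the dict keys as its index, giving one value per feature.
as_series = pd.Series(features)

print(error_message)
print(broadcast)
print(as_series)
```

The Series form is the one that lines up with "one entry per feature", which is why it is the more natural wrapper for this kind of dictionary.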

1 Answer:

Answer 0: (score: 1)

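I think you should wrap the dict in a Series, which will then be expanded within the groupby call (but then use apply instead of agg, because the result is no longer an aggregated (scalar) value):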

dh = grouped['VALUE'].apply(lambda s: pd.Series(seriesFeatures(s)))

Afterwards, you can reshape the result into the desired format.

With your simple example case, this seems to work:

In [22]: dh = grouped['VALUE'].apply(lambda x: pd.Series(seriesFeatures(x)))
In [23]: dh

Out[23]:
ID
1   feature1     3
    feature2    38
2   feature1     2
    feature2    30
dtype: int64

In [26]: dh.unstack().reset_index()
Out[26]:
   ID  feature1  feature2
0   1         3        38
1   2         2        30
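If the column of dictionaries already exists and the groupby step cannot be re-run, a different technique from the one above may help: building a new DataFrame from the list of dictionaries. The following is only a sketch under stated assumptions: the frame dh is constructed by hand here to mimic the question's result after reset_index(), and the names expanded and result are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the question's frame: one dict per row (assumed layout).
dh = pd.DataFrame({
    'ID': [1, 2],
    'all_features': [
        {'feature1': 3, 'feature2': 38},
        {'feature1': 2, 'feature2': 30},
    ],
})

# A list of dicts becomes one row per dict and one column per key.
# Reusing dh.index keeps the new rows aligned with the original ones.
expanded = pd.DataFrame(dh['all_features'].tolist(), index=dh.index)

# Put the ID column back next to the expanded feature columns.
result = pd.concat([dh[['ID']], expanded], axis=1)

print(result)
```

Converting the column to a plain list first lets the DataFrame constructor handle all rows in one call, rather than creating a separate Series for every row.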