Aggregating a column of lists over multiple group-by keys in pandas

Date: 2018-10-17 23:09:38

Tags: python pandas

I have a DataFrame, a subset of which looks like this:

{u'snId': {3: u'396321357429208',
  695: u'606426623024865',
  703: u'606426623024865',
  914: u'606426623024865',
  5097: u'606426623024865',
  6865: u'396321357429208',
  26884: u'606426623024865',
  30538: u'396321357429208',
  32152: u'606426623024865',
  34314: u'396321357429208',
  34345: u'606426623024865',
  52207: u'606426623024865',
  55361: u'396321357429208',
  59077: u'606426623024865',
  68118: u'396321357429208',
  79366: u'396321357429208',
  86798: u'606426623024865',
  130472: u'396321357429208',
  146595: u'396321357429208',
  211110: u'606426623024865',
  227155: u'396321357429208',
  240219: u'396321357429208',
  245716: u'606426623024865',
  248525: u'606426623024865',
  327256: u'606426623024865'},
 u'snMsgType': {3: u'Private',
  695: u'Private',
  703: u'Private',
  914: u'Private',
  5097: u'Private',
  6865: u'Private',
  26884: u'Private',
  30538: u'Private',
  32152: u'Private',
  34314: u'Private',
  34345: u'Private',
  52207: u'Private',
  55361: u'Private',
  59077: u'Private',
  68118: u'Private',
  79366: u'Private',
  86798: u'Private',
  130472: u'Private',
  146595: u'Private',
  211110: u'Private',
  227155: u'Private',
  240219: u'Private',
  245716: u'Private',
  248525: u'Private',
  327256: u'Private'},
 u'tagIds': {3: array([198419]),
  695: array([201340]),
  703: array([198419]),
  914: array([198421]),
  5097: array([202750]),
  6865: array([199783]),
  26884: array([198419, 202750]),
  30538: array([198382]),
  32152: array([188101]),
  34314: array([198419, 198416]),
  34345: array([198419, 201340]),
  52207: array([201340]),
  55361: array([202750]),
  59077: array([198419, 198421]),
  68118: array([198422]),
  79366: array([188101]),
  86798: array([202750]),
  130472: array([198408]),
  146595: array([198419, 188101]),
  211110: array([198419, 199783]),
  227155: array([201340]),
  240219: array([198419, 199783]),
  245716: array([199783]),
  248525: array([198419, 198416]),
  327256: array([198419, 188101])},
 u'text': {3: u"No problem!",
  695: u"If you're struggling on what other shapewear to buy, then this article will help.",
  703: u"No problem!",
  914: u"No problem!",
  5097: u"No problem!",
  6865: u"No problem!",
  26884: u"No problem!",
  30538: u"No problem!",
  32152: u"No problem!",
  34314: u"No problem!",
  34345: u"No problem!",
  52207: u"No problem!",
  55361: u"No problem!",
  59077: u"No problem!",
  68118: u"No problem!",
  79366: u"No problem!",
  86798: u"If you're struggling on what other shapewear to buy, then this article will help.",
  130472: u"No problem!",
  146595: u"No problem!",
  211110: u"No problem!",
  227155: u"No problem!",
  240219: u"No problem!",
  245716: u"No problem!",
  248525: u"No problem!",
  327256: u"No problem!"}}

I want to group by the snId, text, and snMsgType columns and collect all the unique tagIds of each group into a list.

I used the following:

df.groupby(["snId","snMsgType","text"]).agg({'tagIds': lambda x:  list(set(sum(filter(None, x), [])))})

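However, this fails on the full DataFrame, because some groups have only one row and are not reduced. I get ValueError: Function does not reduce. The challenge is to make it work for every group, whether or not it needs reducing.

I apologize for the somewhat confusing problem statement and the large block of text.

The answer would look like this: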

{'snId': {0: u'396321357429208', 1: u'606426623024865', 2:u'606426623024865'},
 'snMsgType': {0: u'Private', 1: u'Private', 2:'Private'},
 'tagIds': {0: [188101,
   199783,
   198408,
   198382,
   198416,
   198419,
   198422,
   201340,
   202750],
  1: [188101, 199783, 198416, 198419, 198421, 201340, 202750],
  2: [201340, 202750]},
 'text': {0: u"No problem!",
  1: u"No problem!",
  2: u"If you're struggling on what other shapewear to buy, then this article will help."}}

2 answers:

Answer 0: (score: 0)

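IIUC, you first need to expand the tagIds column of lists into separate rows before you can perform a groupby() and agg() on tagIds, where my_dict is the input dict:

import pandas as pd

df = pd.DataFrame(my_dict)

# Spread each row's tagIds over columns, then stack them into one tag id per row,
# keeping the original row label as the index.
s = df.apply(lambda x: pd.Series(x['tagIds']), axis=1).stack().reset_index(level=1, drop=True).astype(int)
s.name = 'tagIds'

# Swap the list column for the one-tag-per-row Series (joined on the index).
df = df.drop('tagIds', axis=1).join(s)

# Collect each group's tag ids into a list; unique() keeps each id once, as the question asks.
g = df.groupby(["snId","snMsgType","text"]).agg({'tagIds': lambda x: x.unique().tolist()})

This returns a DataFrame with a MultiIndex. If you would like the result in the format of your edited question, the index levels have to be turned back into ordinary columns and the result converted to a dict.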
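The same expand-then-group idea can be sketched on newer pandas versions, where `DataFrame.explode` (added in pandas 0.25, so an assumption about the installed version) can stand in for the apply/stack step. `reset_index()` followed by `to_dict()` then gives the dict-of-columns layout shown in the question. The toy frame and the names `long_df`, `result` and `as_dict` are illustrative only and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's data (values are illustrative only).
df = pd.DataFrame({
    "snId": ["A", "A", "B"],
    "snMsgType": ["Private", "Private", "Private"],
    "text": ["No problem!", "No problem!", "No problem!"],
    "tagIds": [np.array([198419]),
               np.array([198419, 202750]),
               np.array([201340])],
})

keys = ["snId", "snMsgType", "text"]

# explode() gives every tag id its own row and repeats the other columns.
long_df = df.explode("tagIds")

# Collect the distinct tag ids per group; sorted() makes the order predictable.
g = long_df.groupby(keys)["tagIds"].agg(lambda s: sorted(set(s)))

# Turn the MultiIndex back into ordinary columns, then into a dict of columns.
result = g.reset_index()
as_dict = result.to_dict()
```

Group "B" has a single row here, which is the case that raised the error in the question; with this approach it simply ends up as a one-element list.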

Answer 1: (score: 0)

I worked around it with the following trick:

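from itertools import chain

def aggr_tags(df):
    # Return a tuple instead of a list: the list-returning lambda in the question
    # raised "Function does not reduce", and a tuple sidesteps that.
    # chain.from_iterable flattens every row's tag array; zip(*s) would stop at
    # the shortest array and silently drop the extra tags.
    myagg = lambda s: tuple(chain.from_iterable(s))
    # Group by snId, snMsgType and text, and collect all the tag ids used.
    df = df.groupby(["snId","snMsgType","text"]).agg({'tagIds': myagg})

    # Convert back to lists, keeping each tag id once.
    df['tagIds'] = df['tagIds'].apply(lambda x: list(set(x)))

    return df.reset_index()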
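To check the single-row case that triggered the original error, the tuple-then-list idea can be tried on a toy frame. This is only a sketch: `collect_tags` is a hypothetical helper name, the data is made up, and the arrays are flattened with `itertools.chain.from_iterable` so that rows carrying several tags contribute all of them.

```python
from itertools import chain

import numpy as np
import pandas as pd


def collect_tags(df):
    # Hypothetical helper: aggregate to a tuple first, then convert to a list.
    keys = ["snId", "snMsgType", "text"]
    flatten = lambda s: tuple(chain.from_iterable(s))
    out = df.groupby(keys).agg({"tagIds": flatten})
    # Keep each tag id once; sorted() makes the order predictable.
    out["tagIds"] = out["tagIds"].apply(lambda t: sorted(set(t)))
    return out.reset_index()


# Group "A" has two rows (one with two tags); group "B" has a single row.
toy = pd.DataFrame({
    "snId": ["A", "A", "B"],
    "snMsgType": ["Private", "Private", "Private"],
    "text": ["No problem!", "No problem!", "No problem!"],
    "tagIds": [np.array([198419]),
               np.array([198419, 202750]),
               np.array([201340])],
})

result = collect_tags(toy)
```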