I have a DataFrame, a subset of which looks like this:
{u'snId': {3: u'396321357429208',
695: u'606426623024865',
703: u'606426623024865',
914: u'606426623024865',
5097: u'606426623024865',
6865: u'396321357429208',
26884: u'606426623024865',
30538: u'396321357429208',
32152: u'606426623024865',
34314: u'396321357429208',
34345: u'606426623024865',
52207: u'606426623024865',
55361: u'396321357429208',
59077: u'606426623024865',
68118: u'396321357429208',
79366: u'396321357429208',
86798: u'606426623024865',
130472: u'396321357429208',
146595: u'396321357429208',
211110: u'606426623024865',
227155: u'396321357429208',
240219: u'396321357429208',
245716: u'606426623024865',
248525: u'606426623024865',
327256: u'606426623024865'},
u'snMsgType': {3: u'Private',
695: u'Private',
703: u'Private',
914: u'Private',
5097: u'Private',
6865: u'Private',
26884: u'Private',
30538: u'Private',
32152: u'Private',
34314: u'Private',
34345: u'Private',
52207: u'Private',
55361: u'Private',
59077: u'Private',
68118: u'Private',
79366: u'Private',
86798: u'Private',
130472: u'Private',
146595: u'Private',
211110: u'Private',
227155: u'Private',
240219: u'Private',
245716: u'Private',
248525: u'Private',
327256: u'Private'},
u'tagIds': {3: array([198419]),
695: array([201340]),
703: array([198419]),
914: array([198421]),
5097: array([202750]),
6865: array([199783]),
26884: array([198419, 202750]),
30538: array([198382]),
32152: array([188101]),
34314: array([198419, 198416]),
34345: array([198419, 201340]),
52207: array([201340]),
55361: array([202750]),
59077: array([198419, 198421]),
68118: array([198422]),
79366: array([188101]),
86798: array([202750]),
130472: array([198408]),
146595: array([198419, 188101]),
211110: array([198419, 199783]),
227155: array([201340]),
240219: array([198419, 199783]),
245716: array([199783]),
248525: array([198419, 198416]),
327256: array([198419, 188101])},
u'text': {3: u"No problem!",
695: u"If you're struggling on what other shapewear to buy, then this article will help.",
703: u"No problem!",
914: u"No problem!",
5097: u"No problem!",
6865: u"No problem!",
26884: u"No problem!",
30538: u"No problem!",
32152: u"No problem!",
34314: u"No problem!",
34345: u"No problem!",
52207: u"No problem!",
55361: u"No problem!",
59077: u"No problem!",
68118: u"No problem!",
79366: u"No problem!",
86798: u"If you're struggling on what other shapewear to buy, then this article will help.",
130472: u"No problem!",
146595: u"No problem!",
211110: u"No problem!",
227155: u"No problem!",
240219: u"No problem!",
245716: u"No problem!",
248525: u"No problem!",
327256: u"No problem!"}}
I want to group by the snId, text, and snMsgType columns and aggregate all unique tagIds into a list for each group. I used the following:
df.groupby(["snId","snMsgType","text"]).agg({'tagIds': lambda x: list(set(sum(filter(None, x), [])))})
However, it does not work on the full DataFrame, because some groups have only one row and therefore do not reduce; I get ValueError: Function does not reduce. The challenge is to make it work for all groups, whether they reduce or not. Apologies for the somewhat confusing problem statement and the wall of text.
The expected answer looks like this:
{'snId': {0: u'396321357429208', 1: u'606426623024865', 2: u'606426623024865'},
'snMsgType': {0: u'Private', 1: u'Private', 2: u'Private'},
'tagIds': {0: [188101,
199783,
198408,
198382,
198416,
198419,
198422,
201340,
202750],
1: [188101, 199783, 198416, 198419, 198421, 201340, 202750],
2: [201340, 202750]},
'text': {0: u"No problem!",
1: u"No problem!",
2: u"If you're struggling on what other shapewear to buy, then this article will help."}}
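For context, a minimal sketch of the desired behavior on toy data (column names from the question, but values hypothetical) that sidesteps the "does not reduce" error by using GroupBy.apply instead of agg, assuming a reasonably recent pandas:

```python
import pandas as pd

# Toy data with the question's column names; values are hypothetical.
df = pd.DataFrame({
    "snId": ["396", "396", "606"],
    "snMsgType": ["Private", "Private", "Private"],
    "text": ["No problem!", "No problem!", "Other text"],
    "tagIds": [[198419], [198419, 201340], [202750]],
})

# GroupBy.apply returns one object (here a list) per group, so
# single-row groups are fine -- no "Function does not reduce" error.
out = (df.groupby(["snId", "snMsgType", "text"])["tagIds"]
         .apply(lambda s: sorted({t for tags in s for t in tags}))
         .reset_index())

print(out["tagIds"].tolist())  # [[198419, 201340], [202750]]
```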
Answer 0 (score: 0)
IIUC, you first need to expand the tagIds column of lists into separate rows before you can do the groupby() and agg(), where my_dict is the input dict from the question:

import pandas as pd
df = pd.DataFrame(my_dict)
s = df.apply(lambda x: pd.Series(x['tagIds']), axis=1).stack().reset_index(level=1, drop=True).astype(int)
s.name = 'tagIds'
df = df.drop('tagIds', axis=1).join(s)
g = df.groupby(["snId","snMsgType","text"]).agg({'tagIds': lambda x: x.tolist()})

This returns a MultiIndex dataframe. If you would rather have the result in the format of your edited question, reset the index afterwards, e.g. g.reset_index().
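On pandas 0.25+, the apply/stack/join dance used to expand the list column can be replaced by DataFrame.explode. A sketch on hypothetical toy data (same column names as the question):

```python
import pandas as pd

# Hypothetical toy frame shaped like the question's data.
df = pd.DataFrame({
    "snId": ["396", "396", "606"],
    "snMsgType": ["Private", "Private", "Private"],
    "text": ["No problem!", "No problem!", "Other text"],
    "tagIds": [[198419], [198419, 201340], [202750]],
})

# explode() gives one row per tag; a plain agg then collects uniques.
g = (df.explode("tagIds")
       .groupby(["snId", "snMsgType", "text"])["tagIds"]
       .agg(lambda x: sorted(set(x)))
       .reset_index())

print(g["tagIds"].tolist())  # [[198419, 201340], [202750]]
```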
Answer 1 (score: 0)
I worked around it with the following:
def aggr_tags(df):
    myagg = lambda s: tuple(i for i in zip(*s))
    # Group by snId, snMsgType and text, aggregating the tag ids used.
    df = df.groupby(["snId", "snMsgType", "text"]).agg({'tagIds': myagg})
    # Convert back to lists.
    df['tagIds'] = df['tagIds'].apply(lambda x: list(set(x[0])))
    return df.reset_index()
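One caveat with the zip(*s) trick, assuming a group can contain tag arrays of different lengths: zip stops at the shortest sequence, and taking element [0] afterwards keeps only the first tag of each row, so later tags can be silently dropped. A quick illustration on plain lists:

```python
# Two rows in the same group, with arrays of different lengths.
group = [[198419, 198416], [198419]]

# zip(*group) truncates to the shortest row, and [0] keeps only the
# position-0 tags, so 198416 never makes it into the result.
first_positions = tuple(zip(*group))[0]
print(sorted(set(first_positions)))  # [198419] -- 198416 is silently lost
```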