I have a pandas DataFrame that I am grouping by the columns ['client', 'product', 'data'].
grouped_data = raw_data.groupby(['client', 'product', 'data'])
print(len(grouped_data))
# 10000
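I want to split the resulting groupby object into two chunks, one containing roughly 80% of the groups and the other containing the rest.
I've been banging my head against the screen for a while now...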
Answer 0 (score: 2)
By using np.split:
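import numpy as np
df = raw_data  # the DataFrame from the question
df['key'] = df[['client', 'product', 'data']].apply(tuple, axis=1)  # one hashable key per row
keys = df['key'].unique()
g1, g2 = np.split(keys, [int(len(keys) * 0.8)])  # cut at the 80% mark rather than a fixed index
df1 = df[df['key'].isin(g1)]
df2 = df[df['key'].isin(g2)]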
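As a variation on the key-based split above, here is a minimal sketch that labels each row with an integer group id via `groupby(...).ngroup()` instead of building tuples. The helper name `split_groups` and its `frac` and `seed` parameters are made up for illustration, and the sketch assumes the key columns contain no missing values (rows with missing keys get id -1 and would end up in the second frame).

```python
import numpy as np
import pandas as pd

def split_groups(df, keys, frac=0.8, seed=None):
    """Return two frames holding about frac and 1 - frac of the groups."""
    grouped = df.groupby(keys)
    # ngroup() labels every row with the integer id of the group it belongs to.
    group_ids = grouped.ngroup()
    n_groups = grouped.ngroups
    ids = np.arange(n_groups)
    if seed is not None:
        # Optional shuffle so the split does not follow the sort order of the keys.
        ids = np.random.default_rng(seed).permutation(n_groups)
    cut = int(round(n_groups * frac))
    in_first = group_ids.isin(ids[:cut])
    return df[in_first], df[~in_first]

# Tiny hypothetical stand-in for raw_data: 5 groups of 2 rows each.
raw_data = pd.DataFrame({
    "client": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "product": ["x"] * 10,
    "data": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})
cols = ["client", "product", "data"]
part1, part2 = split_groups(raw_data, cols)
```

Both results are ordinary DataFrames, so each can be passed to `groupby` again if per-group processing is needed.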
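Answer 1 (score: 0)
You can do the following (this example groups by 'Client' only; pass the full list of columns to groupby to match the question):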
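import numpy as np
grouped = df.groupby('Client')
# Note: the "- 1" leaves chunk1 one group short of ceil(80%), e.g. 3 of 5 groups
bound = int(np.ceil(len(grouped) * 0.8)) - 1
frames = [g for _, g in grouped]  # one DataFrame per group, in key order
chunk1 = frames[:bound]
chunk2 = frames[bound:]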
For the following example dataframe:
Client Product Data
0 Client1 ProductA Data1
1 Client2 ProductA Data3
2 Client3 ProductB Data1
3 Client4 ProductA Data2
4 Client5 ProductB Data1
5 Client2 ProductA Data1
6 Client3 ProductA Data3
7 Client2 ProductB Data1
8 Client3 ProductB Data1
9 Client5 ProductA Data2
10 Client1 ProductA Data1
11 Client1 ProductB Data1
12 Client4 ProductA Data2
13 Client3 ProductB Data2
14 Client2 ProductB Data3
chunk1 will produce:
Client Product Data
0 Client1 ProductA Data1
10 Client1 ProductA Data1
11 Client1 ProductB Data1
Client Product Data
1 Client2 ProductA Data3
5 Client2 ProductA Data1
7 Client2 ProductB Data1
14 Client2 ProductB Data3
Client Product Data
2 Client3 ProductB Data1
6 Client3 ProductA Data3
8 Client3 ProductB Data1
13 Client3 ProductB Data2
chunk2 will produce:
Client Product Data
3 Client4 ProductA Data2
12 Client4 ProductA Data2
Client Product Data
4 Client5 ProductB Data1
9 Client5 ProductA Data2
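The chunks above are lists of per-group DataFrames. If a single DataFrame per chunk is more convenient, one possible sketch is to concatenate each list with `pd.concat`. The data below is retyped from the example table, and the names `frames`, `chunk1_df` and `chunk2_df` are illustrative only.

```python
import numpy as np
import pandas as pd

# Example data retyped from the table above.
clients = [1, 2, 3, 4, 5, 2, 3, 2, 3, 5, 1, 1, 4, 3, 2]
products = list("AABABAABBAABABB")
data = [1, 3, 1, 2, 1, 1, 3, 1, 1, 2, 1, 1, 2, 2, 3]
df = pd.DataFrame({
    "Client": [f"Client{c}" for c in clients],
    "Product": [f"Product{p}" for p in products],
    "Data": [f"Data{d}" for d in data],
})

grouped = df.groupby("Client")
# Same bound as in the answer: 3 for the 5 groups in this example.
bound = int(np.ceil(len(grouped) * 0.8)) - 1
frames = [g for _, g in grouped]

# Stack the per-group frames back into one DataFrame per chunk.
chunk1_df = pd.concat(frames[:bound])
chunk2_df = pd.concat(frames[bound:])
```

The original row index is preserved, so rows can still be traced back to `df`.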