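I am trying to split a DataFrame into 3 new DataFrames, with a split wherever the supplier name changes. I searched existing questions. "How to split dataframe on based on columns row" and "Pandas & python: split dataframe into many dataframes based on column value containing substring" come close, but I could not get the output I need.
Here is a toy dataset to illustrate my problem: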
import pandas as pd
df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})
I tried the following (it did not work):
df1 = df.iloc[:df.index[df['Supplier'] == 'Supplier1'].tolist()[0]]
df2 = df.iloc[df.index[df['Supplier'] == 'Supplier2'].tolist()[0]+1:]
df3 = df.iloc[df.index[df['Supplier'] == 'Supplier3'].tolist()[0]+1:]
The result I want is:
Supplier Class
0 Supplier1 A
1 Supplier1 A
Supplier Class
0 Supplier2 A
1 Supplier2 A
2 Supplier2 A
Supplier Class
0 Supplier3 B
1 Supplier3 B
Any help with this would be greatly appreciated. Thanks!
Update: using:
df1 = {i:group for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}
gives:
{1: Supplier Class
0 Supplier1 A
1 Supplier1 A, 2: Supplier Class
2 Supplier2 A
3 Supplier2 A
4 Supplier2 A, 3: Supplier Class
5 Supplier3 B
6 Supplier3 B}
I need separate DataFrames, so I tried this:
df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])
But it raises an error:
df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])
AttributeError: 'dict' object has no attribute 'groupby'
Answer 0 (score: 3)
To create one DataFrame per unique supplier value:
dict(tuple(df.groupby('Supplier')))
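A minimal sketch of this per-supplier dictionary, using the toy data from the question. It relies on the fact that iterating over a GroupBy yields (key, sub-frame) pairs, so `dict(tuple(...))` can consume them directly. The name `by_supplier` is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# Each iteration step of a GroupBy is a (key, sub-frame) pair.
by_supplier = dict(tuple(df.groupby('Supplier')))

print(list(by_supplier))          # the three supplier names
print(by_supplier['Supplier2'])   # rows 2, 3 and 4 of the original frame
```

Note that the sub-frames keep their original row labels here; nothing resets the index.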
To create a new DataFrame each time the value in the Supplier column changes:
dfs = {i:group.reset_index(drop=True)
for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}
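To see how the grouping key is built, here is a small sketch on hypothetical data (not from the question) where Supplier1 reappears after Supplier2. The names `df_demo`, `changed`, `run_id`, `by_run` and `by_value` are made up for the illustration.

```python
import pandas as pd

# Hypothetical data: Supplier1 comes back after Supplier2.
df_demo = pd.DataFrame(
    {'Supplier': ['Supplier1', 'Supplier2', 'Supplier2', 'Supplier1']}
)

# True wherever a row differs from the previous one; the first row is
# compared against NaN, so it always counts as a change.
changed = df_demo['Supplier'].ne(df_demo['Supplier'].shift())

# The running total of changes labels each consecutive run.
run_id = changed.cumsum()
print(run_id.tolist())   # [1, 2, 2, 3]

by_run = {i: g.reset_index(drop=True) for i, g in df_demo.groupby(run_id)}
by_value = dict(tuple(df_demo.groupby('Supplier')))
print(len(by_run), len(by_value))   # 3 2
```

On the question's toy data both approaches give the same three groups, because each supplier occupies one contiguous block. They only differ when a supplier shows up again later.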
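Update
Getting three separate DataFrames is not compatible with ultimately calling pd.DataFrame(..), which obviously creates a single DataFrame. My solution is therefore a dictionary of DataFrames, each one accessed by an integer key from 1 to n. We can reset the index of each of them simply by doing: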
{i:group.reset_index(drop=True) for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}
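A short sketch of how such a dictionary might be used, on the toy data from the question. It also shows why calling `.groupby` on the dictionary fails: the dictionary is a plain `dict`, while each of its values is already a DataFrame. The names `df1`, `df2` and `df3` simply mirror the ones asked for in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

run_id = df['Supplier'].ne(df['Supplier'].shift()).cumsum()
dfs = {i: g.reset_index(drop=True) for i, g in df.groupby(run_id)}

print(hasattr(dfs, 'groupby'))            # False: a plain dict
print(isinstance(dfs[1], pd.DataFrame))   # True: each value is a DataFrame

# Unpack into separate variables, in key order 1, 2, 3.
df1, df2, df3 = (dfs[k] for k in sorted(dfs))
print(df2)
```

If the number of groups is not known in advance, it is usually easier to keep working with the dictionary than to unpack it into named variables.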
We can also use pd.concat, as suggested by @anky_91:
dfs_concat = pd.concat([group.reset_index(drop=True)
for _,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift())
.cumsum() )])
print(dfs_concat)
Supplier Class
0 Supplier1 A
1 Supplier1 A
0 Supplier2 A
1 Supplier2 A
2 Supplier2 A
0 Supplier3 B
1 Supplier3 B
But if the latter is what you are after, we can simply use groupby.cumcount:
df.index = df.groupby(df['Supplier'].ne(df['Supplier'].shift()).cumsum()).cumcount()
print(df)
Supplier Class
0 Supplier1 A
1 Supplier1 A
0 Supplier2 A
1 Supplier2 A
2 Supplier2 A
0 Supplier3 B
1 Supplier3 B
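The cumcount approach can be sketched step by step as follows. One assumption on my part: instead of overwriting `df.index` in place, this version works on a copy (called `renumbered` here, a made-up name) so the original frame stays untouched.

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

run_id = df['Supplier'].ne(df['Supplier'].shift()).cumsum()

# cumcount numbers the rows inside each group, starting from 0.
position = df.groupby(run_id).cumcount()
print(position.tolist())   # [0, 1, 0, 1, 2, 0, 1]

# Apply the per-group positions as the index of a copy.
renumbered = df.copy()
renumbered.index = position
print(renumbered)
```

This still produces one DataFrame with a repeating index, not three separate objects.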
Answer 1 (score: 0)
Try this:
df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})
df1 = df[df.Supplier=='Supplier1']
df2 = df[df.Supplier=='Supplier2']
df3 = df[df.Supplier=='Supplier3']
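If the supplier names are not known up front, the same boolean filtering could be written as a loop over the unique values. This is only a sketch: the name `frames` is made up, and the `reset_index(drop=True)` is added on the assumption that each piece should start at 0 as in the question's desired output.

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# One filtered frame per distinct supplier, keyed by the supplier name.
frames = {
    name: df[df['Supplier'] == name].reset_index(drop=True)
    for name in df['Supplier'].unique()
}

print(frames['Supplier3'])
```

Like the hard-coded version, this selects by value, so non-adjacent rows of the same supplier end up in the same frame.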
Or you can do:
new_df=df.pivot(columns='Supplier')
This gives you as many columns as there are suppliers.
Output:
Supplier Supplier1 Supplier2 Supplier3
0 A NaN NaN
1 A NaN NaN
2 NaN A NaN
3 NaN A NaN
4 NaN A NaN
5 NaN NaN B
6 NaN NaN B
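A hedged sketch of the pivot variant: passing `values='Class'` explicitly is my assumption about how to get flat column labels like the ones printed above, since without it pandas keeps `Class` as an extra column level. The name `wide` is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# One column per supplier; rows keep the original index, and cells that
# do not belong to a supplier are NaN.
wide = df.pivot(columns='Supplier', values='Class')
print(wide)

# Dropping the NaN cells recovers the Class values for one supplier.
print(wide['Supplier2'].dropna().tolist())   # ['A', 'A', 'A']
```

This reshapes the data into a single wide DataFrame, so it does not by itself produce separate DataFrames.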
Answer 2 (score: 0)
I believe this achieves the split you want:
groups = [group.reset_index()[['Supplier', 'Class']] for _, group in df.groupby('Supplier')]
You can reproduce the exact output from your example with:
for group in groups:
    print(group)
Output:
Supplier Class
0 Supplier1 A
1 Supplier1 A
Supplier Class
0 Supplier2 A
1 Supplier2 A
2 Supplier2 A
Supplier Class
0 Supplier3 B
1 Supplier3 B
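A slightly shorter variant of the same idea, as a sketch: `reset_index(drop=True)` discards the old index directly, so no column selection is needed afterwards. Passing `sort=False` is an assumption on my part that the groups should come out in order of first appearance instead of sorted by name; on this toy data both orders coincide.

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# sort=False keeps the groups in the order they first appear in df.
groups = [
    group.reset_index(drop=True)
    for _, group in df.groupby('Supplier', sort=False)
]

for group in groups:
    print(group)
```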