Split a pandas DataFrame based on a string column value

Date: 2020-01-01 16:37:26

Tags: python pandas

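I am struggling to split a DataFrame into 3 new DataFrames, splitting wherever the supplier name changes. I searched the existing questions. "How to split dataframe on based on columns row" and "Pandas & python: split dataframe into many dataframes based on column value containing substring" come close, but I could not get the output I need.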

Here is a toy dataset to illustrate my problem:

import pandas as pd

df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})

I tried this (and failed):

df1 = df.iloc[:df.index[df['Supplier'] == 'Supplier1'].tolist()[0]]
df2 = df.iloc[df.index[df['Supplier'] == 'Supplier2'].tolist()[0]+1:]
df3 = df.iloc[df.index[df['Supplier'] == 'Supplier3'].tolist()[0]+1:]

The result I want to achieve is:

    Supplier Class
0  Supplier1     A
1  Supplier1     A
    Supplier Class
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
    Supplier Class
0  Supplier3     B
1  Supplier3     B

Any help with this would be greatly appreciated. Thanks!

Update: using

df1 = {i:group for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}

gives:

{1:     Supplier Class
0  Supplier1     A
1  Supplier1     A, 2:     Supplier Class
2  Supplier2     A
3  Supplier2     A
4  Supplier2     A, 3:     Supplier Class
5  Supplier3     B
6  Supplier3     B}

I need these split into separate DataFrames, so I did this:

df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])

But it raises an error:

 df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])
AttributeError: 'dict' object has no attribute 'groupby'

3 answers:

Answer 0 (score: 3)

To create one DataFrame per unique Supplier value:

dict(tuple(df.groupby('Supplier')))
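
As a rough sketch of how that per-supplier dictionary could be used, assuming the toy df from the question (the names by_supplier and supplier2 are made up here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# Iterating a GroupBy yields (key, sub-frame) pairs, which dict() accepts.
by_supplier = dict(tuple(df.groupby('Supplier')))

# Look up one supplier's rows by name; the original row labels are kept.
supplier2 = by_supplier['Supplier2']
print(supplier2)
```

Note that the sub-frames keep their original index here; add reset_index(drop=True) if each one should start at 0.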

To create a new DataFrame each time the value in the Supplier column changes:

dfs = {i:group.reset_index(drop=True) 
       for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}
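
If the grouping key looks opaque, here is a small sketch that breaks it into steps, assuming the toy df from the question (changed and run_id are illustrative names, not part of the code above):

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# True wherever a row differs from the row above it (the first row always does,
# because shift() puts a missing value there).
changed = df['Supplier'].ne(df['Supplier'].shift())

# Running total of the True values: every consecutive run gets its own integer.
run_id = changed.cumsum()

print(pd.DataFrame({'Supplier': df['Supplier'],
                    'changed': changed,
                    'run_id': run_id}))
```

Grouping by run_id is what splits the frame at each change rather than by distinct value.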

Update

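Getting three separate DataFrames is incompatible with wrapping the result in pd.DataFrame(..), which by definition builds a single DataFrame. My solution is therefore a dictionary of DataFrames, where each DataFrame is accessed by an integer key from 1 to n. We can reset the index of each one simply by doing the following: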

{i:group.reset_index(drop=True) for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}
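
A minimal sketch of pulling individual frames back out of such a dictionary, assuming the toy df from the question. The tuple unpacking only works when you already know there are exactly three runs:

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

run_id = df['Supplier'].ne(df['Supplier'].shift()).cumsum()
dfs = {i: group.reset_index(drop=True) for i, group in df.groupby(run_id)}

# Access a single frame by its integer key ...
print(dfs[2])

# ... or unpack all of them; this raises ValueError if there are not exactly 3.
df1, df2, df3 = dfs.values()
print(df3)
```

For an unknown number of runs, keeping the dictionary (or a list) is usually easier than creating numbered variables.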

As suggested by @anky_91, we can use pd.concat to get a single DataFrame whose index restarts each time the value in the Supplier column changes:

dfs_concat = pd.concat([group.reset_index(drop=True) 
                        for _,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift())
                                                                 .cumsum() )])
print(dfs_concat)

    Supplier Class
0  Supplier1     A
1  Supplier1     A
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
0  Supplier3     B
1  Supplier3     B

However, if that single DataFrame is all we want, we can simply use groupby.cumcount:

df.index = df.groupby(df['Supplier'].ne(df['Supplier'].shift()).cumsum()).cumcount()
print(df)

    Supplier Class
0  Supplier1     A
1  Supplier1     A
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
0  Supplier3     B
1  Supplier3     B

Answer 1 (score: 0)

Try this:

df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})


df1 = df[df.Supplier=='Supplier1']
df2 = df[df.Supplier=='Supplier2']
df3 = df[df.Supplier=='Supplier3']

Or you can do:

new_df=df.pivot(columns='Supplier', values='Class')

This gives one column per supplier, so you can end up with many columns if there are many suppliers.

Output:

Supplier Supplier1 Supplier2 Supplier3
0                A       NaN       NaN
1                A       NaN       NaN
2              NaN         A       NaN
3              NaN         A       NaN
4              NaN         A       NaN
5              NaN       NaN         B
6              NaN       NaN         B
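
The boolean-mask version above hard-codes the supplier names and keeps the original row labels (2, 3, 4 for Supplier2). A rough sketch of a loop-based variant that avoids both, assuming the same toy df (frames is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2',
                 'Supplier2', 'Supplier3', 'Supplier3'],
    'Class': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# One frame per distinct supplier, in order of first appearance,
# each with its index reset so it starts at 0.
frames = [df[df['Supplier'] == name].reset_index(drop=True)
          for name in df['Supplier'].unique()]

for frame in frames:
    print(frame)
```

This scans the column once per supplier, so for many suppliers a single groupby is likely the better choice.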

Answer 2 (score: 0)

I believe this achieves the split you want:

groups = [group.reset_index()[['Supplier', 'Class']] for _, group in df.groupby('Supplier')]

You can reproduce the exact output from your example with:

for group in groups:
    print(group)

Output:

    Supplier Class
0  Supplier1     A
1  Supplier1     A
    Supplier Class
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
    Supplier Class
0  Supplier3     B
1  Supplier3     B
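
One caveat worth checking: grouping by 'Supplier' splits by distinct value, not at each change. On the toy data the two coincide, but they differ when a supplier reappears later. A small sketch with hypothetical data (not from the question) to show the difference:

```python
import pandas as pd

# Hypothetical data: Supplier1 shows up again after Supplier2.
df = pd.DataFrame({
    'Supplier': ['Supplier1', 'Supplier2', 'Supplier2', 'Supplier1'],
    'Class': ['A', 'A', 'B', 'B'],
})

# Split by distinct value: both Supplier1 rows land in the same frame.
by_value = [g.reset_index(drop=True) for _, g in df.groupby('Supplier')]

# Split at each change: the second Supplier1 run becomes its own frame.
run_id = df['Supplier'].ne(df['Supplier'].shift()).cumsum()
by_run = [g.reset_index(drop=True) for _, g in df.groupby(run_id)]

print(len(by_value))  # 2
print(len(by_run))    # 3
```

Which of the two is right depends on whether "when the supplier name changes" should treat a returning supplier as a new block.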