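Data manipulation in a pandas DataFrame: adding a row for each group

Time: 2019-07-24 12:16:25

Tags: python pandas data-manipulation

I want to do the following data manipulation. For each manager, I want to add one more row below that manager's rows, in which the manager and the worker are the same person. How can I do this?

Note: all manager fields are the same within a given manager's rows. This is just a sample scenario from my dataset. Thanks.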

    import pandas as pd

    data = [['Tom','Aurora',4500,'Shelly','Chicago',43553]
        ,['Tom','Aurora',4500,'Alex','NewYork',43654]
        ,['Tom','Aurora',4500,'Kelly','Cincinnati',44674]
        ,['Jason','Charlotte',4567,'Jimmy','Boston',44984]
        ,['Jason','Charlotte',4567,'Aaron','Austin',44583]
    ]

    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=['Manager', 'Managercity',
        'manager_id', 'Worker', 'WorkerCity', 'Worker_id'])

    # Print the DataFrame
    print(df)

The desired dataset is shown below:

 Manager Managercity  manager_id  Worker  WorkerCity  Worker_id
    Tom      Aurora        4500  Shelly     Chicago      43553
    Tom      Aurora        4500    Alex     NewYork      43654
    Tom      Aurora        4500   Kelly  Cincinnati      44674
    Tom      Aurora        4500     Tom      Aurora       4500
  Jason   Charlotte        4567   Jimmy      Boston      44984
  Jason   Charlotte        4567   Aaron      Austin      44583
  Jason   Charlotte        4567   Jason   Charlotte       4567

Thanks

2 answers:

Answer 0 (score: 1)

Try:

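def add(gr):
    # Copy the group's first row (as a one-row DataFrame) so the original data is untouched
    new_row = gr.iloc[[0]].copy()
    new_row['Worker'] = new_row['Manager']
    new_row['Worker_id'] = new_row['manager_id']
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    return pd.concat([gr, new_row])
# sort=False keeps the managers in their original order
df = pd.concat([add(gr) for _, gr in df.groupby('Manager', sort=False)], ignore_index=True)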

To carry the manager's city over as well, add new_row['WorkerCity'] = new_row['Managercity'] inside the add function (column names as in the question's sample data).

Answer 1 (score: 1)

You can use pd.concat and drop_duplicates like this:

import numpy as np
import pandas as pd

data = [['Tom','Aurora',4500,'Shelly','Chicago',43553]
    ,['Tom','Aurora',4500,'Alex','NewYork',43654]
    ,['Tom','Aurora',4500,'Kelly','Cincinnati',44674]
    ,['Jason','Charlotte',4567,'Jimmy','Boston',44984]
    ,['Jason','Charlotte',4567,'Aaron','Austin',44583]
]

# Create the pandas DataFrame
df_in = pd.DataFrame(data, columns=['Manager', 'Managercity', 'manager_id', 'Worker', 'WorkerCity', 'Worker_id'])

# One row per manager; np.tile repeats the three manager columns so they also fill the worker columns
df_managers = pd.DataFrame(np.tile(df_in[['Manager', 'Managercity', 'manager_id']].drop_duplicates(), 2), columns=df_in.columns)
# A stable sort keeps each manager's own row after that manager's workers
df_out = pd.concat([df_in, df_managers]).sort_values('Manager', kind='stable').reset_index(drop=True)
print(df_out)

Output:

  Manager Managercity manager_id  Worker  WorkerCity Worker_id
0   Jason   Charlotte       4567   Jimmy      Boston     44984
1   Jason   Charlotte       4567   Aaron      Austin     44583
2   Jason   Charlotte       4567   Jason   Charlotte      4567
3     Tom      Aurora       4500  Shelly     Chicago     43553
4     Tom      Aurora       4500    Alex     NewYork     43654
5     Tom      Aurora       4500   Kelly  Cincinnati     44674
6     Tom      Aurora       4500     Tom      Aurora      4500
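
A variation on the same idea, offered only as a sketch: np.tile goes through an object array, so the id columns in the output above are no longer integers, and sorting by 'Manager' moves Jason ahead of Tom. Assuming the column names from the question, the version below builds the manager rows with drop_duplicates(keep='last') and assign, then relies on a stable sort_index so each manager row lands right after that manager's last worker. The names mgr_cols and manager_rows are illustrative and do not come from either answer.

```python
import pandas as pd

data = [
    ['Tom', 'Aurora', 4500, 'Shelly', 'Chicago', 43553],
    ['Tom', 'Aurora', 4500, 'Alex', 'NewYork', 43654],
    ['Tom', 'Aurora', 4500, 'Kelly', 'Cincinnati', 44674],
    ['Jason', 'Charlotte', 4567, 'Jimmy', 'Boston', 44984],
    ['Jason', 'Charlotte', 4567, 'Aaron', 'Austin', 44583],
]
columns = ['Manager', 'Managercity', 'manager_id',
           'Worker', 'WorkerCity', 'Worker_id']
df_in = pd.DataFrame(data, columns=columns)

mgr_cols = ['Manager', 'Managercity', 'manager_id']  # illustrative name

# keep='last' leaves one row per manager, carrying the index label
# of that manager's last worker row
managers = df_in[mgr_cols].drop_duplicates(keep='last')

# Fill the worker columns from the manager columns
manager_rows = managers.assign(
    Worker=managers['Manager'],
    WorkerCity=managers['Managercity'],
    Worker_id=managers['manager_id'],
)

# The manager rows share index labels with the last worker rows, so a
# stable sort on the index places each one directly after its group
df_out = (
    pd.concat([df_in, manager_rows])
    .sort_index(kind='mergesort')
    .reset_index(drop=True)
)
print(df_out)
```

This keeps the managers in their original order and leaves manager_id and Worker_id as integer columns. It assumes the manager columns are identical across a manager's rows, as stated in the question.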