Using merge

Date: 2019-07-15 09:52:27

Tags: python sql pandas numpy merge

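Hi,

I have two data frames, and I want to iterate over subsets of the first DF and merge the values into the second DF.

My data looks like this: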

 DF1 

 product      survey_id
  x1           survey_1
  x2           survey_1
  x3           survey_2
  x4           survey_3
  x5           survey_3
  x1           survey_3
  :             : 
  x(i)         survey(j)

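My second DF contains the same products (each appears only once in DF2, i.e. they are unique), and I have added an empty column to hold the survey ID.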

DF2

product      survey_id
 x1            nan
 x2            nan
  :             :
  :             : 
 x(i)          nan

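What I want to do is take a subset of DF1 for each survey and merge it into DF2, so that if a product appears more than once, the most recent survey_id ends up in the survey_id column: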

surveys = DF1['survey_id'].unique()

for survey in surveys:
    DF2 = DF2.merge(DF1[DF1['survey_id'] == survey], how='left', on='product')

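If I sort the list of surveys, I can merge the survey data in chronological order. On each iteration I want to merge into (fill) the survey_id column, overwriting the survey_id value if a product appears more than once.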

I would like to take a subset of DF1 as

  DF1[DF1['survey_id']=='survey_1']

and merge all of that data into DF2, so that wherever x(i) matches between DF1 and DF2 we get

  DF2['survey_id'] = 'survey_1'

The next iteration of the loop would use the subset

  DF1[DF1['survey_id']=='survey_2'] 

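, and wherever the products match, survey_id would be set to "survey_2". If survey_id is still NaN it should be filled in, and an existing value should be overwritten.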
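
A minimal sketch of the loop described above, under two assumptions that are not stated in the question: the survey IDs sort chronologically as plain strings, and the empty column is created with object dtype so that strings can be written into it. The sample data is only modelled on the tables shown here.

```python
import numpy as np
import pandas as pd

# Illustrative data modelled on the DF1/DF2 tables in the question.
DF1 = pd.DataFrame({
    "product":   ["x1", "x2", "x3", "x4", "x5", "x1"],
    "survey_id": ["survey_1", "survey_1", "survey_2",
                  "survey_3", "survey_3", "survey_3"],
})
DF2 = pd.DataFrame({"product": ["x1", "x2", "x3", "x4", "x5"]})
# Empty placeholder column (object dtype so it can hold strings later).
DF2["survey_id"] = pd.Series(np.nan, index=DF2.index, dtype=object)

# Assumption: sorting the IDs as strings yields chronological order.
for survey in sorted(DF1["survey_id"].unique()):
    # Products that took part in this survey.
    products = DF1.loc[DF1["survey_id"] == survey, "product"]
    # Fill or overwrite survey_id for the matching rows of DF2.
    DF2.loc[DF2["product"].isin(products), "survey_id"] = survey

print(DF2)
```

Because later surveys are processed last, a product that appears in several surveys keeps the ID of the latest one, while products that appear in no survey stay NaN.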

Edit:

output 

product      survey_id
  x1           survey_3
  x2           survey_1
  x3           survey_2
  x4           survey_3
  x5           survey_3

I am not sure whether merge is the best way to achieve this. I tried using .loc, but that does not seem to work either:

 DF2['survey_id'] = DF1['survey_id'].loc[DF1['product'] == DF2['substance']]

2 answers:

Answer 0 (score: 0)

This is based on the following assumption:

For every product x_i, we want the survey_j for which j is largest.

>>> import pandas as pd
>>> data = {'product': ['x1', 'x1', 'x2', 'x2', 'x2'], 'survey_id': ['survey_1', 'survey_2', 'survey_1', 'survey_2', 'survey_3']}
>>> df = pd.DataFrame(data)
>>> df
  product survey_id
0      x1  survey_1
1      x1  survey_2
2      x2  survey_1
3      x2  survey_2
4      x2  survey_3
>>> df.groupby(['product'],as_index=False)['survey_id'].max()
  product survey_id
0      x1  survey_2
1      x2  survey_3
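
One caveat worth sketching: `max()` on a string column compares lexicographically, so "survey_10" would sort below "survey_2". The example below assumes every ID has the form `survey_<integer>` (an assumption, the question does not say so) and picks the row with the largest numeric suffix instead. The helper column `survey_num` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "product":   ["x1", "x1", "x2", "x2"],
    "survey_id": ["survey_2", "survey_10", "survey_9", "survey_1"],
})

# Plain string max: "survey_2" beats "survey_10" lexicographically.
string_max = df.groupby("product", as_index=False)["survey_id"].max()

# Numeric key, assuming every id looks like "survey_<integer>".
df["survey_num"] = df["survey_id"].str.extract(r"(\d+)$", expand=False).astype(int)

# Row label of the largest survey number within each product.
latest_rows = df.groupby("product")["survey_num"].idxmax()
latest = df.loc[latest_rows, ["product", "survey_id"]].reset_index(drop=True)

print(string_max)
print(latest)
```

With single-digit survey numbers, as in the question's sample, both approaches give the same result.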

Answer 1 (score: 0)

I hope this works. The idea is that from DF1 you build a data frame (keys) holding only the last survey ID for each product, and then build DF2 from it.

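import pandas as pd

dict1 = {'product': ['x1', 'x2', 'x3', 'x4', 'x5', 'x1', 'x2', 'x3', 'x4', 'x5'],
         'survey_id': ['survey_1', 'survey_1', 'survey_2', 'survey_3', 'survey_3',
                       'survey_3', 'survey_4', 'survey_4', 'survey_5', 'survey_5']}

DF1 = pd.DataFrame(dict1)

# Keep only the last row for each product (rows are assumed to be in chronological order)
keys = DF1.drop_duplicates('product', keep="last")

dict2 = {'product': ['x1', 'x2', 'x3', 'x4', 'x5']}

DF2 = pd.DataFrame(dict2)
DF2['survey_id'] = "nan"  # placeholder column, as in the question

# Drop the placeholder and join on 'product' so each product gets its latest survey_id
DF2 = pd.merge(DF2.drop(columns='survey_id'), keys, on='product', how="left")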

This produces the following DF2:

  product survey_id
0      x1  survey_3
1      x2  survey_4
2      x3  survey_4
3      x4  survey_5
4      x5  survey_5

Or simply:

DF2 = DF1.drop_duplicates('product', keep='last').sort_values('product').reset_index(drop=True)
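
The one-liner rebuilds DF2 from DF1, so it only matches the original DF2 when both contain exactly the same products. As a hedged alternative, the sketch below keeps DF2's own rows and order and fills survey_id with `Series.map`. It assumes the rows of DF1 are already in chronological order, and the product "x9" is a made-up example of a product that never appears in DF1.

```python
import pandas as pd

DF1 = pd.DataFrame({
    "product":   ["x1", "x2", "x3", "x1"],
    "survey_id": ["survey_1", "survey_1", "survey_2", "survey_3"],
})
# DF2 lists one hypothetical product ("x9") that is absent from DF1.
DF2 = pd.DataFrame({"product": ["x3", "x1", "x2", "x9"]})

# Lookup Series: product -> last survey_id seen in DF1.
latest = (DF1.drop_duplicates("product", keep="last")
             .set_index("product")["survey_id"])

# Align by product value; products missing from DF1 become NaN.
DF2["survey_id"] = DF2["product"].map(latest)

print(DF2)
```

This leaves DF2's row order untouched and makes unmatched products easy to spot, since their survey_id stays NaN.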