Question

Hi,

I have two DataFrames. I want to iterate over subsets of the first one and merge their values into the second.

My data looks like this:

 DF1 

 product      survey_id
  x1           survey_1
  x2           survey_1
  x3           survey_2
  x4           survey_3
  x5           survey_3
  x1           survey_3
  :             : 
  x(i)         survey(j)

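My second DataFrame contains the same products (each appears only once in DF2, i.e. they are unique), and I have added an empty column to hold the survey ID.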

DF2

product      survey_id
 x1            nan
 x2            nan
  :             :
  :             : 
 x(i)          nan

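What I want to do is take a subset of DF1 for each survey and merge it into DF2, so that if a product appears more than once, the most recent survey_id ends up in the survey_id column: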

surveys = DF1['survey_id'].unique()

for survey in surveys:
    DF2 = DF2.merge(DF1[DF1['survey_id'] == survey], how='left', on='product')

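If I sort the list of surveys, I can merge the survey data in chronological order. On each iteration I want to merge into (fill) the survey_id column, overwriting the survey_id value whenever a product appears more than once.

I want to take a subset of DF1 such as

  DF1[DF1['survey_id']=='survey_1']

and merge all of that data into DF2. So wherever x(i) matches between DF1 and DF2, we get

  DF2['survey_id'] = 'survey_1'

The next iteration of the loop would use the subset

  DF1[DF1['survey_id']=='survey_2']

and wherever the products match, survey_id would be set to 'survey_2'. If survey_id is still NaN it should be filled in; if it already has a value, it should be overwritten.

Edit: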

output 

product      survey_id
  x1           survey_3
  x2           survey_1
  x3           survey_2
  x4           survey_3
  x5           survey_3

I am not sure whether a merge is the best way to do this. I also tried using .loc, but that does not seem to work either:

 DF2['survey_id'] = DF1['survey_id'].loc[DF1['product'] == DF2['substance']]
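For reference, the loop described above can be sketched with a boolean mask on DF2 instead of a merge. This is only a sketch under assumptions: the sample frames are built by hand from the tables above, the placeholder column is cast to object so it can hold strings, and plain `sorted()` is assumed to give chronological order (which holds for survey_1 to survey_3 but not once IDs reach survey_10). The .loc attempt above likely fails because comparing two Series of different lengths with `==` raises a ValueError in pandas; `isin` avoids that comparison.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the tables in the question.
DF1 = pd.DataFrame({
    "product": ["x1", "x2", "x3", "x4", "x5", "x1"],
    "survey_id": ["survey_1", "survey_1", "survey_2",
                  "survey_3", "survey_3", "survey_3"],
})
DF2 = pd.DataFrame({"product": ["x1", "x2", "x3", "x4", "x5"],
                    "survey_id": np.nan})

# The NaN placeholder column is float; cast it so it can hold strings.
DF2["survey_id"] = DF2["survey_id"].astype(object)

# Assumption: sorting the IDs as strings yields chronological order.
for survey in sorted(DF1["survey_id"].unique()):
    # Products covered by this survey.
    products = DF1.loc[DF1["survey_id"] == survey, "product"]
    # Fill or overwrite survey_id for matching products in DF2.
    DF2.loc[DF2["product"].isin(products), "survey_id"] = survey

print(DF2)
```

Because later surveys overwrite earlier ones, x1 ends up with survey_3, which matches the expected output in the edit above.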

Answer 1

This is based on the following assumption:

For every product x_i, we want the survey_j for which j is largest.

>>> import pandas as pd
>>> data = {'product':['x1','x1','x2','x2','x2'], 'survey_id':['survey_1','survey_2','survey_1', 'survey_2', 'survey_3'] } 
>>> df = pd.DataFrame(data)
>>> df
  product survey_id
0      x1  survey_1
1      x1  survey_2
2      x2  survey_1
3      x2  survey_2
4      x2  survey_3
>>> df.groupby(['product'],as_index=False)['survey_id'].max()
  product survey_id
0      x1  survey_2
1      x2  survey_3
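One caveat: `max()` on strings compares lexicographically, so 'survey_10' sorts before 'survey_2'. If the IDs can reach two digits, a rough workaround is to compare the numeric suffix instead. The sketch below assumes every ID has the form `survey_<integer>`; the sample data and the names `naive`, `survey_num` and `latest` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["x1", "x1", "x2", "x2", "x2"],
    "survey_id": ["survey_2", "survey_10", "survey_1", "survey_9", "survey_3"],
})

# String comparison: "survey_2" > "survey_10", so x1 gets the wrong survey.
naive = df.groupby("product", as_index=False)["survey_id"].max()

# Assumption: every ID looks like "survey_<integer>".
survey_num = df["survey_id"].str.extract(r"_(\d+)$", expand=False).astype(int)

# For each product, pick the row whose numeric suffix is largest.
latest = df.loc[survey_num.groupby(df["product"]).idxmax()].reset_index(drop=True)

print(naive)
print(latest)
```

Here `naive` reports survey_2 for x1, while `latest` reports survey_10.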

Answer 2

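I hope this works. The idea is that from DF1 you can build a DataFrame (keys) that holds only the last survey ID for each product, and then build DF2 from it.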

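import pandas as pd

dict1 = {'product':['x1','x2','x3','x4','x5', 'x1', 'x2', 'x3', 'x4', 'x5'],
            'survey_id':['survey_1','survey_1','survey_2', 'survey_3', 'survey_3',
            'survey_3', 'survey_4', 'survey_4', 'survey_5', 'survey_5'] }

DF1 = pd.DataFrame(dict1)

# Keep only the last row for each product, i.e. its most recent survey
# (this assumes DF1 is already in chronological order).
keys = DF1.drop_duplicates('product', keep="last")

dict2 = {'product':['x1','x2','x3','x4','x5']}

DF2 = pd.DataFrame(dict2)

# Attach the latest survey_id to each product in DF2. No placeholder
# survey_id column is needed; the merge creates it.
DF2 = pd.merge(DF2, keys, how="left", on="product")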

This yields the following DF2:

product survey_id
0   x1  survey_3
1   x2  survey_4
2   x3  survey_4
3   x4  survey_5
4   x5  survey_5

Or simply:

DF2 = DF1.drop_duplicates('product', keep='last').sort_values('product').reset_index(drop=True)
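If DF2 has extra columns, or products that never appear in DF1, rebuilding it from DF1 would lose them. In that case a lookup with `Series.map` might be a better fit. The sketch below is only an illustration: the `price` column, the product 'x9' and the name `latest` are hypothetical, and DF1 is again assumed to be in chronological order.

```python
import pandas as pd

DF1 = pd.DataFrame({
    "product": ["x1", "x2", "x3", "x4", "x5", "x1"],
    "survey_id": ["survey_1", "survey_1", "survey_2",
                  "survey_3", "survey_3", "survey_3"],
})
# Hypothetical DF2: "x9" is not in DF1, and "price" is an extra column.
DF2 = pd.DataFrame({"product": ["x1", "x2", "x9"], "price": [10, 20, 30]})

# Last survey seen for each product, as a Series indexed by product.
latest = (DF1.drop_duplicates("product", keep="last")
             .set_index("product")["survey_id"])

# Look up each DF2 product; products missing from DF1 stay NaN.
DF2["survey_id"] = DF2["product"].map(latest)

print(DF2)
```

DF2 keeps its own rows, order and columns; only survey_id is filled in.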
