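Hi,
I have two DataFrames, and I want to iterate over subsets of the first one and merge their values into the second.
My data looks like this: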
DF1
product survey_id
x1 survey_1
x2 survey_1
x3 survey_2
x4 survey_3
x5 survey_3
x1 survey_3
: :
x(i) survey(j)
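My second DataFrame contains the same products (each appears only once in DF2, i.e. it is unique), and I have added an empty column to hold the survey ID.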
DF2
product survey_id
x1 nan
x2 nan
: :
: :
x(i) nan
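What I want to do is take a subset of DF1 for each survey and merge it into DF2, so that if a product appears more than once, the most recent survey_id ends up in the survey_id column: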
surveys = DF1['survey_id'].unique()
for survey in surveys:
    DF2 = DF2.merge(DF1[DF1['survey_id'] == survey], how='left', on='product')
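If I sort the list of surveys, I can merge the survey data in chronological order. On each iteration I want to merge into (fill) the survey_id column, overwriting the survey_id value when a product appears more than once.
I want to take a subset of DF1 such as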
DF1[DF1['survey_id']=='survey_1']
and merge all of that data into DF2. So wherever x(i) matches between DF1 and DF2, we get
DF2['survey_id'] = 'survey_1'
The next iteration of the loop would use the subset
DF1[DF1['survey_id']=='survey_2']
and, wherever the products match, set survey_id to 'survey_2'. A survey_id that is still NaN should be filled in, and an existing value should be overwritten.
Edit:
Expected output:
product survey_id
x1 survey_3
x2 survey_1
x3 survey_2
x4 survey_3
x5 survey_3
I'm not sure whether a merge is the best way to do this. I also tried using .loc, but that doesn't seem to work either:
DF2['survey_id'] = DF1['survey_id'].loc[DF1['product'] == DF2['substance']]
Answer 0 (score: 0)
This is based on the following assumption:
For every product x_i, we want the survey_j with the largest j.
>>> import pandas as pd
>>> data = {'product':['x1','x1','x2','x2','x2'], 'survey_id':['survey_1','survey_2','survey_1', 'survey_2', 'survey_3'] }
>>> df = pd.DataFrame(data)
>>> df
product survey_id
0 x1 survey_1
1 x1 survey_2
2 x2 survey_1
3 x2 survey_2
4 x2 survey_3
>>> df.groupby(['product'],as_index=False)['survey_id'].max()
product survey_id
0 x1 survey_2
1 x2 survey_3
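One caveat with the groupby above: max() on strings compares them lexicographically, so 'survey_10' would sort below 'survey_9'. Below is a minimal sketch of a numeric comparison. It assumes every ID has the form "survey_<integer>", which the question does not confirm, and the sample rows (including survey_9 and survey_10) are made up for illustration.

```python
import pandas as pd

# Made-up sample data; survey_9 and survey_10 are included on purpose
# to show where a plain string max() picks the wrong row.
df = pd.DataFrame({
    "product": ["x1", "x1", "x2", "x2", "x2"],
    "survey_id": ["survey_1", "survey_2", "survey_9", "survey_10", "survey_3"],
})

# Assumption: each ID ends in an integer, e.g. "survey_10".
# Extract that trailing number so the comparison is numeric.
survey_num = df["survey_id"].str.extract(r"(\d+)$", expand=False).astype(int)

# For each product, find the index label of the row with the largest
# survey number, then select those rows from the original frame.
latest = df.loc[survey_num.groupby(df["product"]).idxmax()].reset_index(drop=True)

print(latest)
```

If the IDs are really dates or carry some other ordering, the same pattern should still apply: build a sortable key Series first, then use idxmax per product.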
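Answer 1 (score: 0)
I hope this works. The idea is that, starting from DF1, you create a DataFrame (keys) that holds only the last survey ID for each product, and then build DF2 from it.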
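import pandas as pd

dict1 = {'product':['x1','x2','x3','x4','x5', 'x1', 'x2', 'x3', 'x4', 'x5'],
         'survey_id':['survey_1','survey_1','survey_2', 'survey_3', 'survey_3',
                      'survey_3', 'survey_4', 'survey_4', 'survey_5', 'survey_5'] }
DF1 = pd.DataFrame(dict1)
# Keep only the last row per product, i.e. its most recent survey
# (this relies on DF1 already being in chronological order).
keys = DF1.drop_duplicates('product', keep="last")
dict2 = {'product':['x1','x2','x3','x4','x5']}
DF2 = pd.DataFrame(dict2)
DF2['survey_id'] = "nan"  # placeholder column, as in the question
DF2.head()
# Merge on 'product' only; the placeholder column is left out so it is
# not used as a second join key.
DF2 = DF2[['product']].merge(keys, on='product', how="left")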
This produces the following DF2:
product survey_id
0 x1 survey_3
1 x2 survey_4
2 x3 survey_4
3 x4 survey_5
4 x5 survey_5
Or simply:
DF2 = DF1.drop_duplicates('product', keep='last').sort_values('product').reset_index(drop=True)
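The one-liner above rebuilds DF2 from DF1, so it only contains products that occur in DF1. If DF2 already exists and its rows and order need to stay as they are, a lookup with Series.map may be a better fit. This is only a sketch: the data is made up, "x6" is a hypothetical product that is missing from DF1, and DF1 is assumed to be in chronological order.

```python
import pandas as pd

DF1 = pd.DataFrame({
    "product": ["x1", "x2", "x3", "x4", "x5", "x1"],
    "survey_id": ["survey_1", "survey_1", "survey_2",
                  "survey_3", "survey_3", "survey_3"],
})

# DF2 lists each product once, in its own order.
# "x6" is a hypothetical product that never appears in DF1.
DF2 = pd.DataFrame({"product": ["x5", "x1", "x6", "x2"]})

# Build a product -> latest survey_id lookup. keep="last" means the last
# row per product wins, so DF1 must already be sorted chronologically.
latest = (
    DF1.drop_duplicates("product", keep="last")
       .set_index("product")["survey_id"]
)

# Look up each DF2 product; products missing from DF1 become NaN.
DF2["survey_id"] = DF2["product"].map(latest)

print(DF2)
```

This does the same overwriting that the loop in the question aims for, but in a single pass, and it leaves DF2's row order untouched.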