Question

I have two DataFrames, one for groups and one for citations.

Suppose this is my df_group:

label groupId 
1       123
2       124
3       125
4       126
5       127

and this is df_cite:

groupId new_group
123       96
124       96
125       96
123       97
124       99
124       98
125       98
126       97
127       99

This is the new df_group I would like to get:

df_group (new)

label groupId new_group
1      123     96
2      123     97
3      124     96
4      124     98
5      124     99
6      125     96
7      125     98
8      126     97
9      127     99

I tried test_out = df_group.merge(df_cite, left_on='groupId', right_on='groupId') and df_group = df_group.join(df_cite.set_index('groupId'), on=['PatNumgroupId']), but neither worked.

Before that I followed "Python: how to merge two dataframes on a column by keeping the information of the first one?", but got InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Answer 1

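I think you need to create a helper column with cumcount that numbers the repeated values within each groupId, then merge with a left join on groupId and g, and finally remove the helper column with drop.

merge needs the join columns to have the same type, so convert both to integers or both to strings: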

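# Option 1: make groupId an integer in df_group
df_group['groupId'] = df_group['groupId'].astype(int)
# Option 2: make groupId a string in df_cite instead
#df_cite['groupId'] = df_cite['groupId'].astype(str)

# Helper column g: running counter within each groupId
df_group['g'] = df_group.groupby('groupId').cumcount()
df_cite['g'] = df_cite.groupby('groupId').cumcount()

# Left join on groupId and g, then drop the helper column
test_out = df_group.merge(df_cite, on=['groupId','g'], how='left').drop('g', axis=1)
print(test_out)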
   label  groupId  new_group
0      1      123         96
1      2      124         96
2      3      125         96
3      4      126         97
4      5      127         99
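
Note that the cumcount-based merge above pairs each df_group row only with the first df_cite row for its groupId, which is why it returns five rows. If the goal is every groupId/new_group pair, as in the question's expected table, a plain merge on groupId may be closer. This is a minimal sketch, not part of the original answer. It assumes groupId is an integer column (not the index) in both frames, and the sort and relabel steps are only a guess at how the expected label column was built.

```python
import pandas as pd

# Sample data copied from the question.
df_group = pd.DataFrame({"label": [1, 2, 3, 4, 5],
                         "groupId": [123, 124, 125, 126, 127]})
df_cite = pd.DataFrame({"groupId": [123, 124, 125, 123, 124, 124, 125, 126, 127],
                        "new_group": [96, 96, 96, 97, 99, 98, 98, 97, 99]})

# One-to-many merge: one output row per matching (groupId, new_group) pair.
out = (df_group[["groupId"]]
       .merge(df_cite, on="groupId", how="left")
       .sort_values(["groupId", "new_group"])
       .reset_index(drop=True))

# Assumption: the expected label is simply a running counter 1..n.
out.insert(0, "label", list(range(1, len(out) + 1)))
print(out)
```

With the sample data this gives nine rows, one per citation, ordered by groupId and then new_group.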

Answer 2

You may want to do:

df_cite = df_cite.reset_index(drop = False)

and

df_group = df_group.reset_index(drop = False)

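to set a new index on each DataFrame. It is not clear from your question whether the DataFrames have a "regular" index or whether you have set one of the columns as the index.

In the second case, the merge does not find the column because it is the index.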
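
To illustrate the second case, here is a small sketch using a hypothetical frame in which groupId was set as the index earlier. The data is made up for the example, and whether this matches the asker's actual frames is an assumption.

```python
import pandas as pd

# Hypothetical frame where groupId was previously set as the index.
df_cite = pd.DataFrame({"groupId": [123, 124, 123],
                        "new_group": [96, 96, 97]}).set_index("groupId")
print("groupId" in df_cite.columns)   # False: groupId lives in the index

# reset_index(drop=False) moves the index back into a regular column
# and leaves a default integer index behind.
df_cite = df_cite.reset_index(drop=False)
print("groupId" in df_cite.columns)   # True
```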

This is what a DataFrame with a "normal" index looks like:

   label  groupId
0      1      123
1      2      124
2      3      125
3      4      126
4      5      127

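Compared with your question, the DataFrame above has an "extra" column on the left. That is the index. In your case, it looks like "label" is the name of the index rather than a column of df_group.

It also seems that groupId may have a different type in each DataFrame (object in one, int in the other). You can check this with df_cite.info() and df_group.info(). If they are columns, they should appear in the list, and they should have the same dtype: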

    df_cite.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9 entries, 0 to 8
    Data columns (total 2 columns):
    groupId      9 non-null int64
    new_group    9 non-null int64
    dtypes: int64(2)
    memory usage: 224.0 bytes

In this case, groupId is an integer (int64).
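
The dtype check can also be done in code. The sketch below uses a hypothetical pair of frames where groupId was read as strings in df_group; the data and the choice to convert toward df_cite's dtype are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical mismatch: groupId holds strings in one frame, integers in the other.
df_group = pd.DataFrame({"label": [1, 2], "groupId": ["123", "124"]})
df_cite = pd.DataFrame({"groupId": [123, 124], "new_group": [96, 99]})

print(df_group["groupId"].dtype, df_cite["groupId"].dtype)

# Align the key dtypes before merging.
if df_group["groupId"].dtype != df_cite["groupId"].dtype:
    df_group["groupId"] = df_group["groupId"].astype(df_cite["groupId"].dtype)

merged = df_group.merge(df_cite, on="groupId", how="left")
print(merged)
```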

python - pandas: how to map the same values to different DataFrames?
