I have a pandas DataFrame similar to the one below. I am trying to merge all rows that share the same pair of ID and CountryCode values.
import pandas as pd

records = [(1, 'IN', 'yes', '', '', '', ''),
           (1, 'MY', '', 'yes', '', '', ''),
           (1, 'MY', '', '', 'yes', '', ''),
           (1, 'MY', '', '', '', 'yes', ''),
           (1, 'US', '', '', '', '', 'yes'),
           (2, 'MY', 'yes', '', '', '', ''),
           (2, 'UK', '', 'yes', '', '', '')]
dfRecords = pd.DataFrame(records, columns=['ID', 'CountryCode', 'Address', 'MobileNo', 'HomeNo', 'OfficeNo', 'TacNo'])
Output:
ID  CountryCode  Address  MobileNo  HomeNo  OfficeNo  TacNo
1   IN           yes
1   MY                    yes
1   MY                              yes
1   MY                                      yes
1   US                                                yes
2   MY           yes
2   UK                    yes
This is what I need:
ID  CountryCode  Address  MobileNo  HomeNo  OfficeNo  TacNo
1   IN           yes
1   MY                    yes       yes     yes
1   US                                                yes
2   MY           yes
2   UK                    yes
My idea is that I need to use groupby() on the ID and CountryCode columns, but I cannot work out how to merge the rows within each group.
groupings = dfRecords.groupby(['ID','CountryCode'])
groupings.groups
Output:
{(1, 'IN'): Int64Index([0], dtype='int64'),
(1, 'MY'): Int64Index([1, 2, 3], dtype='int64'),
(1, 'US'): Int64Index([4], dtype='int64'),
(2, 'MY'): Int64Index([5], dtype='int64'),
(2, 'UK'): Int64Index([6], dtype='int64')}
Answer 0 (score: 2)
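Use max. This works because the string 'yes' compares greater than the empty string '', so the maximum of each column within a group is 'yes' whenever any row in that group has it: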
dfRecords.groupby(['ID', 'CountryCode'], as_index=False).max()
   ID  CountryCode  Address  MobileNo  HomeNo  OfficeNo  TacNo
0  1   IN           yes
1  1   MY                    yes       yes     yes
2  1   US                                                yes
3  2   MY           yes
4  2   UK                    yes
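A minimal, self-contained sketch of the max approach, reusing the sample data from the question. It is only meant to show why the aggregation works; the names merged and my_row are illustrative and not part of the original answer.

```python
import pandas as pd

records = [
    (1, 'IN', 'yes', '', '', '', ''),
    (1, 'MY', '', 'yes', '', '', ''),
    (1, 'MY', '', '', 'yes', '', ''),
    (1, 'MY', '', '', '', 'yes', ''),
    (1, 'US', '', '', '', '', 'yes'),
    (2, 'MY', 'yes', '', '', '', ''),
    (2, 'UK', '', 'yes', '', '', ''),
]
columns = ['ID', 'CountryCode', 'Address', 'MobileNo', 'HomeNo', 'OfficeNo', 'TacNo']
dfRecords = pd.DataFrame(records, columns=columns)

# Python compares strings lexicographically, and any non-empty
# string sorts after the empty string.
assert 'yes' > ''

# One output row per (ID, CountryCode) pair; each remaining column
# holds the largest string seen in that group.
merged = dfRecords.groupby(['ID', 'CountryCode'], as_index=False).max()

# Illustrative check: the three (1, 'MY') rows collapse into one.
my_row = merged[(merged['ID'] == 1) & (merged['CountryCode'] == 'MY')].iloc[0]
print(merged)
```

Note that this relies on every meaningful value sorting after ''. If a column could hold several different non-empty values, max would keep the lexicographically largest one, which may not be the one you want.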
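Alternatively, use first, which does not rely on max. Mask the empty strings to NaN, take the first non-null value of each column within each group, then fill the remaining NaN values back with '':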
g = dfRecords.mask(dfRecords == '').groupby(['ID', 'CountryCode'], as_index=False)
g.first().fillna('')
   ID  CountryCode  Address  MobileNo  HomeNo  OfficeNo  TacNo
0  1   IN           yes
1  1   MY                    yes       yes     yes
2  1   US                                                yes
3  2   MY           yes
4  2   UK                    yes
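To illustrate where the two approaches can differ, here is a small sketch on made-up data (the 'no' value and the df, by_first, and by_max names are assumptions for illustration, not from the question). GroupBy.first returns the first non-null entry per column, so it follows row order, whereas max follows string ordering.

```python
import pandas as pd

# Hypothetical data: one group whose MobileNo column holds two
# different non-empty values.
df = pd.DataFrame({
    'ID': [1, 1, 1],
    'CountryCode': ['MY', 'MY', 'MY'],
    'MobileNo': ['', 'no', 'yes'],
    'HomeNo': ['', '', 'yes'],
})

# first(): empty strings become NaN, then the first non-null value
# in row order is kept for each column.
by_first = (
    df.mask(df == '')
      .groupby(['ID', 'CountryCode'], as_index=False)
      .first()
      .fillna('')
)

# max(): the lexicographically largest string is kept.
by_max = df.groupby(['ID', 'CountryCode'], as_index=False).max()

print(by_first)
print(by_max)
```

On the question's data the two give the same result, because each column has at most one distinct non-empty value per group.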