Finding duplicate rows after grouping by two columns in pandas

Time: 2019-09-29 20:59:17

Tags: python python-3.x pandas

I want to group by col1 and col2, and then find which col2 values are duplicated across the col1 groups.

Input:

 col1            col2         col3
    A               0            2.0
    A               0            1.0
    A               0            3.0
    A               1            3.0
    A               1            4.0
    A               3            9.0
    B               0            3.0
    B               1            1.0
    B               1            1.0
    B               2            3.0
    C               2            4.0
    C               3            5.0
    C               1            6.0
    C               1            2.0
    C               4            3.0

Expected output:

0 in A , B
1 in A , B , C
2 in B , C
3 in A , C
4 in C

3 answers:

Answer 0 (score: 2)

Try GroupBy.unique, then join the resulting values into strings with str.join:

df.groupby('col2')['col1'].unique().str.join(', ')

col2
0       A, B
1    A, B, C
2       B, C
3       A, C
4          C
Name: col1, dtype: object

(df.groupby('col2')['col1']
   .unique()
   .str.join(', ')
   .to_frame()
   .apply(lambda x: f'{x.name} in {x["col1"]}', axis=1))

col2
0       0 in A, B
1    1 in A, B, C
2       2 in B, C
3       3 in A, C
4          4 in C
dtype: object
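
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the names `groups` and `lines` are only illustrative. It uses a plain list comprehension for the final formatting, which is one possible alternative to the `apply` call above.

```python
import pandas as pd

# Sample data copied from the question's input table.
df = pd.DataFrame({
    "col1": list("AAAAAABBBBCCCCC"),
    "col2": [0, 0, 0, 1, 1, 3, 0, 1, 1, 2, 2, 3, 1, 1, 4],
    "col3": [2.0, 1.0, 3.0, 3.0, 4.0, 9.0, 3.0, 1.0,
             1.0, 3.0, 4.0, 5.0, 6.0, 2.0, 3.0],
})

# For each col2 value, collect the distinct col1 labels
# (in order of first appearance within the group).
groups = df.groupby("col2")["col1"].unique()

# Build one "<col2 value> in <labels>" string per group.
lines = [f"{value} in {', '.join(labels)}" for value, labels in groups.items()]
print("\n".join(lines))
```

Note that col3 plays no part in the result; only col1 and col2 matter for this question.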

Answer 1 (score: 0)

You can do the following:

aggregated=df.groupby(['col2']).agg({'col1': 'unique'})

The output looks like this:

           col1
col2           
0        [A, B]
1     [A, B, C]
2        [B, C]
3        [A, C]
4           [C]

If you want to format it as in your example, you can do:

aggregated.reset_index().apply('{0.col2} in {0.col1}'.format, axis='columns')

So it looks like this:

0        0 in ['A' 'B']
1    1 in ['A' 'B' 'C']
2        2 in ['B' 'C']
3        3 in ['A' 'C']
4            4 in ['C']
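
The result above prints each group as a NumPy array (`['A' 'B']`) rather than the comma-separated form in the question. One possible way to close that gap, sketched here under the assumption that the input is the question's table rebuilt by hand, is to join each array into a string before formatting. The name `formatted` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question's input table.
df = pd.DataFrame({
    "col1": list("AAAAAABBBBCCCCC"),
    "col2": [0, 0, 0, 1, 1, 3, 0, 1, 1, 2, 2, 3, 1, 1, 4],
    "col3": [2.0, 1.0, 3.0, 3.0, 4.0, 9.0, 3.0, 1.0,
             1.0, 3.0, 4.0, 5.0, 6.0, 2.0, 3.0],
})

aggregated = df.groupby(["col2"]).agg({"col1": "unique"})

# Turn each array of labels into a single "A, B" style string,
# so the formatted text does not show the array repr.
aggregated["col1"] = aggregated["col1"].map(", ".join)

formatted = aggregated.reset_index().apply(
    "{0.col2} in {0.col1}".format, axis="columns"
)
print(formatted.to_string(index=False))
```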

Answer 2 (score: 0)

select 
       '2019-10-24 12:07:24.567'::timestamp_ntz as orig
       ,TIMESTAMP_TZ_FROM_PARTS( year(orig), month(orig),day(orig), hour(orig), minute(orig), second(orig) , date_part(nanosecond, orig), 'UTC' ) 
       ,TO_TIMESTAMP_TZ(orig::varchar || ' +0000')
;

--
2019-10-24 12:07:24.567        -- TIMESTAMP_NTZ (UTC)
2019-10-24 12:07:24.567 +0000  -- TIMESTAMP_TZ  (UTC)
2019-10-24 12:07:24.567 +0000  -- TIMESTAMP_TZ  (UTC)