I want to group by col1 and col2 and, based on col2, find the values that are duplicated across the col1 groups.
Input:
col1 col2 col3
A 0 2.0
A 0 1.0
A 0 3.0
A 1 3.0
A 1 4.0
A 3 9.0
B 0 3.0
B 1 1.0
B 1 1.0
B 2 3.0
C 2 4.0
C 3 5.0
C 1 6.0
C 1 2.0
C 4 3.0
Expected output:
0 in A , B
1 in A , B , C
2 in B , C
3 in A , C
4 in C
Answer 0 (score: 2)
Try GroupBy.unique, then join the resulting values into a string:
df.groupby('col2')['col1'].unique().str.join(', ')
col2
0 A, B
1 A, B, C
2 B, C
3 A, C
4 C
Name: col1, dtype: object
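(df.groupby('col2')['col1']
   .unique()
   .str.join(', ')
   .to_frame()
   # x.name is the col2 value; x.iloc[0] is the joined col1 string
   # (positional x[0] on a label-indexed row is deprecated in recent pandas)
   .apply(lambda x: f'{x.name} in {x.iloc[0]}', axis=1))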
col2
0 0 in A, B
1 1 in A, B, C
2 2 in B, C
3 3 in A, C
4 4 in C
dtype: object
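For anyone who wants to run this end to end, here is a minimal, self-contained sketch. It rebuilds the question's input by hand (the `df` construction and the `lines` variable are my own additions, not part of the answer) and formats the result with a plain list comprehension instead of `to_frame().apply(...)`:

```python
import pandas as pd

# Input table from the question, typed in by hand.
df = pd.DataFrame({
    "col1": list("AAAAAA") + list("BBBB") + list("CCCCC"),
    "col2": [0, 0, 0, 1, 1, 3, 0, 1, 1, 2, 2, 3, 1, 1, 4],
    "col3": [2.0, 1.0, 3.0, 3.0, 4.0, 9.0, 3.0, 1.0, 1.0, 3.0,
             4.0, 5.0, 6.0, 2.0, 3.0],
})

# For each col2 value, collect the distinct col1 labels and join them.
groups = df.groupby("col2")["col1"].unique().str.join(", ")

# Build one "<col2> in <labels>" line per group.
lines = [f"{key} in {names}" for key, names in groups.items()]
print("\n".join(lines))
```

Iterating over `groups.items()` avoids the row-wise `apply`, which is usually the slower option, though for a table this small the difference should not matter.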
Answer 1 (score: 0)
You can do the following:
aggregated=df.groupby(['col2']).agg({'col1': 'unique'})
The output looks like this:
col2
0 [A, B]
1 [A, B, C]
2 [B, C]
3 [A, C]
4 [C]
If you want to format it as in your example, you can do:
aggregated.reset_index().apply('{0.col2} in {0.col1}'.format, axis='columns')
Which looks like this:
0 0 in ['A' 'B']
1 1 in ['A' 'B' 'C']
2 2 in ['B' 'C']
3 3 in ['A' 'C']
4 4 in ['C']
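Note that the lines above show the NumPy array repr (`['A' 'B']`) and not the `A, B` form from the question. One possible way to close that gap, sketched here under the assumption that `df` holds the question's col1 and col2 columns (the `formatted` name is mine), is to join each array into a string before formatting:

```python
import pandas as pd

# Only the two columns that matter here, typed in from the question.
df = pd.DataFrame({
    "col1": list("AAAAAA") + list("BBBB") + list("CCCCC"),
    "col2": [0, 0, 0, 1, 1, 3, 0, 1, 1, 2, 2, 3, 1, 1, 4],
})

aggregated = df.groupby(["col2"]).agg({"col1": "unique"})

# Turn each array of labels into "A, B" so the array repr does not leak through.
aggregated["col1"] = aggregated["col1"].map(", ".join)

formatted = aggregated.reset_index().apply(
    "{0.col2} in {0.col1}".format, axis="columns"
)
print(formatted.to_string(index=False))
```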
Answer 2 (score: 0)
select
'2019-10-24 12:07:24.567'::timestamp_ntz as orig
,TIMESTAMP_TZ_FROM_PARTS( year(orig), month(orig),day(orig), hour(orig), minute(orig), second(orig) , date_part(nanosecond, orig), 'UTC' )
,TO_TIMESTAMP_TZ(orig::varchar || ' +0000')
;
--
2019-10-24 12:07:24.567 -- TIMESTAMP_NTZ (UTC)
2019-10-24 12:07:24.567 +0000 -- TIMESTAMP_TZ (UTC)
2019-10-24 12:07:24.567 +0000 -- TIMESTAMP_TZ (UTC)