I'm new to pandas and am having trouble writing a function that adds a new column based on custom conditions. Here is my data frame:
c1 c2 c3 c4 c5
0 1234 888 36.12733265 -115.1710473 7048929337
1 2341 70 33.62503113 -111.928576 7048929337
2 8910 419 40.734631 -73.8700321 9192939495
3 8910 910 40.734631 -73.8700321 9192939495
4 5678 1295 40.719729 -73.84412 5109400188
5 3345 4976 33.5350596 -112.2670918 9192939495
6 233345 2364 33.5350596 -112.2670918 4806391796
7 3010 1155 42.8254528 -71.5012724 2393900772
8 3010 6800 41.0488534 -75.313324 8434975913
9 4534 1791 42.955875 -76.92238325 9048190206
10 7658 4711 40.7635948 -73.3066489 6312542029
11 7658 9120 34.8465348 -117.0854289 6312542029
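I want to add a new column named DUP that holds a flag indicating whether the row is a duplicate (with respect to specific columns). The order of priority is:
1.) If the row is a duplicate on both c3 and c4, the flag should be dup_c3c4
2.) Otherwise, if the row is a duplicate on c5, the flag should be dup_c5
3.) Otherwise, if the row is a duplicate on c1, the flag should be dup_c1
4.) Otherwise, the flag should be NaD (not duplicate).
Expected output: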
c1 c2 c3 c4 c5 DUP
0 1234 888 36.12733265 -115.1710473 7048929337 dup_c5
1 2341 70 33.62503113 -111.928576 7048929337 dup_c5
2 8910 419 40.734631 -73.8700321 9192939495 dup_c3c4
3 8910 910 40.734631 -73.8700321 9192939495 dup_c3c4
4 5678 1295 40.719729 -73.84412 5109400188 NaD
5 3345 4976 33.5350596 -112.2670918 9192939495 dup_c3c4
6 233345 2364 33.5350596 -112.2670918 4806391796 dup_c3c4
7 3010 1155 42.8254528 -71.5012724 2393900772 dup_c1
8 3010 6800 41.0488534 -75.313324 8434975913 dup_c1
9 4534 1791 42.955875 -76.92238325 9048190206 NaD
10 7658 4711 40.7635948 -73.3066489 6312542029 dup_c5
11 7658 9120 34.8465348 -117.0854289 6312542029 dup_c5
Can anyone suggest how to write a custom function for this scenario, using if/else or any other efficient approach?
Answer 0 (score: 1)
Use numpy.select together with duplicated, building one boolean mask for each of the 3 conditions:
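import numpy as np

# keep=False marks every row of a duplicate group, not only the later occurrences
m1 = df.duplicated(['c3','c4'], keep=False)
m2 = df.duplicated(['c5'], keep=False)
m3 = df.duplicated(['c1'], keep=False)
# np.select uses the first condition that is True, which gives the priority order
df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
print(df)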
c1 c2 c3 c4 c5 DUP
0 1234 888 36.127333 -115.171047 7048929337 dup_c5
1 2341 70 33.625031 -111.928576 7048929337 dup_c5
2 8910 419 40.734631 -73.870032 9192939495 dup_c3c4
3 8910 910 40.734631 -73.870032 9192939495 dup_c3c4
4 5678 1295 40.719729 -73.844120 5109400188 NaD
5 3345 4976 33.535060 -112.267092 9192939495 dup_c3c4
6 233345 2364 33.535060 -112.267092 4806391796 dup_c3c4
7 3010 1155 42.825453 -71.501272 2393900772 dup_c1
8 3010 6800 41.048853 -75.313324 8434975913 dup_c1
9 4534 1791 42.955875 -76.922383 9048190206 NaD
10 7658 4711 40.763595 -73.306649 6312542029 dup_c5
11 7658 9120 34.846535 -117.085429 6312542029 dup_c5
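To make the approach above easier to check, here is a minimal self-contained sketch that rebuilds the data frame from the question and compares the default `duplicated` behaviour with `keep=False`. The variable names `first_only`, `all_members` and `dup` are only illustrative and do not appear in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data frame so the snippet runs on its own.
df = pd.DataFrame({
    "c1": [1234, 2341, 8910, 8910, 5678, 3345, 233345, 3010, 3010, 4534, 7658, 7658],
    "c2": [888, 70, 419, 910, 1295, 4976, 2364, 1155, 6800, 1791, 4711, 9120],
    "c3": [36.12733265, 33.62503113, 40.734631, 40.734631, 40.719729, 33.5350596,
           33.5350596, 42.8254528, 41.0488534, 42.955875, 40.7635948, 34.8465348],
    "c4": [-115.1710473, -111.928576, -73.8700321, -73.8700321, -73.84412, -112.2670918,
           -112.2670918, -71.5012724, -75.313324, -76.92238325, -73.3066489, -117.0854289],
    "c5": [7048929337, 7048929337, 9192939495, 9192939495, 5109400188, 9192939495,
           4806391796, 2393900772, 8434975913, 9048190206, 6312542029, 6312542029],
})

# The default keep='first' leaves the first row of each duplicate group unmarked ...
first_only = df.duplicated(["c3", "c4"])
# ... while keep=False marks every member of the group, which is what the flag needs.
all_members = df.duplicated(["c3", "c4"], keep=False)

m1 = all_members
m2 = df.duplicated(["c5"], keep=False)
m3 = df.duplicated(["c1"], keep=False)

# np.select checks the conditions in list order and takes the first True one,
# so rows 2 and 3 (duplicated on c3/c4, on c5 and on c1) receive 'dup_c3c4'.
dup = np.select([m1, m2, m3], ["dup_c3c4", "dup_c5", "dup_c1"], default="NaD")
print(dup)
```

With the default `keep='first'`, only rows 3 and 6 would be marked for the c3/c4 condition, so rows 2 and 5 would fall through to the lower-priority rules.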
If you need a function:
def f(df):
    m1 = df.duplicated(['c3','c4'], keep=False)
    m2 = df.duplicated(['c5'], keep=False)
    m3 = df.duplicated(['c1'], keep=False)
    df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
    return df
df1 = f(df)
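Note that `f` writes the new column into the frame that is passed in, so `df` and `df1` refer to the same object. If that is not wanted, one possible variant is sketched below. It is only an illustration: the helper name `flag_duplicates`, its parameters and the small `demo` frame are made up for this example and are not part of the question. It takes the rules as an ordered list, highest priority first, and returns a new frame via `DataFrame.assign`.

```python
import numpy as np
import pandas as pd

def flag_duplicates(df, rules, default="NaD", column="DUP"):
    """Return a copy of df with a flag column.

    rules is an ordered list of (columns, label) pairs, highest priority first.
    (Hypothetical helper, written for illustration.)
    """
    # One mask per rule; keep=False marks every row of a duplicate group.
    masks = [df.duplicated(cols, keep=False) for cols, _ in rules]
    labels = [label for _, label in rules]
    # assign returns a new frame, so the input is left unchanged.
    return df.assign(**{column: np.select(masks, labels, default=default)})

# Small made-up frame, used only to exercise the helper.
demo = pd.DataFrame({
    "c1": [1, 1, 2, 3, 4],
    "c3": [0.5, 0.7, 0.9, 0.9, 0.1],
    "c4": [1.0, 2.0, 3.0, 3.0, 9.0],
    "c5": [10, 20, 20, 30, 40],
})
rules = [(["c3", "c4"], "dup_c3c4"), (["c5"], "dup_c5"), (["c1"], "dup_c1")]
out = flag_duplicates(demo, rules)
print(out)
```

Keeping the rules in a list means the priority order can be changed by reordering the list, without touching the function body.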