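Add a new column to a pandas DataFrame to flag duplicates in specific columns

Time: 2017-11-14 08:19:31

Tags: python pandas dataframe

I am new to pandas and am having trouble writing a function that adds a new column based on custom conditions. Here is my DataFrame: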

    c1      c2     c3           c4              c5
0   1234    888    36.12733265  -115.1710473    7048929337
1   2341    70     33.62503113  -111.928576     7048929337
2   8910    419    40.734631    -73.8700321     9192939495
3   8910    910    40.734631    -73.8700321     9192939495
4   5678    1295   40.719729    -73.84412       5109400188
5   3345    4976   33.5350596   -112.2670918    9192939495
6   233345  2364   33.5350596   -112.2670918    4806391796
7   3010    1155   42.8254528   -71.5012724     2393900772
8   3010    6800   41.0488534   -75.313324      8434975913
9   4534    1791   42.955875    -76.92238325    9048190206
10  7658    4711   40.7635948   -73.3066489     6312542029
11  7658    9120   34.8465348   -117.0854289    6312542029

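I want to add a new column named DUP containing a flag that indicates whether the row is a duplicate with respect to specific columns. The priority order is:

1.) If the row is a duplicate on both c3 and c4, the flag should be dup_c3c4

2.) Otherwise, if the row is a duplicate on c5, the flag should be dup_c5

3.) Otherwise, if the row is a duplicate on c1, the flag should be dup_c1

4.) Otherwise the flag should be NaD (not a duplicate).

Expected output: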

    c1      c2      c3            c4            c5          DUP
0   1234    888     36.12733265  -115.1710473   7048929337  dup_c5
1   2341    70      33.62503113  -111.928576    7048929337  dup_c5
2   8910    419     40.734631    -73.8700321    9192939495  dup_c3c4
3   8910    910     40.734631    -73.8700321    9192939495  dup_c3c4
4   5678    1295    40.719729    -73.84412      5109400188  NaD
5   3345    4976    33.5350596   -112.2670918   9192939495  dup_c3c4
6   233345  2364    33.5350596   -112.2670918   4806391796  dup_c3c4
7   3010    1155    42.8254528   -71.5012724    2393900772  dup_c1
8   3010    6800    41.0488534   -75.313324     8434975913  dup_c1
9   4534    1791    42.955875    -76.92238325   9048190206  NaD
10  7658    4711    40.7635948   -73.3066489    6312542029  dup_c5
11  7658    9120    34.8465348   -117.0854289   6312542029  dup_c5

Can anyone suggest how to write a custom function for this scenario, using if/else or any other efficient approach?

1 Answer:

Answer 0: (score: 1)

Use numpy.select together with duplicated for the 3 different conditions:

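import numpy as np

# keep=False marks every row of a duplicate group, not only the later occurrences
m1 = df.duplicated(['c3','c4'], keep=False)
m2 = df.duplicated(['c5'], keep=False)
m3 = df.duplicated(['c1'], keep=False)

# np.select uses the first condition that is True for each row, so list order sets the priority
df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
print(df)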
        c1    c2         c3          c4          c5       DUP
0     1234   888  36.127333 -115.171047  7048929337    dup_c5
1     2341    70  33.625031 -111.928576  7048929337    dup_c5
2     8910   419  40.734631  -73.870032  9192939495  dup_c3c4
3     8910   910  40.734631  -73.870032  9192939495  dup_c3c4
4     5678  1295  40.719729  -73.844120  5109400188       NaD
5     3345  4976  33.535060 -112.267092  9192939495  dup_c3c4
6   233345  2364  33.535060 -112.267092  4806391796  dup_c3c4
7     3010  1155  42.825453  -71.501272  2393900772    dup_c1
8     3010  6800  41.048853  -75.313324  8434975913    dup_c1
9     4534  1791  42.955875  -76.922383  9048190206       NaD
10    7658  4711  40.763595  -73.306649  6312542029    dup_c5
11    7658  9120  34.846535 -117.085429  6312542029    dup_c5
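
To see why keep=False and the order of the conditions matter, here is a minimal sketch on a small made-up frame. The `demo` data and the variable names are illustrative assumptions, not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so that one row matches two rules at once.
demo = pd.DataFrame({
    "c1": [1, 1, 2, 3],
    "c5": [10, 20, 20, 30],
})

# Default keep='first': the first occurrence of each value is NOT flagged.
first_only = demo.duplicated(["c1"])

# keep=False: every member of a duplicate group is flagged.
all_members = demo.duplicated(["c1"], keep=False)

m_c5 = demo.duplicated(["c5"], keep=False)
m_c1 = all_members

# Row 1 is a duplicate on both c1 and c5; because m_c5 is listed first,
# np.select assigns it 'dup_c5'.
labels = np.select([m_c5, m_c1], ["dup_c5", "dup_c1"], default="NaD")

print(first_only.tolist())   # [False, True, False, False]
print(all_members.tolist())  # [True, True, False, False]
print(labels.tolist())       # ['dup_c1', 'dup_c5', 'dup_c5', 'NaD']
```

Swapping the two conditions in the list would label row 1 as dup_c1 instead, which is how the priority order from the question is encoded.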

If you need a function:

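def f(df):
    # One boolean mask per rule; keep=False flags all rows in a duplicate group
    m1 = df.duplicated(['c3','c4'], keep=False)
    m2 = df.duplicated(['c5'], keep=False)
    m3 = df.duplicated(['c1'], keep=False)

    # Note: this adds the DUP column to the DataFrame passed in and returns that same object
    df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
    return df

df1 = f(df)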
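
If the rules are likely to change, one possible variation is to pass them in as a list and return a new DataFrame instead of modifying the input. This is only a sketch: the name `flag_duplicates` and its parameters are hypothetical, and the c2 column from the question is omitted here for brevity because no rule uses it.

```python
import numpy as np
import pandas as pd

def flag_duplicates(df, rules, default="NaD", column="DUP"):
    """Return a copy of df with a flag column.

    rules is a list of (columns, label) pairs in priority order;
    the first rule whose columns are duplicated wins for each row.
    """
    conditions = [df.duplicated(cols, keep=False) for cols, _ in rules]
    labels = [label for _, label in rules]
    # assign() returns a new DataFrame, so the caller's df is left untouched.
    return df.assign(**{column: np.select(conditions, labels, default=default)})

# Data from the question (c2 left out).
df = pd.DataFrame({
    "c1": [1234, 2341, 8910, 8910, 5678, 3345, 233345, 3010, 3010, 4534, 7658, 7658],
    "c3": [36.12733265, 33.62503113, 40.734631, 40.734631, 40.719729, 33.5350596,
           33.5350596, 42.8254528, 41.0488534, 42.955875, 40.7635948, 34.8465348],
    "c4": [-115.1710473, -111.928576, -73.8700321, -73.8700321, -73.84412,
           -112.2670918, -112.2670918, -71.5012724, -75.313324, -76.92238325,
           -73.3066489, -117.0854289],
    "c5": [7048929337, 7048929337, 9192939495, 9192939495, 5109400188, 9192939495,
           4806391796, 2393900772, 8434975913, 9048190206, 6312542029, 6312542029],
})

rules = [
    (["c3", "c4"], "dup_c3c4"),
    (["c5"], "dup_c5"),
    (["c1"], "dup_c1"),
]

result = flag_duplicates(df, rules)
print(result["DUP"].tolist())
```

Because the rules are data rather than hard-coded masks, reordering the list changes the priority without touching the function body.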