dataframe.duplicated() does not detect duplicate rows

Time: 2019-07-08 09:13:41

Tags: pandas dataframe

duplicated() returns False for every row, even though the rows are identical apart from their index.

Input

    # Assumes `import pandas as pd`; `old_data` and `journal.data` are defined earlier.
    # Drop the stray "Unnamed" columns from the previously saved data.
    old_data = old_data.loc[:, ~old_data.columns.str.contains('^Unnamed')]
    print("bottom_slice")
    # Skip the first 10% of the old rows.
    bottom_slice_length = len(old_data.index)
    adjusted_bottom_slice_length = int(bottom_slice_length * 0.1)
    bottom_slice = old_data[adjusted_bottom_slice_length:]
    print(bottom_slice)
    new_data = pd.DataFrame.from_records(journal.data)
    top_slice_length = len(new_data.index)
    print("top slice")
    # Keep the first 90% of the new rows.
    adjusted_top_slice_length = int(top_slice_length * 0.9)
    top_slice = new_data[:adjusted_top_slice_length]
    print(top_slice)
    kimera = pd.concat([top_slice, bottom_slice])
    #print("kimera")
    #print(kimera)
    print(kimera.duplicated())
    #kimera = kimera.drop_duplicates()
    print("kimera1")
    print(kimera)

Output

bottom_slice
     client_id                       date  ...  type_id   unit_price
4     94904480  2019-06-30T01:31:01+00:00  ...    11186  37177999.84
5   2113704258  2019-06-29T10:46:53+00:00  ...    12044  33996998.00
6   2115385566  2019-06-27T12:07:58+00:00  ...    11393  44899999.98
7   1732767131  2019-06-27T09:22:24+00:00  ...       38       325.24
8     93204128  2019-06-26T20:47:01+00:00  ...    11198  35999999.98
9     90216786  2019-06-25T23:51:48+00:00  ...    11172  35999999.99
10    91205905  2019-06-25T19:59:21+00:00  ...    16275       600.00
11  2113996003  2019-06-25T16:52:14+00:00  ...    11190  39999999.96
12    96345205  2019-06-25T16:39:49+00:00  ...    16275       600.00
13    95103814  2019-06-25T01:16:28+00:00  ...    11202  29999998.93
14   543983309  2019-06-24T14:05:49+00:00  ...    11172  27415377.17
15  2114159703  2019-06-23T21:20:04+00:00  ...       34         6.30
16  2114159703  2019-06-23T15:28:37+00:00  ...    16274       850.00
17  1872130440  2019-06-23T10:02:21+00:00  ...    11400  38498999.98
18  2112790910  2019-06-23T00:00:46+00:00  ...    11202  28394499.36
19  2115326382  2019-06-22T22:42:00+00:00  ...    11371  37150194.88
20    96768321  2019-06-22T17:02:14+00:00  ...    37481  88999999.99
21  1009077082  2019-06-21T23:35:03+00:00  ...    11379  42000000.00
22   755876330  2019-06-21T12:27:59+00:00  ...    11186  37177999.86
23  1556713165  2019-06-20T23:27:23+00:00  ...    11393  36997999.87
24   513171897  2019-06-19T15:58:51+00:00  ...    11381  43817993.86
25    96711003  2019-06-18T17:50:15+00:00  ...    11198  36999999.99
26   408059764  2019-06-18T15:36:49+00:00  ...    11172  35000000.00
27  1276544138  2019-06-17T21:32:47+00:00  ...    11379  41000000.00
28    94184713  2019-06-17T03:30:26+00:00  ...    37481  86999999.99
29  2113441660  2019-06-16T04:12:59+00:00  ...    37458  34948998.99
30   755284989  2019-06-15T19:54:44+00:00  ...    37458  34999999.97
31  1731319339  2019-06-13T12:00:14+00:00  ...    11379  42000000.00
32    96053157  2019-06-12T04:07:15+00:00  ...    37483  85500002.17
33  1690931127  2019-06-12T00:44:40+00:00  ...    37482  61699999.97
34    92812153  2019-06-11T05:23:09+00:00  ...    37460  36499999.99
35  2114791711  2019-06-10T16:14:59+00:00  ...    11371  41499999.99
36  1547875730  2019-06-10T15:22:53+00:00  ...    17887       999.99
37   227535700  2019-06-10T15:12:06+00:00  ...    16272       544.50
38    95165645  2019-06-10T06:32:52+00:00  ...    11393  53989999.99
39  1859791498  2019-06-10T05:35:57+00:00  ...    22460  62000000.00
40  2112629749  2019-06-09T15:46:46+00:00  ...     2549   1800000.00
41    94391975  2019-06-08T00:06:12+00:00  ...    37460  36499999.99
42    91521700  2019-06-07T14:11:45+00:00  ...    11393  49997999.98
43  1171184159  2019-06-06T18:10:19+00:00  ...    12044  33997997.81
44    96410073  2019-06-05T17:32:01+00:00  ...    11371  46999999.96

[41 rows x 10 columns]
top slice
     client_id                       date  ...  type_id    unit_price
0     96644839  2019-07-07T02:02:45+00:00  ...    37457  2.900000e+07
1   2113806433  2019-07-06T18:13:12+00:00  ...    37482  7.300000e+07
2   1240358507  2019-07-05T19:38:20+00:00  ...    11381  4.399900e+07
3     97005654  2019-07-05T04:12:23+00:00  ...       38  3.999900e+02
4     97005654  2019-07-05T02:49:26+00:00  ...       38  3.999900e+02
5   1857838543  2019-07-03T20:08:15+00:00  ...    37482  6.900000e+07
6     92337897  2019-07-03T14:44:32+00:00  ...    11365  4.480000e+07
7   2114793091  2019-07-01T23:04:26+00:00  ...    12044  3.000000e+07
8     95826459  2019-06-30T07:22:45+00:00  ...    37482  1.190000e+08
9     94904480  2019-06-30T01:31:01+00:00  ...    11186  3.717800e+07
10  2113704258  2019-06-29T10:46:53+00:00  ...    12044  3.399700e+07
11  2115385566  2019-06-27T12:07:58+00:00  ...    11393  4.490000e+07
12  1732767131  2019-06-27T09:22:24+00:00  ...       38  3.252400e+02
13    93204128  2019-06-26T20:47:01+00:00  ...    11198  3.600000e+07
14    90216786  2019-06-25T23:51:48+00:00  ...    11172  3.600000e+07
15    91205905  2019-06-25T19:59:21+00:00  ...    16275  6.000000e+02
16  2113996003  2019-06-25T16:52:14+00:00  ...    11190  4.000000e+07
17    96345205  2019-06-25T16:39:49+00:00  ...    16275  6.000000e+02
18    95103814  2019-06-25T01:16:28+00:00  ...    11202  3.000000e+07
19   543983309  2019-06-24T14:05:49+00:00  ...    11172  2.741538e+07
20  2114159703  2019-06-23T21:20:04+00:00  ...       34  6.300000e+00
21  2114159703  2019-06-23T15:28:37+00:00  ...    16274  8.500000e+02
22  1872130440  2019-06-23T10:02:21+00:00  ...    11400  3.849900e+07
23  2112790910  2019-06-23T00:00:46+00:00  ...    11202  2.839450e+07
24  2115326382  2019-06-22T22:42:00+00:00  ...    11371  3.715019e+07
25    96768321  2019-06-22T17:02:14+00:00  ...    37481  8.900000e+07
26  1009077082  2019-06-21T23:35:03+00:00  ...    11379  4.200000e+07
27   755876330  2019-06-21T12:27:59+00:00  ...    11186  3.717800e+07
28  1556713165  2019-06-20T23:27:23+00:00  ...    11393  3.699800e+07
29   513171897  2019-06-19T15:58:51+00:00  ...    11381  4.381799e+07
30    96711003  2019-06-18T17:50:15+00:00  ...    11198  3.700000e+07
31   408059764  2019-06-18T15:36:49+00:00  ...    11172  3.500000e+07
32  1276544138  2019-06-17T21:32:47+00:00  ...    11379  4.100000e+07
33    94184713  2019-06-17T03:30:26+00:00  ...    37481  8.700000e+07
34  2113441660  2019-06-16T04:12:59+00:00  ...    37458  3.494900e+07
35   755284989  2019-06-15T19:54:44+00:00  ...    37458  3.500000e+07
36  1731319339  2019-06-13T12:00:14+00:00  ...    11379  4.200000e+07
37    96053157  2019-06-12T04:07:15+00:00  ...    37483  8.550000e+07
38  1690931127  2019-06-12T00:44:40+00:00  ...    37482  6.170000e+07
39    92812153  2019-06-11T05:23:09+00:00  ...    37460  3.650000e+07
40  2114791711  2019-06-10T16:14:59+00:00  ...    11371  4.150000e+07
41  1547875730  2019-06-10T15:22:53+00:00  ...    17887  9.999900e+02

[42 rows x 10 columns]
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
Length: 83, dtype: bool
kimera1
     client_id                       date  ...  type_id    unit_price
0     96644839  2019-07-07T02:02:45+00:00  ...    37457  2.900000e+07
1   2113806433  2019-07-06T18:13:12+00:00  ...    37482  7.300000e+07
2   1240358507  2019-07-05T19:38:20+00:00  ...    11381  4.399900e+07
3     97005654  2019-07-05T04:12:23+00:00  ...       38  3.999900e+02
4     97005654  2019-07-05T02:49:26+00:00  ...       38  3.999900e+02
5   1857838543  2019-07-03T20:08:15+00:00  ...    37482  6.900000e+07
6     92337897  2019-07-03T14:44:32+00:00  ...    11365  4.480000e+07
7   2114793091  2019-07-01T23:04:26+00:00  ...    12044  3.000000e+07
8     95826459  2019-06-30T07:22:45+00:00  ...    37482  1.190000e+08
9     94904480  2019-06-30T01:31:01+00:00  ...    11186  3.717800e+07
10  2113704258  2019-06-29T10:46:53+00:00  ...    12044  3.399700e+07
11  2115385566  2019-06-27T12:07:58+00:00  ...    11393  4.490000e+07
12  1732767131  2019-06-27T09:22:24+00:00  ...       38  3.252400e+02
13    93204128  2019-06-26T20:47:01+00:00  ...    11198  3.600000e+07
14    90216786  2019-06-25T23:51:48+00:00  ...    11172  3.600000e+07
15    91205905  2019-06-25T19:59:21+00:00  ...    16275  6.000000e+02
16  2113996003  2019-06-25T16:52:14+00:00  ...    11190  4.000000e+07
17    96345205  2019-06-25T16:39:49+00:00  ...    16275  6.000000e+02
18    95103814  2019-06-25T01:16:28+00:00  ...    11202  3.000000e+07
19   543983309  2019-06-24T14:05:49+00:00  ...    11172  2.741538e+07
20  2114159703  2019-06-23T21:20:04+00:00  ...       34  6.300000e+00
21  2114159703  2019-06-23T15:28:37+00:00  ...    16274  8.500000e+02
22  1872130440  2019-06-23T10:02:21+00:00  ...    11400  3.849900e+07
23  2112790910  2019-06-23T00:00:46+00:00  ...    11202  2.839450e+07
24  2115326382  2019-06-22T22:42:00+00:00  ...    11371  3.715019e+07
25    96768321  2019-06-22T17:02:14+00:00  ...    37481  8.900000e+07
26  1009077082  2019-06-21T23:35:03+00:00  ...    11379  4.200000e+07
27   755876330  2019-06-21T12:27:59+00:00  ...    11186  3.717800e+07
28  1556713165  2019-06-20T23:27:23+00:00  ...    11393  3.699800e+07
29   513171897  2019-06-19T15:58:51+00:00  ...    11381  4.381799e+07
..         ...                        ...  ...      ...           ...
15  2114159703  2019-06-23T21:20:04+00:00  ...       34  6.300000e+00
16  2114159703  2019-06-23T15:28:37+00:00  ...    16274  8.500000e+02
17  1872130440  2019-06-23T10:02:21+00:00  ...    11400  3.849900e+07
18  2112790910  2019-06-23T00:00:46+00:00  ...    11202  2.839450e+07
19  2115326382  2019-06-22T22:42:00+00:00  ...    11371  3.715019e+07
20    96768321  2019-06-22T17:02:14+00:00  ...    37481  8.900000e+07
21  1009077082  2019-06-21T23:35:03+00:00  ...    11379  4.200000e+07
22   755876330  2019-06-21T12:27:59+00:00  ...    11186  3.717800e+07
23  1556713165  2019-06-20T23:27:23+00:00  ...    11393  3.699800e+07
24   513171897  2019-06-19T15:58:51+00:00  ...    11381  4.381799e+07
25    96711003  2019-06-18T17:50:15+00:00  ...    11198  3.700000e+07
26   408059764  2019-06-18T15:36:49+00:00  ...    11172  3.500000e+07
27  1276544138  2019-06-17T21:32:47+00:00  ...    11379  4.100000e+07
28    94184713  2019-06-17T03:30:26+00:00  ...    37481  8.700000e+07
29  2113441660  2019-06-16T04:12:59+00:00  ...    37458  3.494900e+07
30   755284989  2019-06-15T19:54:44+00:00  ...    37458  3.500000e+07
31  1731319339  2019-06-13T12:00:14+00:00  ...    11379  4.200000e+07
32    96053157  2019-06-12T04:07:15+00:00  ...    37483  8.550000e+07
33  1690931127  2019-06-12T00:44:40+00:00  ...    37482  6.170000e+07
34    92812153  2019-06-11T05:23:09+00:00  ...    37460  3.650000e+07
35  2114791711  2019-06-10T16:14:59+00:00  ...    11371  4.150000e+07
36  1547875730  2019-06-10T15:22:53+00:00  ...    17887  9.999900e+02
37   227535700  2019-06-10T15:12:06+00:00  ...    16272  5.445000e+02
38    95165645  2019-06-10T06:32:52+00:00  ...    11393  5.399000e+07
39  1859791498  2019-06-10T05:35:57+00:00  ...    22460  6.200000e+07
40  2112629749  2019-06-09T15:46:46+00:00  ...     2549  1.800000e+06
41    94391975  2019-06-08T00:06:12+00:00  ...    37460  3.650000e+07
42    91521700  2019-06-07T14:11:45+00:00  ...    11393  4.999800e+07
43  1171184159  2019-06-06T18:10:19+00:00  ...    12044  3.399800e+07
44    96410073  2019-06-05T17:32:01+00:00  ...    11371  4.700000e+07

[83 rows x 10 columns]

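I want to merge two different DataFrames and eliminate the duplicated rows. If the rows end up out of date order compared with what was saved, I hope I can still use them. At the moment, however, I cannot eliminate any of the duplicates.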

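1 Answer:

Answer 0: (score: 0)

Choose which columns to compare. For example, if you do not care whether client_id differs, leave it out of the comparison. I would do it like this: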

    # Choose all columns except "client_id"
    # (`kimera` is the concatenated DataFrame from the question)
    cols_to_compare = list(kimera.columns.difference(["client_id"]))

    # Drop rows that are duplicated on that subset of columns
    kimera.drop_duplicates(subset=cols_to_compare, keep='first', inplace=True)
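
To try the idea without the original data, here is a minimal sketch. The two small frames are hypothetical stand-ins for top_slice and bottom_slice, and the column values are made up. It shows that drop_duplicates compares only the columns listed in `subset` and never the index labels:

```python
import pandas as pd

# Hypothetical stand-ins for top_slice and bottom_slice from the question.
top_slice = pd.DataFrame({
    "client_id": [1, 2, 3],
    "date": ["2019-06-30", "2019-06-29", "2019-06-27"],
    "unit_price": [10.0, 20.0, 30.0],
})
bottom_slice = pd.DataFrame(
    {
        "client_id": [2, 3, 4],
        "date": ["2019-06-29", "2019-06-27", "2019-06-25"],
        "unit_price": [20.0, 30.0, 40.0],
    },
    index=[5, 6, 7],  # different index labels, as in the question
)

kimera = pd.concat([top_slice, bottom_slice])

# Compare every column except client_id.
cols_to_compare = list(kimera.columns.difference(["client_id"]))

# Returns a new frame; the first occurrence of each duplicate is kept.
deduped = kimera.drop_duplicates(subset=cols_to_compare, keep="first")
print(deduped)
```

Here the two overlapping rows from bottom_slice are dropped even though their index labels differ from the ones in top_slice.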

Does this work for you?
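
If it does not, one possible cause is that the two frames store the same values with different dtypes. This is only a guess, since the question does not show the dtypes. The filtering of 'Unnamed' columns suggests old_data was read back from a file, and a round trip like that can turn numbers into strings or slightly change floats. The sketch below uses made-up data to show how such a mismatch hides duplicates and how aligning the dtypes before concatenating exposes them:

```python
import pandas as pd

# Hypothetical data: the same two rows, but old_data holds type_id as strings
# while new_data holds it as integers.
old_data = pd.DataFrame({"type_id": ["11186", "12044"],
                         "unit_price": [37177999.84, 33996998.0]})
new_data = pd.DataFrame({"type_id": [11186, 12044],
                         "unit_price": [37177999.84, 33996998.0]})

combined = pd.concat([new_data, old_data])
# Nothing is flagged, because the integer 11186 is not equal to the string "11186".
print(combined.duplicated().any())

# Put the dtypes side by side to spot the mismatch.
print(pd.concat([new_data.dtypes, old_data.dtypes], axis=1, keys=["new", "old"]))

# Cast old_data to the dtypes of new_data, then concatenate again.
old_fixed = old_data.astype(new_data.dtypes.to_dict())
fixed = pd.concat([new_data, old_fixed], ignore_index=True)
print(fixed.duplicated().sum())
```

Comparing `top_slice.dtypes` with `bottom_slice.dtypes` on the real data would confirm or rule out this explanation.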