Python - Remove duplicates in a dataframe based on the combination of two columns?

Time: 2018-07-05 01:10:40

Tags: python pandas sorting dataframe

I have a dataframe in Python with 3 columns:

Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1

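and I would like to eliminate duplicates based on the combination of the Name1 and Name2 columns.

In my example both rows are equal (just in a different order). I would like to delete the second row and keep the first one, so the end result should be: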

Name1 Name2 Value
Juan  Ale   1

Any ideas would be greatly appreciated!

3 Answers:

Answer 0 (score: 19)

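Use np.sort together with duplicated. Note that the expression below selects the row flagged as a duplicate; prefix the mask with ~ to drop that row and keep the first occurrence instead.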

df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
Out[614]: 
  Name1 Name2  Value
1   Ale  Juan      1
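
As a self-contained sketch of the same idea, the snippet below rebuilds the two-row frame from the question, sorts each (Name1, Name2) pair so that order no longer matters, and negates the mask to keep the first occurrence. The variable names `sorted_pairs` and `is_dup` and the `.to_numpy()` conversion of the mask are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data as described in the question.
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# Sort the two names within each row so that (Juan, Ale) and
# (Ale, Juan) become the same pair.
sorted_pairs = np.sort(df[["Name1", "Name2"]].to_numpy(), axis=1)

# duplicated() flags every repeat after the first occurrence.
# Converting the mask to a plain array applies it by position, so it
# does not rely on df having a default 0..n-1 index.
is_dup = pd.DataFrame(sorted_pairs).duplicated().to_numpy()

# Negate the mask to keep only the first occurrence of each pair.
result = df[~is_dup]
print(result)
```

Applying the mask by position is a design choice in this sketch: a boolean Series with a fresh index would otherwise have to align with the index of `df`.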

Performance

df=pd.concat([df]*100000)

%timeit df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
10 loops, best of 3: 69.3 ms per loop
%timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
1 loop, best of 3: 3.72 s per loop
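
To reproduce this kind of comparison outside IPython, one possible sketch uses the standard `timeit` module. The helper names `by_sort` and `by_frozenset`, the smaller repetition factor, and `ignore_index=True` are assumptions made here to keep the run short and the index unique; absolute timings will differ from the figures above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"Name1": ["Juan", "Ale"],
                     "Name2": ["Ale", "Juan"],
                     "Value": [1, 1]})

# ignore_index=True gives the enlarged frame a fresh 0..n-1 index
# (an addition to the original benchmark setup).
big = pd.concat([base] * 1000, ignore_index=True)


def by_sort(frame):
    """Drop order-insensitive duplicates by sorting each name pair."""
    pairs = np.sort(frame[["Name1", "Name2"]].to_numpy(), axis=1)
    return frame[~pd.DataFrame(pairs).duplicated().to_numpy()]


def by_frozenset(frame):
    """Drop order-insensitive duplicates via a frozenset per row."""
    keys = frame[["Name1", "Name2"]].apply(frozenset, axis=1)
    return frame[~keys.duplicated()]


# Both approaches should keep exactly the same rows.
same = by_sort(big).equals(by_frozenset(big))

t_sort = timeit.timeit(lambda: by_sort(big), number=3)
t_set = timeit.timeit(lambda: by_frozenset(big), number=3)
print(f"same result: {same}")
print(f"sort: {t_sort:.4f}s  frozenset: {t_set:.4f}s")
```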

Answer 1 (score: 18)

You can convert each row to a frozenset and use pd.DataFrame.duplicated:

res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]

print(res)

  Name1 Name2  Value
0  Juan   Ale      1

frozenset is required instead of set because duplicated uses hashing to check for duplicates, and a mutable set is not hashable.

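This scales better with the number of columns than with the number of rows. For a large number of rows, use @Wen's sort-based algorithm.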
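
The hashing point can be checked directly. The following sketch, which assumes the two-row frame from the question, shows that both rows map to the same frozenset key while a plain set cannot be hashed at all. The names `keys` and `set_is_hashable` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# frozenset ignores order, so both rows produce the same hashable key.
keys = df[["Name1", "Name2"]].apply(frozenset, axis=1)
keys_match = keys.iloc[0] == keys.iloc[1]

# A plain set is mutable and therefore unhashable.
try:
    hash({"Juan", "Ale"})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# Keep the first row of each order-insensitive pair.
res = df[~keys.duplicated()]
print(res)
```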

Answer 2 (score: 4)

I know I'm a bit late to this question, but here is my contribution anyway :)

You can also use get_dummies and add as a nice way of creating hashable rows:

df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
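
For readers who want to run this, here is a minimal sketch that assumes the two name columns are called `a` and `b`, as in the one-liner above. Passing `dtype=int` and the name `counts` are choices made for this illustration so that the sum is an explicit per-row count of each name.

```python
import pandas as pd

# Same data as the question, with the name columns assumed to be a and b.
df = pd.DataFrame({"a": ["Juan", "Ale"],
                   "b": ["Ale", "Juan"],
                   "Value": [1, 1]})

# One indicator column per distinct name. Adding the two encodings
# gives, for every row, how often each name occurs, regardless of
# which of the two columns it came from.
counts = pd.get_dummies(df["a"], dtype=int).add(
    pd.get_dummies(df["b"], dtype=int), fill_value=0)

# Rows with identical counts contain the same pair of names.
res = df[~counts.duplicated()]
print(res)
```

Because the encoding creates one column per distinct name, the intermediate frame can become wide when the columns hold many different values.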

The timings are not as good as @Wen's answer, but still faster than apply + frozenset:

df=pd.concat([df]*1000000)
%timeit df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
1.8 s ± 85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[pd.DataFrame(np.sort(df[['a','b']].values,1)).duplicated()]
1.26 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~df[['a', 'b']].apply(frozenset, axis=1).duplicated()]
1min 9s ± 684 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)