How do I extract the duplicated values in each column separately?

Time: 2019-05-30 10:39:13

Tags: python r pandas csv duplicates

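I want to extract, separately for each column, only the values that occur two or more times in that column, and write them to separate files together with the column header.

Sample file (the actual CSV file is 1.5 GB; only an excerpt is shown here). The first row is the header row, with one title per column.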

AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3

I tried writing code for this in R and even in Python with pandas, but could not get the result.

Expected result:

AO1     BO1     CO1     DO1     EO1     FO1
pep2    xcv4    iop3    typ3    ert3    rtf5
pep2    xcv4    iop3    typ3    ert3    rtf5
pep2    xcv4            typ3            rtf5
                        wer3            rtf5
                        wer3            rtf5

2 answers:

Answer 0: (score: 0)

import pandas as pd
from io import StringIO  # Python 3; the Python 2 `StringIO` module no longer exists

df = pd.read_csv(StringIO("""AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3"""))

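d = {}

for col in df.columns:
    # Count each value once, then keep the values that occur at least twice.
    counts = df[col].value_counts()
    repeated_values = counts[counts >= 2].index.tolist()
    # Select every row of this column that holds one of those values.
    cond = df[col].isin(repeated_values)
    d[col] = df.loc[cond, col]

# One column per original column, aligned on the original row index.
final = pd.concat(d, axis=1)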
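
Because `pd.concat(d, axis=1)` aligns the columns on the original row index, the repeated values stay on their original rows with `NaN` gaps in between, rather than being packed to the top as in the expected result. A minimal sketch of one way to get the packed layout follows. The function name `repeated_values_packed` is made up for illustration, and sorting the values so that identical ones sit together is an assumption about the ordering the question wants.

```python
from io import StringIO

import pandas as pd

CSV_TEXT = """AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3"""


def repeated_values_packed(df):
    """Hypothetical helper: per column, keep values seen 2+ times, packed to the top."""
    packed = {}
    for col in df.columns:
        s = df[col].dropna()
        # keep=False flags every occurrence of a repeated value, including the first.
        repeated = s[s.duplicated(keep=False)]
        # Assumption: group identical values together by sorting them.
        repeated = repeated.sort_values(kind="stable")
        # Drop the original row labels so each column starts at row 0.
        packed[col] = repeated.reset_index(drop=True)
    # Shorter columns are padded with NaN at the bottom.
    return pd.concat(packed, axis=1)


df = pd.read_csv(StringIO(CSV_TEXT))
final = repeated_values_packed(df)
print(final)
```

Calling `final.to_csv(...)` with `index=False` would then write the packed table with its header row; the output path is left to the reader.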

Answer 1: (score: 0)

df <- data.table::fread('AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3'
                  , data.table = FALSE)

# Keep every occurrence of a value that appears more than once in its column.
lapply(df, function(x) x[duplicated(x) | duplicated(x, fromLast = TRUE)])

You can also write a CSV file directly inside the lapply call.
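
For readers following the pandas route instead, the same idea of writing one file per column inside the loop can be sketched in Python as below. In pandas, `Series.duplicated(keep=False)` plays the role of the two-sided `duplicated` expression in the R line above. The function name `write_repeated_per_column` and the `<column>_duplicates.csv` naming scheme are assumptions made for illustration; neither comes from the question or the answers.

```python
from pathlib import Path

import pandas as pd


def write_repeated_per_column(df, out_dir):
    """Hypothetical helper: write one CSV per column with its repeated values.

    Each file starts with the column header and lists every occurrence
    of any value that appears two or more times in that column.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for col in df.columns:
        s = df[col].dropna()
        # keep=False marks all occurrences of a repeated value, not only the later ones.
        repeated = s[s.duplicated(keep=False)]
        path = out_dir / f"{col}_duplicates.csv"  # assumed naming scheme
        repeated.to_csv(path, index=False, header=True)
        written[col] = path
    return written
```

For the 1.5 GB file, one possible variation is to load a single column at a time with `pd.read_csv(path, usecols=[col])` and write its file before moving on, so that only one column is held in memory; whether that is fast enough would need to be checked on the real data.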