Identifying duplicate records in pandas

Time: 2017-11-27 13:50:22

Tags: pandas

I have two TSV files, shown below.

TSV file 1

id    ingredients    recipe
code1  egg, butter   beat eggs. add butter
code2  tim tam, butter  beat tim tam. add butter
code3  coffee, sugar   add coffee and sugar and mix
code4  sugar, milk   beat sugar and milk together

TSV file 2

id    ingredients    recipe
c009  apple, milk     add apples to milk
c110  coffee, sugar   add coffee and sugar and mix
c111  egg, butter   add egg, butter and sugar
c112  tim tam, sugar  beat tim tam. add butter

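I want to delete entries from the TSV files if:

  1. they share ingredients (e.g. code3 and c110)
  2. they share a recipe (e.g. code2 and c112)

For the example above, the output for the two TSV files should look like this.

TSV file 1

id    ingredients    recipe
code4  sugar, milk   beat sugar and milk together

TSV file 2

id    ingredients    recipe
c009  apple, milk     add apples to milk

Can we do this with pandas? Please help me!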

1 answer:

Answer 0 (score: 1)

You can read the TSV files with pd.read_csv:

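import pandas as pd

# tsv_file_1 and tsv_file_2 hold the paths to the two files.
# The pasted sample separates columns with two or more spaces, so split on that.
# A regex separator needs the Python parser engine; for tab-delimited files use sep='\t'.
df1 = pd.read_csv(tsv_file_1, sep=r'\s\s+', engine='python')
df2 = pd.read_csv(tsv_file_2, sep=r'\s\s+', engine='python')

# Strip stray spaces from the column names
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()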
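If the files really are tab-delimited, rather than space-aligned like the pasted sample, splitting on the tab character is probably the safer choice, because it cannot break a field that happens to contain two spaces. The following is a minimal sketch under that assumption; the in-memory `tsv_text` is a hypothetical stand-in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-delimited content standing in for a real file on disk.
tsv_text = (
    "id\tingredients\trecipe\n"
    "code1\tegg, butter\tbeat eggs. add butter\n"
    "code4\tsugar, milk\tbeat sugar and milk together\n"
)

# sep="\t" splits on tabs only, so commas and spaces stay inside a field.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Strip stray whitespace from the column names, as in the answer.
df.columns = df.columns.str.strip()

print(df)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.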

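Next, use isin together with ~ (the NOT operator) to keep only the rows whose ingredients and recipe both do not appear in the other file. Filtering on ingredients alone would leave code2 and c112 in place, because they share only a recipe:

df1_new = df1[~df1.ingredients.isin(df2.ingredients) & ~df1.recipe.isin(df2.recipe)]
df2_new = df2[~df2.ingredients.isin(df1.ingredients) & ~df2.recipe.isin(df1.recipe)]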

print(df1_new)

      id  ingredients                        recipe
3  code4  sugar, milk  beat sugar and milk together

print(df2_new)

     id  ingredients              recipe
0  c009  apple, milk  add apples to milk
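For anyone who wants to try this without the files, here is a self-contained sketch that rebuilds the question's sample data in memory and drops every row that shares either its ingredients or its recipe with the other table. The helper name `drop_shared` is made up for this illustration and is not part of pandas.

```python
import pandas as pd


def drop_shared(left, right, cols=("ingredients", "recipe")):
    """Return the rows of `left` whose value in every column of `cols`
    is absent from the same column of `right`."""
    shared = pd.Series(False, index=left.index)
    for col in cols:
        # Mark a row as shared if this column's value occurs anywhere in `right`.
        shared |= left[col].isin(right[col])
    return left[~shared]


df1 = pd.DataFrame({
    "id": ["code1", "code2", "code3", "code4"],
    "ingredients": ["egg, butter", "tim tam, butter", "coffee, sugar", "sugar, milk"],
    "recipe": [
        "beat eggs. add butter",
        "beat tim tam. add butter",
        "add coffee and sugar and mix",
        "beat sugar and milk together",
    ],
})

df2 = pd.DataFrame({
    "id": ["c009", "c110", "c111", "c112"],
    "ingredients": ["apple, milk", "coffee, sugar", "egg, butter", "tim tam, sugar"],
    "recipe": [
        "add apples to milk",
        "add coffee and sugar and mix",
        "add egg, butter and sugar",
        "beat tim tam. add butter",
    ],
})

df1_new = drop_shared(df1, df2)
df2_new = drop_shared(df2, df1)

print(df1_new["id"].tolist())  # ['code4']
print(df2_new["id"].tolist())  # ['c009']
```

Note that the comparison is an exact string match, so values that differ only in spacing or capitalization would not be treated as shared.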