Why doesn't this deduplication code work when the operations are chained together?

Time: 2019-07-10 12:41:32

Tags: python pandas

I want to select the duplicates in this DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname':['stack','Bar Bar',np.nan,'Bar Bar','john','mary','jim'],
                   'lastname':['jim','Bar','Foo Bar','Bar','con','sullivan','Ryan'],
                   'email':[np.nan,'Bar','Foo Bar','Bar','john@com','mary@com','Jim@com']})

print(df)

  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

This approach seems to work:

df = df.dropna(subset=['firstname', 'lastname', 'email'])

df = df[df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False)]

print(df)

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar

If I chain the operations, it doesn't work:

dupes = (df.dropna(subset=['firstname', 'lastname', 'email'])
                 .duplicated(subset=['firstname', 'lastname', 'email'], keep=False))

df = df[dupes]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)

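Should I generally stay away from chaining like this and keep things simple? What is going on here?

2 Answers:

Answer 0: (score: 2)

This is the behavior I would expect.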

The problem with the second approach is that you filter the original DataFrame with a mask computed from the already-filtered data. The mask's index therefore differs from the original index, and the error is raised:

print(df)
  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

dupes = (df.dropna(subset=['firstname', 'lastname', 'email'])
                 .duplicated(subset=['firstname', 'lastname', 'email'], keep=False))

print(dupes)
1     True
3     True
4    False
5    False
6    False
dtype: bool

A possible solution is to use Series.reindex to align the mask with the original index. In the first example, by contrast, you filter the already-filtered data, so the indexes are the same and it works:

df = df.dropna(subset=['firstname', 'lastname', 'email'])
print(df)
  firstname  lastname     email
1   Bar Bar       Bar       Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

print(df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
1     True
3     True
4    False
5    False
6    False
dtype: bool


df = df[df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False)]
print(df)
  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar
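
The Series.reindex fix mentioned in this answer can be sketched roughly as follows. This is only a minimal illustration: the names `cols`, `mask` and `result` are mine, not from the question, and `fill_value=False` assumes that rows dropped by `dropna` should simply count as non-duplicates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com']})

# Illustrative helper name for the columns used in both steps.
cols = ['firstname', 'lastname', 'email']

# The chained mask only has labels 1, 3, 4, 5, 6 (rows 0 and 2 were dropped).
dupes = df.dropna(subset=cols).duplicated(subset=cols, keep=False)

# Align the mask with the original index; the missing labels become False.
mask = dupes.reindex(df.index, fill_value=False)

result = df[mask]
print(result)
```

Because `mask` now has exactly the same index as `df`, boolean indexing no longer has anything to align and the IndexingError should not occur.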

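Answer 1: (score: 1)

In the first example you update the DataFrame by reassigning it. If you print the DataFrame after dropna, you can see that its index has changed: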

df = df.dropna(subset=['firstname', 'lastname', 'email'])
print(df)

  firstname  lastname     email
1   Bar Bar       Bar       Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

The problem with the chained operation is that the DataFrame's index is unchanged, while the dupes Series has fewer rows.

dupes =  df.dropna(subset=['firstname', 'lastname', 'email']).duplicated(subset=['firstname', 'lastname', 'email'], keep=False)
print(dupes)
print(df)

1     True
3     True
4    False
5    False
6    False
dtype: bool

  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

When you try to select rows from the DataFrame by indexing with the dupes Series, you get the error because the two indexes do not match.
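
If you would still like to keep everything in one chain, one option (a sketch, not the only way) is to pass a callable to `.loc`. The callable receives the intermediate, already-filtered DataFrame, so the mask is built from the same object it is applied to. The names `cols` and `result` are illustrative and do not appear in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com']})

# Illustrative helper name for the columns used in both steps.
cols = ['firstname', 'lastname', 'email']

result = (
    df.dropna(subset=cols)
      # `d` is the DataFrame produced by dropna, not the original df,
      # so the boolean mask and the indexed object share the same index.
      .loc[lambda d: d.duplicated(subset=cols, keep=False)]
)
print(result)
```

So chaining itself is not the problem; the mask just has to be computed from, and applied to, the same DataFrame.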