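I have a simple Excel task to do today, and I thought I'd use some Python for the cleanup. That got me thinking about pandas and numpy. Here is what I'd like to know, if it's possible:
I have these columns and about 5k rows:
First Name | Last Name | Email | Address | City
I want to remove duplicates across Address and City, but not every row has an email or a last name. So I want to go through the rows and drop the ones that have no email address, keeping the ones that do.
However, I have some duplicate rows that may share the same last name but have no email, so I want to make sure at least one of them is kept, or insert NaN or something else into the Email field so that at least one row survives.
I guess in pseudocode it would be something like this: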
1. if Last Name & Address & City is a duplicate & Email Address on both rows is blank
       insert a variable into one of the rows' Email Address field
2. if Address & City is a duplicate
       remove the row that does not have an e-mail address assigned to it
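I've got it working by going in manually and doing step 1, which, as you can guess, is no fun, lol. So I'm wondering whether this can be done with pandas.
Here is the sample data: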
df = pd.DataFrame({
    "First Name": ["Bob", "Ken", "Bobs Business", "Daniel", "Wendy", "Kyle"],
    "Last Name": ["Arnold", "Arnold", "", "Amigo", "Amigo", "Zecke"],
    "Email": ["", "", "Bb@bobsbusiness.com", "amigo@amigo.com", "", "k@zecke.com"],
    "Address": ["123 Street", "123 Street", "123 Street", "5 Street", "5 Street", "5 Street"],
    "City": ["Boston", "Boston", "Boston", "Concord", "Concord", "Denver"]
})
Expected output:
First Name     Last Name  Email                Address     City
Ken            Arnold                          123 Street  Boston
Bobs Business             bb@bobsbusiness.com  123 Street  Boston
Daniel         Amigo      amigo@amigo.com      5 Street    Concord
Kyle           Zecke      k@zecke.com          5 Street    Denver
Thanks for any help, or for pointing me in the right direction! :)
Answer 0 (score: 1)
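First of all, you should provide sample data so that we can easily test code against your data. I think you have to do two things: sort the rows so that, within each group of Last Name, Address and City, the rows that have an email come last, and then drop the duplicates while keeping the last row of each group.
You have to check whether your blanks are None values or empty strings, because the two behave differently when sorting; you may have to change keep to "first".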
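The difference between the two kinds of blank is easy to see on a tiny, made-up frame. This is only a sketch: the frame names and the address `x@example.com` are placeholders, not data from the question.

```python
import pandas as pd

# Hypothetical two-row frames: one row has an email, the other is blank.
blank_as_empty = pd.DataFrame({"Email": ["x@example.com", ""]})
blank_as_none = pd.DataFrame({"Email": ["x@example.com", None]})

# An empty string sorts before any non-empty string,
# so the blank row ends up first.
print(blank_as_empty.sort_values(by="Email"))

# None is treated as a missing value and is placed last by default
# (na_position="last"), so the blank row ends up last.
print(blank_as_none.sort_values(by="Email"))
```

With empty strings, keep="last" keeps the row that has an email; with None, the blank row is the last one, so either keep="first" or na_position="first" would be needed.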
import pandas as pd


def run():
    df = pd.DataFrame({
        "First Name": ["John", "", "Jane", ""],
        "Last Name": ["Last1", "Last2", "Last3", "Last3"],
        "Email": ["", "Email2", None, "Email4"],
        "Address": ["Address1", "Address1", "Address2", "Address2"],
        "City": ["City1", "City1", "City2", "City2"]
    })
    print(df)
    print()
    # Sort so that, within each (Last Name, Address, City) group, rows with an
    # empty-string Email come before rows with an email (None sorts last instead).
    df.sort_values(by=["Last Name", "Address", "City", "Email"], inplace=True)
    # Keep only the last row of each (Last Name, Address, City) group.
    df.drop_duplicates(subset=["Last Name", "Address", "City"], keep="last", inplace=True)
    print(df)


if __name__ == '__main__':
    run()
Output (using the sample data from the question):
      First Name  Last Name                Email     Address     City
0            Bob     Arnold                       123 Street   Boston
1            Ken     Arnold                       123 Street   Boston
2  Bobs Business             Bb@bobsbusiness.com  123 Street   Boston
3         Daniel      Amigo      amigo@amigo.com    5 Street  Concord
4          Wendy      Amigo                         5 Street  Concord
5           Kyle      Zecke          k@zecke.com    5 Street   Denver

      First Name  Last Name                Email     Address     City
2  Bobs Business             Bb@bobsbusiness.com  123 Street   Boston
3         Daniel      Amigo      amigo@amigo.com    5 Street  Concord
1            Ken     Arnold                       123 Street   Boston
5           Kyle      Zecke          k@zecke.com    5 Street   Denver
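If you would rather not depend on how None and empty strings sort, one possible variant is to sort on an explicit "has an email" flag instead of on the Email column itself. This is a minimal sketch, not the only way to do it: the function name `keep_rows_with_email` and the helper column `_has_email` are made up for illustration, and it assumes a blank email is None, NaN, or an empty or whitespace-only string.

```python
import pandas as pd


def keep_rows_with_email(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (Last Name, Address, City), preferring rows with an email.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    # Treat None, NaN and blank or whitespace-only strings as "no email".
    out["_has_email"] = out["Email"].fillna("").astype(str).str.strip().ne("")
    # Stable sort on the flag: rows without an email move to the front,
    # and ties keep their original relative order.
    out = out.sort_values(by="_has_email", kind="mergesort")
    # The last row of each group now has an email whenever any row in it has one.
    out = out.drop_duplicates(subset=["Last Name", "Address", "City"], keep="last")
    # Drop the helper column and restore the original row order.
    return out.drop(columns="_has_email").sort_index()


# Sample data from the question.
df = pd.DataFrame({
    "First Name": ["Bob", "Ken", "Bobs Business", "Daniel", "Wendy", "Kyle"],
    "Last Name": ["Arnold", "Arnold", "", "Amigo", "Amigo", "Zecke"],
    "Email": ["", "", "Bb@bobsbusiness.com", "amigo@amigo.com", "", "k@zecke.com"],
    "Address": ["123 Street", "123 Street", "123 Street", "5 Street", "5 Street", "5 Street"],
    "City": ["Boston", "Boston", "Boston", "Concord", "Concord", "Denver"]
})
result = keep_rows_with_email(df)
print(result)
```

On the question's sample data this keeps the rows for Ken, Bobs Business, Daniel and Kyle, in their original order, and it gives the same rows if the blank emails are None instead of empty strings.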