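Some columns in my DataFrame contain groups of repeated values. Within each group, I want to keep only the first occurrence in those columns and leave the rest blank.
I tried df = df.groupby(['author', 'key']), but I don't know how to get all the rows back properly. When I use df.first(), only the first row of each group is printed.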
import pandas as pd
lst = [
['juli', 'JIRA-1', 'assignee'],
['juli', 'JIRA-1', 'assignee'],
['nick', 'JIRA-1', 'timespent'],
['nick', 'JIRA-3', 'status'],
['nick', 'JIRA-3', 'assignee'],
['tom', 'JIRA-1', 'comment'],
['tom', 'JIRA-1', 'assignee'],
['tom', 'JIRA-2', 'status']]
df = pd.DataFrame(lst, columns =['author', 'key', 'field'])
#df = df.sort_values(by=['author', 'key'])
>>> df
  author     key      field
0   juli  JIRA-1   assignee
1   juli  JIRA-1   assignee
2   nick  JIRA-1  timespent
3   nick  JIRA-3     status
4   nick  JIRA-3   assignee
5    tom  JIRA-1    comment
6    tom  JIRA-1   assignee
7    tom  JIRA-2     status
What I get:
>>> df.groupby(['author', 'key']).first()
                   field
author key
juli   JIRA-1   assignee
nick   JIRA-1  timespent
       JIRA-3     status
tom    JIRA-1    comment
       JIRA-2     status
What I want:
juli   JIRA-1   assignee
                assignee
nick   JIRA-1  timespent
       JIRA-3     status
                assignee
tom    JIRA-1    comment
                assignee
       JIRA-2     status
Answer 0 (score: 1)
It looks like you need df.duplicated() to find the repeated rows and df.loc[] to assign empty strings to them:
df.loc[df.duplicated(['author','key']),['author','key']]=''
print(df)
  author     key      field
0   juli  JIRA-1   assignee
1                  assignee
2   nick  JIRA-1  timespent
3   nick  JIRA-3     status
4                  assignee
5    tom  JIRA-1    comment
6                  assignee
7    tom  JIRA-2     status
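For reference, here is a self-contained sketch of the same approach that can be run as is. It relies on duplicated() defaulting to keep='first', so only the repeats after the first occurrence of each (author, key) pair are flagged. The names rows, is_repeat and blanked are illustrative and not from the original post, and working on a copy is an assumption about what you want, made so that the original values are not overwritten.

```python
import pandas as pd

rows = [
    ['juli', 'JIRA-1', 'assignee'],
    ['juli', 'JIRA-1', 'assignee'],
    ['nick', 'JIRA-1', 'timespent'],
    ['nick', 'JIRA-3', 'status'],
    ['nick', 'JIRA-3', 'assignee'],
    ['tom', 'JIRA-1', 'comment'],
    ['tom', 'JIRA-1', 'assignee'],
    ['tom', 'JIRA-2', 'status'],
]
df = pd.DataFrame(rows, columns=['author', 'key', 'field'])

# duplicated() defaults to keep='first': it is True for every row whose
# (author, key) pair has already appeared earlier in the frame.
is_repeat = df.duplicated(['author', 'key'])

# Blank the repeated labels on a copy, leaving df itself unchanged.
blanked = df.copy()
blanked.loc[is_repeat, ['author', 'key']] = ''

print(blanked)
```

Note that the blanked cells are real empty strings, so the blanked frame is probably best used only for display or export, with the original df kept for any further grouping or filtering.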