I have a data frame:
id time
Uk6 year
36h year
Uk6 two-year
rf5 month
gg7 year
rf5 half-year
I need to remove duplicates based on the "id" column and replace the time value of the duplicated rows with "unknown". The result should be:
id time
Uk6 unknown
36h year
rf5 unknown
gg7 year
I tried the answers suggested for earlier questions (like_this), but they did not work.
Answer 0: (score: 3)
Try the following:
import pandas as pd

# create the dataframe
df = pd.DataFrame(data={'id': ['Uk6', '36h', 'Uk6', 'rf5', 'gg7', 'rf5'],
                        'time': ['year', 'year', 'two-year', 'month', 'year', 'half-year']})
# get the ids that occur more than once (duplicated() flags the second and later occurrences)
dups_id = df[df.duplicated(subset='id')]['id']
# keep only the first row for each id
df = df.drop_duplicates(subset='id')
# replace 'time' with 'unknown' for the rows whose id had duplicates
df.loc[:, 'time'] = df['time'].where(~df['id'].isin(dups_id), other='unknown')
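Answer 1: (score: 2)
You can first get the index of the duplicated rows, then replace the corresponding time values with unknown, and finally drop the duplicates: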
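import pandas as pd

df = pd.DataFrame({'id': ["Uk6", "36h", "Uk6", "rf5", "gg7", "rf5"],
                   'time': ["year", "year", "two-year", "month", "year", "half-year"]})
# keep=False flags every row whose id occurs more than once
mask = df.duplicated(subset='id', keep=False)
# assign through .loc; chained indexing (df['time'][mask] = ...) may not modify df
df.loc[mask, 'time'] = "unknown"
# keep the first row for each id
df = df.drop_duplicates('id')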
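The two answers so far call duplicated() differently, and the difference is easy to miss. As a small illustration on the question's data (the variable names later_only and all_dups are made up for this sketch), the default flags only the second and later occurrences, while keep=False flags every row of a repeated id:

```python
import pandas as pd

df = pd.DataFrame({'id': ['Uk6', '36h', 'Uk6', 'rf5', 'gg7', 'rf5'],
                   'time': ['year', 'year', 'two-year', 'month', 'year', 'half-year']})

# Default (keep='first'): only the second and later occurrences are flagged.
later_only = df.duplicated(subset='id')
# keep=False: every row whose id occurs more than once is flagged.
all_dups = df.duplicated(subset='id', keep=False)

print(later_only.tolist())  # [False, False, True, False, False, True]
print(all_dups.tolist())    # [True, False, True, True, False, True]
```

This is why a mask built with keep=False can be used directly to overwrite time before dropping rows, whereas the default mask is better suited to collecting the repeated ids first.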
Answer 2: (score: 1)
Use loc to set time to unknown for the duplicated ids, then drop the duplicates:
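# flag every row whose id occurs more than once and overwrite its time
df.loc[df.id.duplicated(keep=False), 'time'] = 'unknown'
# rows sharing an id are now identical, so drop_duplicates() collapses them;
# if the frame had other columns, drop_duplicates(subset='id') would be needed
df = df.drop_duplicates()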
Output:
    id     time
0  Uk6  unknown
1  36h     year
3  rf5  unknown
4  gg7     year
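If the frame may carry more columns than id and time, the same idea can be wrapped in a small function. This is only a sketch: the helper name mark_and_drop_duplicates and its parameters are hypothetical, not something from the answers above. It works on a copy and passes subset to drop_duplicates so that differences in other columns cannot keep a repeated id alive:

```python
import pandas as pd


def mark_and_drop_duplicates(df, key='id', col='time', fill='unknown'):
    """Hypothetical helper: keep one row per key, overwriting col where the key repeats."""
    out = df.copy()  # leave the caller's frame untouched
    repeated = out.duplicated(subset=key, keep=False)
    out.loc[repeated, col] = fill
    # subset=key so that other columns cannot keep a repeated key alive
    return out.drop_duplicates(subset=key)


df = pd.DataFrame({'id': ['Uk6', '36h', 'Uk6', 'rf5', 'gg7', 'rf5'],
                   'time': ['year', 'year', 'two-year', 'month', 'year', 'half-year']})
result = mark_and_drop_duplicates(df)
print(result)
```

The original df is left unchanged, and result keeps the first row of each id in its original order.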