Dropping duplicates based on a text version column

Time: 2018-05-21 06:30:28

Tags: python pandas dataframe text

This is my data. For each Id, I want to keep only the row with the latest version.

Id       Score     Version
1           67     One
1           89     Three
2           78     Two
2           70     One

This is what I want, because Three > Two > One:

Id       Score     Version
1           89     Three
2           78     Two

What I did:

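# Pull the version word out of the text column
versions = df['Version'].str.extract('(One|Two|Three)', expand=False)
# One indicator column per version
dummies = pd.get_dummies(versions)
df = pd.concat([df, dummies], axis=1)
# Weighted sum turns the indicators into a numeric rank: One=1, Two=2, Three=3
df['versions'] = df['One']*1 + df['Two']*2 + df['Three']*3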

Then I filter on the maximum value per Id, but I am looking for a better solution.
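
The filtering step itself is not shown in the question. A minimal sketch of how it might look, assuming the sample data above with a `Version` column; the `groupby`/`transform` comparison and the name `latest` are illustrative assumptions, not necessarily what the asker used:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Score": [67, 89, 78, 70],
    "Version": ["One", "Three", "Two", "One"],
})

# Numeric rank built from indicator columns, as in the question
versions = df["Version"].str.extract("(One|Two|Three)", expand=False)
dummies = pd.get_dummies(versions)
df["versions"] = dummies["One"] * 1 + dummies["Two"] * 2 + dummies["Three"] * 3

# Assumed filtering step: keep rows whose rank equals the maximum rank of their Id
latest = df[df["versions"] == df.groupby("Id")["versions"].transform("max")]
print(latest)
```

If two rows of the same Id shared the highest version, this comparison would keep both of them.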

1 Answer:

Answer 0 (score: 1)

You can map the version names to numbers, sort, and then drop duplicates:

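import pandas as pd

df = pd.DataFrame([[1, 67, 'one'], [1, 89, 'three'],
                   [2, 78, 'two'], [2, 70, 'one']],
                  columns=['Id', 'Score', 'Version'])
# Map each version name to a sortable number
d = {'one': 1, 'two': 2, 'three': 3}
df['vers'] = df['Version'].map(d)
# Highest version first, keep the first row per Id, then restore the original row order
df = df.sort_values('vers', ascending=False).drop_duplicates('Id').sort_index()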

Output:

   Id  Score Version  vers
1   1     89   three     3
2   2     78     two     2
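
If the extra `vers` column is unwanted, the same map, sort, and drop-duplicates idea can probably be expressed with an ordered categorical instead. This is a sketch of an alternative, not part of the original answer; the names `order` and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 67, "one"], [1, 89, "three"], [2, 78, "two"], [2, 70, "one"]],
    columns=["Id", "Score", "Version"],
)

# Declare the ordering one < two < three on the column itself
order = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
df["Version"] = df["Version"].astype(order)

# Sorting now follows the declared category order rather than the alphabet
result = (
    df.sort_values("Version", ascending=False)
      .drop_duplicates("Id")
      .sort_index()
)
print(result)
```

A value that is not listed in the categories becomes NaN when the column is converted, so the category list has to cover every version name that occurs in the data.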