pandas: remove duplicates from a DataFrame (per group) under certain conditions

Time: 2016-11-22 08:32:42

Tags: python string pandas group-by duplicates

Hi all, I have a DataFrame whose contents look like this:

name,mv_str
abc,Exorsist part1
abc,doc str 2D
abc,doc str 3D
abc,doc str QA
abc,doc flash
def,plastic
def,plastic income
def,doc str 2D   ### I added this row for clarity

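My expected output should have unique rows within each group: for each mailid (user name), the mv_str values should not be of a similar type, i.e. the first 2 words of one 'mv_str' should not appear again in a second (or any later) row for that same user name.

Note: the comparison should be done separately for each user name.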

name,mv_str
abc,Exorsist part1
abc,doc str 2D   ### 3D and QA rows are removed because the first 2 words "doc str" match
abc,doc flash   ### only the 1st word matches, the 2nd word does not
def,plastic
def,plastic income  ### should be kept, as only one word matches
def,doc str 2D   ### this row should be kept, as it belongs to another user

Could anyone please help me work out the logic? A code example would be a great help. Thanks.

1 Answer:

Answer 0: (score: 2)

I think you first need to split the strings in mv_str on whitespace and create a new DataFrame df1:

df1 = df.mv_str.str.split(expand=True)
print (df1)
          0       1     2
0  Exorsist   part1  None
1       doc     str    2D
2       doc     str    3D
3       doc     str    QA
4       doc   flash  None
5   plastic    None  None
6   plastic  income  None
7       doc     str    2D

Then add it to the original df with concat:

df = pd.concat([df, df1], axis=1)
print (df)
   name          mv_str         0       1     2
0   abc  Exorsist part1  Exorsist   part1  None
1   abc      doc str 2D       doc     str    2D
2   abc      doc str 3D       doc     str    3D
3   abc      doc str QA       doc     str    QA
4   abc       doc flash       doc   flash  None
5   def         plastic   plastic    None  None
6   def  plastic income   plastic  income  None
7   def      doc str 2D       doc     str    2D

Then call drop_duplicates on the columns name, 0 and 1; the first row of each set of duplicates is kept:

print (df.drop_duplicates(['name',0,1]))
   name          mv_str         0       1     2
0   abc  Exorsist part1  Exorsist   part1  None
1   abc      doc str 2D       doc     str    2D
4   abc       doc flash       doc   flash  None
5   def         plastic   plastic    None  None
6   def  plastic income   plastic  income  None
7   def      doc str 2D       doc     str    2D

Finally, remove the helper columns 0, 1 and 2 with drop:

print (df.drop_duplicates(['name',0,1]).drop([0,1,2], axis=1))
   name          mv_str
0   abc  Exorsist part1
1   abc      doc str 2D
4   abc       doc flash
5   def         plastic
6   def  plastic income
7   def      doc str 2D

Or, better, remove them by selecting only the name and mv_str columns:

print (df.drop_duplicates(['name',0,1])[['name','mv_str']])
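For reference, the approach in this answer can be gathered into one self-contained sketch. The helper name `drop_similar_rows` and its `n_words` parameter are hypothetical, introduced only for illustration. The sample frame is rebuilt from the question's data. Unlike the answer, this variant builds a temporary key frame and filters with `duplicated()`, so the helper columns are never added to `df`.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "name": ["abc"] * 5 + ["def"] * 3,
    "mv_str": ["Exorsist part1", "doc str 2D", "doc str 3D", "doc str QA",
               "doc flash", "plastic", "plastic income", "doc str 2D"],
})


def drop_similar_rows(frame, n_words=2):
    """Hypothetical helper: keep the first row per (name, first n_words of mv_str)."""
    # Split on whitespace; reindex guarantees exactly n_words key columns,
    # padding with missing values when a string has fewer words.
    words = frame["mv_str"].str.split(expand=True).reindex(columns=list(range(n_words)))
    # Temporary key: the user name plus the leading words of mv_str.
    key = pd.concat([frame["name"], words], axis=1)
    # duplicated() flags every repeat after the first occurrence of a key;
    # missing values compare equal to each other here.
    return frame[~key.duplicated()]


result = drop_similar_rows(df)
print(result)
```

Because the key includes `name`, the row `def,doc str 2D` survives even though `abc` has a row with the same first two words. Note that missing words compare equal, so two single-word rows with the same word under one name would collapse into one.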