Grouping a pandas text column based on a unique ID

Time: 2020-06-13 19:52:49

Tags: python pandas pandas-groupby

I have the following CSV file:

itemid  testresult      duplicateid
100     textboxerror            0
101     text_input_issue        100
102     menuitemerror           0
103     text_click_issue        100
104     text_caps_error         100
105     menu_drop_down_error    102
106     text_lower_error        100
107     menu_item_null          102
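
To make the table above reproducible, here is a minimal sketch that loads it into a DataFrame. It assumes the data is whitespace-separated exactly as displayed; a real comma-separated file would normally be read with a plain `pd.read_csv(path)`. The variable name `raw` is only for illustration.

```python
import io

import pandas as pd

# The table from the question, pasted as text (assumed whitespace-separated).
raw = """itemid  testresult      duplicateid
100     textboxerror            0
101     text_input_issue        100
102     menuitemerror           0
103     text_click_issue        100
104     text_caps_error         100
105     menu_drop_down_error    102
106     text_lower_error        100
107     menu_item_null          102
"""

# sep=r"\s+" splits on any run of whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```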

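Based on duplicateid, I want to turn the testresult column of the table above into two columns, where the new column holds the similar test results. The resulting table should look like this:

Required dataframe: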

index   testresult     similartestresults   duplicateid
1       textboxerror    text_click_issue        100
2       textboxerror    text_caps_error         100
3       textboxerror    text_caps_error         100
4       textboxerror    text_lower_error        100
5       menuitemerror   menu_drop_down_error    102
6       menuitemerror   menu_item_null          102

I tried using pandas groupby, but it only gives a single list. The code is as follows:

df1 =  df.groupby(["duplicateid", "testresult"])
print (df1)
print (df1.groups)

df['similartestresults'] = df.groupby("duplicateid")['testresult'].apply(lambda tags: ','.join(tags))
print (df2)

Neither of the two approaches above gave the desired result. Please advise. Thanks, TSJ

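1 Answer:

Answer 0: (score: 0)

Copy the test result column, then overwrite the original with the first four characters of each value as a provisional group name. Replace those prefixes with the final group names. Then drop the rows that are not needed (those with duplicateid equal to 0) and sort by duplicateid. Does this match the intent of your question?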

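df['simlartestresult'] = df['testresult'].copy()

# Use the first four characters of each result as a provisional group name
df['testresult'] = df['simlartestresult'].str[:4]
# Map each prefix to its final group name (assigned back rather than using inplace=True on a column)
df['testresult'] = df['testresult'].replace(['text', 'menu'], ['textboxerror', 'menuitemerror'])

# Drop the rows where duplicateid == 0, then sort by duplicateid
df = df[df['duplicateid'] != 0]
df = df.sort_values('duplicateid', ascending=True)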

df
    itemid  testresult     duplicateid  simlartestresult
1   101     textboxerror   100          text_input_issue
3   103     textboxerror   100          text_click_issue
4   104     textboxerror   100          text_caps_error
6   106     textboxerror   100          text_lower_error
5   105     menuitemerror  102          menu_drop_down_error
7   107     menuitemerror  102          menu_item_null
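
The approach above depends on every result in a group sharing the same four-character prefix. As an alternative, here is a minimal sketch that uses a self-merge instead. It assumes that duplicateid == 0 marks an original row and that any other duplicateid is the itemid of the original it duplicates; this reading is inferred from the sample data and is not stated in the question. The names `originals`, `duplicates` and `result` are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "itemid": [100, 101, 102, 103, 104, 105, 106, 107],
    "testresult": [
        "textboxerror", "text_input_issue", "menuitemerror",
        "text_click_issue", "text_caps_error", "menu_drop_down_error",
        "text_lower_error", "menu_item_null",
    ],
    "duplicateid": [0, 100, 0, 100, 100, 102, 100, 102],
})

# Assumption: rows with duplicateid == 0 are the originals.
originals = df.loc[df["duplicateid"] == 0, ["itemid", "testresult"]]
# Every other row duplicates the original whose itemid equals its duplicateid.
duplicates = df.loc[df["duplicateid"] != 0, ["testresult", "duplicateid"]]

result = (
    duplicates
    .sort_values("duplicateid", kind="stable")  # keep file order within each group
    .rename(columns={"testresult": "similartestresults"})
    .merge(
        originals.rename(columns={"itemid": "duplicateid"}),
        on="duplicateid",
        how="left",
    )
    .loc[:, ["testresult", "similartestresults", "duplicateid"]]
)
result.index = result.index + 1  # 1-based index, as in the requested table

print(result)
```

Because the join key is the ID itself, this does not rely on how the result strings are named, and it leaves a missing value in testresult if a duplicateid points at an itemid that is not present.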