Question

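Please bear with me, as I am still learning how to clean data with Python.

I have a dataset with a column that needs cleaning. It is a string column containing several different statements that are all quite similar to one another. I have attached a frequency table for reference: https://gyazo.com/7070364e424eae3e3e40b76cb3fba4e9

I tried combining .str.contains with np.where, but the string values are so similar that it does not work properly. Are there other strategies that could help recode the column?

Here is my attempt: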

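import numpy as np
import pandas as pd

dm = pt_df['PAT_DECISION_MAKING']

# Boolean masks, one per target label (matching is case-sensitive)
myself = dm.str.contains('Autonomous', case = True)
our_fam = dm.str.contains('family centered', case = True)
auth1 = dm.str.contains('authority figure', case = True)
# Escape the dots so 'a.' and 'b.' match a literal period rather than any character
both = dm.str.contains(r'a\.|b\.', case = True)

# Nested np.where: the first matching mask wins; unmatched rows keep their text minus hyphens
pt_df['PAT_DECISION_MAKING'] = np.where(myself, 'Myself',
                                   np.where(our_fam, 'Family Centered',
                                            np.where(auth1, 'Authority Figure',
                                                     np.where(both, 'Multiple',
                                                              dm.str.replace('-', '')))))

pt_df['PAT_DECISION_MAKING'] = pd.Categorical(pt_df.PAT_DECISION_MAKING)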

Answer 1

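Casting the column to the category dtype may help. After that, you can easily use cat.codes to convert the categories to integer codes.

Make your column a category dtype: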

pt_df['PAT_DECISION_MAKING'] = pt_df['PAT_DECISION_MAKING'].astype('category')
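
To see what this cast does, here is a minimal sketch on made-up data. The toy frame and its values are assumptions for illustration only, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for pt_df; the values are invented for illustration.
pt_df = pd.DataFrame({
    'PAT_DECISION_MAKING': ['Myself', 'Family Centered', 'Myself', 'Authority Figure'],
})

pt_df['PAT_DECISION_MAKING'] = pt_df['PAT_DECISION_MAKING'].astype('category')

# Each distinct label is stored once; for strings they are sorted alphabetically.
dtype_name = str(pt_df['PAT_DECISION_MAKING'].dtype)
categories = list(pt_df['PAT_DECISION_MAKING'].cat.categories)
print(dtype_name)
print(categories)
```

The cast does not merge similar strings: every distinct spelling becomes its own category, so any text cleanup still has to happen before or after it.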

Then take this column and assign its category codes:

pt_df['PAT_DECISION_MAKING'] = pt_df['PAT_DECISION_MAKING'].cat.codes
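
As a rough illustration of the codes step, the sketch below uses an invented Series (not the real data) and also keeps a lookup from code back to label, since the codes replace the text. The names `col`, `codes`, and `code_to_label` are hypothetical.

```python
import pandas as pd

# Invented values for illustration; None stands in for a missing entry.
col = pd.Series(
    ['Myself', 'Family Centered', 'Myself', 'Authority Figure', None]
).astype('category')

# .cat is an accessor on the whole Series, so no apply is needed.
codes = col.cat.codes

# Keep a code -> label lookup before overwriting the text column.
code_to_label = dict(enumerate(col.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Missing values receive the code -1, which has no entry in the lookup, so it may be worth checking for them before treating the codes as final.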

Pythonic way to recode string data into categories?
