Question

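I have a pandas DataFrame with the columns patient_id, patient_sex and patient_dob (plus other, less relevant columns). Rows can share a patient_id, because each patient may have several entries covering different medical procedures. However, I have found that many patient_ids are overloaded, i.e. several patients were assigned the same id (as evidenced by a single patient_id being associated with multiple sexes and multiple dates of birth).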
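
The overloading described above can be checked with a short sketch like the one below. The toy data is hypothetical and only the three column names come from the question; it counts how many distinct (sex, dob) pairs appear under each patient_id.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
patients = pd.DataFrame({
    'patient_id': [1, 1, 1, 2, 2],
    'patient_sex': ['F', 'F', 'M', 'M', 'M'],
    'patient_dob': ['1980-01-01', '1980-01-01', '1975-05-05',
                    '1990-02-02', '1990-02-02'],
})

# Count the distinct (sex, dob) pairs seen under each patient_id.
pairs_per_id = (
    patients[['patient_id', 'patient_sex', 'patient_dob']]
    .drop_duplicates()
    .groupby('patient_id')
    .size()
)

# Ids that map to more than one (sex, dob) pair are overloaded.
overloaded_ids = pairs_per_id[pairs_per_id > 1].index.tolist()
```

In this toy frame only patient_id 1 is overloaded, since it covers two different (sex, dob) pairs.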

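To refactor the ids so that each patient has a unique one, my plan is to group the data not only by patient_id but also by patient_sex and patient_dob. I figure this should be enough to separate the data into individual patients (and if two patients with the same sex and dob happen to have been assigned the same id, then so be it).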

Here is the code I currently use:

# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()

# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique 
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))

# Merge only the new id column back into the original patients frame,
# so the other columns are not duplicated with _x/_y suffixes
patients_with_new_ids = patients.merge(indv_patients[[new_patient_id]], left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)

# Remove the original id column and let the new one take its name
patients_with_new_ids = patients_with_new_ids.drop(columns=['patient_id'])

patients = patients_with_new_ids.rename(columns={new_patient_id: 'patient_id'})

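The problem is that with over 7 million patients this solution is far too slow, and the biggest bottleneck is the for loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, as long as it is unique for each patient.)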

Answer 1

I don't know what the column values are, but have you tried something like this?

patients['new_patient_id'] = patients.apply(lambda x: x['patient_id'] + x['patient_sex'] + x['patient_dob'],axis=1)

This should create a new column, new_patient_id, which you can then use with groupby.
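
A vectorized alternative to the row-wise apply, sketched here under assumptions: the toy data and the `procedure` column are hypothetical, and any unique value is taken to be an acceptable id, as the question states. It relies on pandas' `GroupBy.ngroup()`, which labels every row with the number of the group it belongs to, so no Python-level loop or merge is needed.

```python
import pandas as pd

# Hypothetical toy data; 'procedure' stands in for the other columns.
patients = pd.DataFrame({
    'patient_id': [1, 1, 1, 2, 2],
    'patient_sex': ['F', 'F', 'M', 'M', 'M'],
    'patient_dob': ['1980-01-01', '1980-01-01', '1975-05-05',
                    '1990-02-02', '1990-02-02'],
    'procedure': ['a', 'b', 'c', 'd', 'e'],
})

key_columns = ['patient_id', 'patient_sex', 'patient_dob']

# Number each distinct (id, sex, dob) combination in a single pass.
new_ids = patients.groupby(key_columns, sort=False).ngroup()

# Replace the overloaded id column with the new group numbers.
patients = patients.drop(columns=['patient_id']).assign(patient_id=new_ids)
```

Rows with a missing value in any of the key columns need separate care, because groupby leaves them out of every group by default and they would not receive a regular id.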

Pandas - Splitting and refactoring overloaded IDs and columns