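I'm a surgeon trying to analyse some patient data. I have a dataframe (271x15) of patients who have had multiple operations. It was derived, with a lot of help from @Arne, from a larger (4010x71) dataframe of single operations. Essentially (see the original post), a pivot table was then used to find patients with multiple (>= 2) operations. This works great. I'm interested in the first two operations and their dates, so that I can get the number of days between the two operations and see how long an implant lasts. The head of the dataframe is below, showing the patient ID and the codes for implant insertion and removal (V011 and V014).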
OPERTN_01 OPDATE_01
ID
11 [V011, V014] [2016-06-21, 2017-02-27]
13 [V011, V014] [2016-07-14, 2016-01-14]
14 [V014, V011] [2014-02-25, 2014-07-01]
15 [V014, V011] [2014-06-26, 2015-04-16]
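I was hoping to subtract the dates of the two operations using pd.datetime, but I'm stuck on removing the brackets. I tried df.replace("[", ""), which has no effect on either the dataframe or the series OPERTN_01. Ideally, I'd like to remove the square brackets across the whole dataframe rather than column by column.
The lists generated in this dataframe (thanks @Arne) give nice descriptive statistics, but I find them hard to manipulate.
I also have the problem that the dates in OPDATE_01 are not sorted, so the difference between the dates is often negative. Of course, I may be trying to do too much in one go.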
Answer 0 (score: 1)
Are you looking for something like this:
from io import StringIO
import ast
import pandas as pd
# ------ create sample data ------
s = """ID;OPERTN_01;OPDATE_01
11;["V011", "V014"];["2016-06-21", "2017-02-27"]
13;["V011", "V014"];["2016-07-14", "2016-01-14"]
14;["V014", "V011"];["2014-02-25", "2014-07-01"]
15;["V014", "V011"];["2014-06-26", "2015-04-16"]"""
df = pd.read_csv(StringIO(s), sep=';')
df['OPERTN_01'] = df['OPERTN_01'].apply(ast.literal_eval)
df['OPDATE_01'] = df['OPDATE_01'].apply(ast.literal_eval)
df = df.set_index('ID')
# ------ end sample data ------
# list comprehension to sort and convert str to datetime
df['OPDATE_01'] = [sorted([pd.to_datetime(x[0]), pd.to_datetime(x[1])]) for x in df['OPDATE_01']]
# if your values in the list are already datetime then ignore what is above and do
# df['OPDATE_01'] = df['OPDATE_01'].apply(sorted)
# apply pd.Series to explode your list into columns and then rename col if you want
date = df['OPDATE_01'].apply(pd.Series).rename(columns={0:'OPDATE_01_0', 1:'OPDATE_01_1'})
# calculate the difference between dates
date.diff(axis=1)
OPDATE_01_0 OPDATE_01_1
ID
11 NaT 251 days
13 NaT 182 days
14 NaT 126 days
15 NaT 294 days
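The question asks for the number of days, while the difference above is a timedelta. A minimal sketch of one way to get plain integers, assuming the two date columns are already datetime as in the output above (the small frame below is hypothetical sample data, and the names gap and days are illustrative only):

```python
import pandas as pd

# Hypothetical two-row frame mirroring the sorted date columns above
date = pd.DataFrame(
    {
        "OPDATE_01_0": pd.to_datetime(["2016-06-21", "2016-01-14"]),
        "OPDATE_01_1": pd.to_datetime(["2017-02-27", "2016-07-14"]),
    },
    index=pd.Index([11, 13], name="ID"),
)

# Subtracting two datetime columns gives a timedelta Series
gap = date["OPDATE_01_1"] - date["OPDATE_01_0"]

# .dt.days extracts the whole-day component as integers
days = gap.dt.days
print(days.tolist())
```

Integer days are usually easier to feed into describe() or a plot than timedelta values.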
# list comprehension to sort and convert list to datetime
df['OPDATE_01'] = [sorted([pd.to_datetime(x[0]), pd.to_datetime(x[1])]) for x in df['OPDATE_01']]
# if your values in the list are already datetime then ignore what is above and do
# df['OPDATE_01'] = df['OPDATE_01'].apply(sorted)
# apply pd.Series to explode your list into columns and then rename col if you want
date = df['OPDATE_01'].apply(pd.Series).rename(columns={0:'OPDATE_01_0', 1:'OPDATE_01_1'})
# merge two frames on ID to maintain all columns
m = df['OPERTN_01'].to_frame().merge(date, left_index=True, right_index=True)
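# calc diff and assign to new column; subtract the two date columns directly,
# because a column-wise diff over the whole frame would also touch the OPERTN_01 lists
m['diff'] = m['OPDATE_01_1'] - m['OPDATE_01_0']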
OPERTN_01 OPDATE_01_0 OPDATE_01_1 diff
ID
11 [V011, V014] 2016-06-21 2017-02-27 251 days
13 [V011, V014] 2016-01-14 2016-07-14 182 days
14 [V014, V011] 2014-02-25 2014-07-01 126 days
15 [V014, V011] 2014-06-26 2015-04-16 294 days
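On why df.replace("[", "") had no effect: this is a sketch under the assumption that each cell holds a real Python list (which is what the pivot in the earlier post appears to produce), so the brackets are only how pandas displays the list and there is no "[" character to replace. Even if the cells were strings, replace with a plain string matches whole cell values unless regex=True is passed. Instead of stripping brackets, the lists can be unpacked into one column per element. The column names OPERTN_01_0 and OPERTN_01_1 are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with list-valued cells, as in the question
df = pd.DataFrame(
    {"OPERTN_01": [["V011", "V014"], ["V014", "V011"]]},
    index=pd.Index([11, 14], name="ID"),
)

# Each cell is a list object; the brackets are display only
first_cell = df.loc[11, "OPERTN_01"]
print(type(first_cell))

# Build one column per list element instead of removing brackets
codes = pd.DataFrame(
    df["OPERTN_01"].tolist(),
    index=df.index,
    columns=["OPERTN_01_0", "OPERTN_01_1"],
)
print(codes)
```

This does the same job as apply(pd.Series) above, and is typically faster on larger frames.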
# just changing variable name to match your comment
df_implants = m
# convert OPERTN_01 to a string
s = df_implants['OPERTN_01'].apply(str)
# boolean indexing to filter df_implants where OPERTN_01 is equal to ['V011', 'V014']
v011v014 = df_implants[(s == "['V011', 'V014']")]
# boolean indexing to filter df_implants where OPERTN_01 is equal to ['V014', 'V011']
v014v011 = df_implants[(s == "['V014', 'V011']")]
OPERTN_01 OPDATE_01_0 OPDATE_01_1 diff
ID
11 [V011, V014] 2016-06-21 2017-02-27 251 days
13 [V011, V014] 2016-01-14 2016-07-14 182 days
OPERTN_01 OPDATE_01_0 OPDATE_01_1 diff
ID
14 [V014, V011] 2014-02-25 2014-07-01 126 days
15 [V014, V011] 2014-06-26 2015-04-16 294 days
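One caveat with the approach above: only the dates are sorted, so for ID 13 the dates are swapped while OPERTN_01 still reads [V011, V014], and the codes no longer line up with their dates. If that pairing matters, the following is a rough sketch of sorting each code together with its date. It assumes both columns hold lists of equal length per row; the name sorted_pairs is hypothetical.

```python
import pandas as pd

# Hypothetical sample: ID 13 has its dates out of order
df = pd.DataFrame(
    {
        "OPERTN_01": [["V011", "V014"], ["V011", "V014"]],
        "OPDATE_01": [["2016-06-21", "2017-02-27"], ["2016-07-14", "2016-01-14"]],
    },
    index=pd.Index([11, 13], name="ID"),
)

# Pair every date with its code, then sort the pairs by date
sorted_pairs = [
    sorted(zip(pd.to_datetime(dates), codes))
    for dates, codes in zip(df["OPDATE_01"], df["OPERTN_01"])
]

# Write the re-ordered dates and codes back as list-valued columns
aligned = df.copy()
aligned["OPDATE_01"] = pd.Series(
    [[d for d, _ in pairs] for pairs in sorted_pairs], index=df.index
)
aligned["OPERTN_01"] = pd.Series(
    [[c for _, c in pairs] for pairs in sorted_pairs], index=df.index
)
print(aligned)
```

With the pairs kept together, filtering on the code order afterwards separates insertion-then-removal cases from removal-then-insertion cases according to the actual dates.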