Creating a function that finds dates and splits them into pandas dataframe columns

Asked: 2017-07-18 18:55:39

Tags: python pandas datetime dictionary

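I have a dataframe, data, with about 70 columns. The ones I care about are IDX (a unique identifier for each record) and Text (which holds very long strings that are of little use apart from the dates they contain). The task is to extract the dates, make sure they are valid, and create one column per date. Typically each IDX/Text pair has 0 to 4 dates.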

This is what I have so far. It takes forever to run, and I need a better solution.

data:
IDX    RID      Text
100    10      6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
101    20      7/17/06-advil, qui;
102    10      7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
103    40      9/26/06-penicilin, tramadol;
104    91      5/23/06-penicilin, amoxicilin, tylenol;
105    84      10/20/06-ibuprofen, tramadol;
106    17      12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
107    23      12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
108    15      Follow up appt. scheduled
109    69      talk to care giver
110    32      12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
111    70      12/1/06?Follow up but no serious allergies
112    70      12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil

data_Dict = data.set_index('IDX')['Text'].to_dict()
def find_Date(df, data_Dict):
    Dates ={}
    for k, v in data_Dict.items():
        date_V = v
        matches = list(datefinder.find_dates(date_V))
        if len(matches) > 0:
            date_ = [format(matches[i], "%m/%d/%Y") for i in range(0,len(matches))]
        else:
            date_ = []
        date_.sort()
        Dates[k] = ', '.join([str(dates) for dates in date_])
        df['Dates'] = df['IDX'].map(Dates)
        date_types = pd.to_datetime(df["Dates"], errors='coerce')
        try:
            if date_types[0]:
                df['Date1'] = df['IDX'].map(date_types[0])
            elif date_types[1]:
                df['Date2'] = df['IDX'].map(date_types[1])
            elif date_types[2]:
                df['Date3'] = df['IDX'].map(date_types[2])
            elif date_types[3]:
                df['Date4'] = df['IDX'].map(date_types[3])
        except:
            print ("invalid date")
        df = df.drop('Dates', 1)

This still doesn't produce any output...

def find_Date_(df):
    pd.to_datetime(df.set_index('IDX')['Text'].str.extractall('(\d{1,2}[-/]\d{1,2}[-/]\d{2})')[0],errors='coerce').dropna().unstack().rename(columns=lambda x: x + 1).add_prefix('Date')

find_Date_(data)

Thanks, everyone!

1 answer:

Answer 0 (score: 1)

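I'm still not sure exactly what you're after...
...but this finds everything that looks like it could be a date, tries to parse it, and returns the first one that parsed successfully.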

pd.to_datetime(
    data.set_index('IDX')['Text'].str.extractall(
        r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})'
    )[0],
    errors='coerce'
).dropna().unstack()[0]

IDX
100   2006-06-26
101   2006-07-17
102   2006-07-19
103   2006-09-26
104   2006-05-23
105   2006-10-20
106   2006-12-19
107   2006-12-19
110   2006-12-15
111   2006-12-01
112   2006-12-12
Name: 0, dtype: datetime64[ns]
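It may help to see the chain broken into its intermediate steps. The following is a minimal sketch on a small hypothetical frame that copies three rows from the question. The explicit format="%m/%d/%y" is an assumption added here so that parsing does not depend on format inference; the original call does not pass it.

```python
import pandas as pd

# Small hypothetical frame mirroring the question's IDX/Text layout.
data = pd.DataFrame({
    "IDX": [102, 108, 110],
    "Text": [
        "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
        "Follow up appt. scheduled",
        "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
    ],
})

# Step 1: one row per regex match, indexed by (IDX, match number).
# Rows whose Text has no match (IDX 108 here) simply do not appear.
matches = data.set_index("IDX")["Text"].str.extractall(
    r"(\d{1,2}[-/]\d{1,2}[-/]\d{2})"
)[0]

# Step 2: parse the matched strings; anything unparseable becomes NaT.
# The explicit format is an assumption (month/day/two-digit year).
parsed = pd.to_datetime(matches, format="%m/%d/%y", errors="coerce")

# Step 3: drop parse failures, then pivot the match number into columns.
wide = parsed.dropna().unstack()
print(wide)
```

Taking [0] on the final frame then selects the column for match number 0, which is what the snippet above does.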

To keep all of the parsed dates:

pd.to_datetime(
    data.set_index('IDX')['Text'].str.extractall(
        r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})'
    )[0],
    errors='coerce'
).dropna().unstack()

match          0          1          2          3
IDX                                              
100   2006-06-26        NaT        NaT        NaT
101   2006-07-17        NaT        NaT        NaT
102   2006-07-19 2006-08-31        NaT        NaT
103   2006-09-26        NaT        NaT        NaT
104   2006-05-23        NaT        NaT        NaT
105   2006-10-20        NaT        NaT        NaT
106   2006-12-19 2009-12-01 2010-06-18 2011-03-07
107   2006-12-19 2009-12-01 2010-06-18 2011-03-07
110   2006-12-15 2007-02-16 2016-06-08        NaT
111   2006-12-01        NaT        NaT        NaT
112   2006-12-12 2007-01-26        NaT        NaT
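Since the question asks that the dates be valid, here is a small sketch of how errors='coerce' together with dropna() filters out strings that match the pattern but are not real calendar dates. The sample strings are made up for illustration, and the explicit format is again an assumption rather than part of the original call.

```python
import pandas as pd

# Hypothetical candidates: two real dates, one impossible month/day,
# and one impossible day for February.
candidates = pd.Series(["6/26/06", "13/45/06", "2/30/07", "12/1/09"])

# Invalid dates are turned into NaT instead of raising an error.
parsed = pd.to_datetime(candidates, format="%m/%d/%y", errors="coerce")

# dropna() then keeps only the entries that parsed to a real date.
valid = parsed.dropna()
print(valid)
```

Note that the pattern only matches two-digit years, so a fragment such as 11/2007 in the sample data is never extracted in the first place.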

To get the desired column names:

pd.to_datetime(
    data.set_index('IDX')['Text'].str.extractall(
        r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})'
    )[0],
    errors='coerce'
).dropna().unstack().rename(columns=lambda x: x + 1).add_prefix('Date')

match      Date1      Date2      Date3      Date4
IDX                                              
100   2006-06-26        NaT        NaT        NaT
101   2006-07-17        NaT        NaT        NaT
102   2006-07-19 2006-08-31        NaT        NaT
103   2006-09-26        NaT        NaT        NaT
104   2006-05-23        NaT        NaT        NaT
105   2006-10-20        NaT        NaT        NaT
106   2006-12-19 2009-12-01 2010-06-18 2011-03-07
107   2006-12-19 2009-12-01 2010-06-18 2011-03-07
110   2006-12-15 2007-02-16 2016-06-08        NaT
111   2006-12-01        NaT        NaT        NaT
112   2006-12-12 2007-01-26        NaT        NaT
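If this is wrapped in a function, the function has to return the result, otherwise calling it shows nothing. Below is a rough sketch under a few assumptions: the helper name find_dates and the constant DATE_PATTERN are hypothetical, the explicit date format is not in the original call, and the left join back onto the frame is one possible way to keep rows that contain no dates (such as IDX 108 and 109).

```python
import pandas as pd

# Same pattern as above, written as a raw string.
DATE_PATTERN = r"(\d{1,2}[-/]\d{1,2}[-/]\d{2})"


def find_dates(df):
    """Hypothetical helper: return df with Date1..DateN columns appended."""
    dates = (
        pd.to_datetime(
            df.set_index("IDX")["Text"].str.extractall(DATE_PATTERN)[0],
            format="%m/%d/%y",  # assumption: month/day/two-digit year
            errors="coerce",
        )
        .dropna()
        .unstack()
        .rename(columns=lambda x: x + 1)
        .add_prefix("Date")
    )
    # Left join on IDX keeps rows without any date; their Date columns are NaT.
    return df.join(dates, on="IDX")


# Two rows copied from the question's sample data.
data = pd.DataFrame({
    "IDX": [102, 108],
    "Text": [
        "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
        "Follow up appt. scheduled",
    ],
})

result = find_dates(data)
print(result)
```

The function does not modify data in place, so the returned frame needs to be assigned, as in result = find_dates(data).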