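I have a data frame called data with about 70 columns. I am interested in the columns IDX (the unique identifier of each record) and Text (which contains very long strings that are not very useful apart from the dates in them). The task is to extract the dates, make sure they are valid, and create one column per date. Typically, each IDX/Text pair has 0 to 4 dates.
Here is what I have done so far. It takes forever to run, and I need a better solution.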
data:
IDX RID Text
100 10 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
101 20 7/17/06-advil, qui;
102 10 7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
103 40 9/26/06-penicilin, tramadol;
104 91 5/23/06-penicilin, amoxicilin, tylenol;
105 84 10/20/06-ibuprofen, tramadol;
106 17 12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
107 23 12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
108 15 Follow up appt. scheduled
109 69 talk to care giver
110 32 12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
111 70 12/1/06?Follow up but no serious allergies
112 70 12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil
import datefinder
import pandas as pd

data_Dict = data.set_index('IDX')['Text'].to_dict()

def find_Date(df, data_Dict):
    Dates = {}
    # Collect every date that datefinder detects in each record's text
    for k, v in data_Dict.items():
        date_V = v
        matches = list(datefinder.find_dates(date_V))
        if len(matches) > 0:
            date_ = [format(matches[i], "%m/%d/%Y") for i in range(0, len(matches))]
        else:
            date_ = []
        date_.sort()
        Dates[k] = ', '.join([str(dates) for dates in date_])
    df['Dates'] = df['IDX'].map(Dates)
    date_types = pd.to_datetime(df["Dates"], errors='coerce')
    # Attempt to spread the parsed dates over Date1..Date4
    try:
        if date_types[0]:
            df['Date1'] = df['IDX'].map(date_types[0])
        elif date_types[1]:
            df['Date2'] = df['IDX'].map(date_types[1])
        elif date_types[2]:
            df['Date3'] = df['IDX'].map(date_types[2])
        elif date_types[3]:
            df['Date4'] = df['IDX'].map(date_types[3])
    except Exception:
        print("invalid date")
    df = df.drop('Dates', axis=1)
This still does not produce any output...
def find_Date_(df):
    pd.to_datetime(df.set_index('IDX')['Text'].str.extractall(r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})')[0], errors='coerce').dropna().unstack().rename(columns=lambda x: x + 1).add_prefix('Date')

find_Date_(data)
Thanks, everyone!
Answer 0 (score: 1)
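I'm still not sure what you're after...
...but this finds everything that looks like a date, tries to parse it, and returns the first match for each IDX where it parsed successfully.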
pd.to_datetime(
    data.set_index('IDX')['Text'].str.extractall(
        r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})'
    )[0],
    errors='coerce'
).dropna().unstack()[0]
IDX
100 2006-06-26
101 2006-07-17
102 2006-07-19
103 2006-09-26
104 2006-05-23
105 2006-10-20
106 2006-12-19
107 2006-12-19
110 2006-12-15
111 2006-12-01
112 2006-12-12
Name: 0, dtype: datetime64[ns]
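In case the one-liner is hard to follow, here is the same chain split into steps. This is only a sketch: the three-row data frame is a made-up sample shaped like the question's data, and the intermediate names (pattern, candidates, parsed, wide, first_dates) are mine, not part of the original code.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the question's data.
data = pd.DataFrame({
    "IDX": [102, 108, 111],
    "Text": [
        "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
        "Follow up appt. scheduled",
        "12/1/06?Follow up but no serious allergies",
    ],
})

pattern = r"(\d{1,2}[-/]\d{1,2}[-/]\d{2})"

# Step 1: one row per regex match, indexed by (IDX, match number).
candidates = data.set_index("IDX")["Text"].str.extractall(pattern)[0]

# Step 2: try to parse every candidate; failures become NaT.
parsed = pd.to_datetime(candidates, errors="coerce")

# Step 3: drop the failures, then pivot the match numbers into columns.
wide = parsed.dropna().unstack()

# Step 4: column 0 holds the first match of each record.
first_dates = wide[0]
print(first_dates)
```

Records without any match (IDX 108 here) never make it into candidates, which is why they are missing from the result.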
Keep all parsed dates
pd.to_datetime(
    data.set_index('IDX')['Text'].str.extractall(
        r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})'
    )[0],
    errors='coerce'
).dropna().unstack()
match          0          1          2          3
IDX
100   2006-06-26        NaT        NaT        NaT
101   2006-07-17        NaT        NaT        NaT
102   2006-07-19 2006-08-31        NaT        NaT
103   2006-09-26        NaT        NaT        NaT
104   2006-05-23        NaT        NaT        NaT
105   2006-10-20        NaT        NaT        NaT
106   2006-12-19 2009-12-01 2010-06-18 2011-03-07
107   2006-12-19 2009-12-01 2010-06-18 2011-03-07
110   2006-12-15 2007-02-16 2016-06-08        NaT
111   2006-12-01        NaT        NaT        NaT
112   2006-12-12 2007-01-26        NaT        NaT
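On the "make sure they are valid" part: the regex only checks the shape of the string, so something like 13/45/06 still matches, and it is errors='coerce' that turns unparseable candidates into NaT. The sketch below shows this on a few made-up strings. Passing format="%m/%d/%y" is my assumption that every date is month/day/two-digit year with slashes; without it pandas infers the layout, and how strictly it rejects odd strings can vary between versions.

```python
import pandas as pd

# Hypothetical strings that all match the regex; only some are real dates.
candidates = pd.Series(["6/26/06", "13/45/06", "2/30/07", "12/1/09"])

# Assumed layout: month/day/two-digit year. Anything else becomes NaT.
parsed = pd.to_datetime(candidates, format="%m/%d/%y", errors="coerce")

# True where the candidate is a real calendar date.
is_valid = parsed.notna()
print(pd.DataFrame({"candidate": candidates, "parsed": parsed, "valid": is_valid}))
```

The two impossible dates (month 13, and February 30) should come back as NaT, so the dropna() in the chain above removes them.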
Get the desired column names
pd.to_datetime(
    data.set_index('IDX')['Text'].str.extractall(
        r'(\d{1,2}[-/]\d{1,2}[-/]\d{2})'
    )[0],
    errors='coerce'
).dropna().unstack().rename(columns=lambda x: x + 1).add_prefix('Date')
match      Date1      Date2      Date3      Date4
IDX
100   2006-06-26        NaT        NaT        NaT
101   2006-07-17        NaT        NaT        NaT
102   2006-07-19 2006-08-31        NaT        NaT
103   2006-09-26        NaT        NaT        NaT
104   2006-05-23        NaT        NaT        NaT
105   2006-10-20        NaT        NaT        NaT
106   2006-12-19 2009-12-01 2010-06-18 2011-03-07
107   2006-12-19 2009-12-01 2010-06-18 2011-03-07
110   2006-12-15 2007-02-16 2016-06-08        NaT
111   2006-12-01        NaT        NaT        NaT
112   2006-12-12 2007-01-26        NaT        NaT
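If you want this as a function, note that the find_Date_ wrapper in the question has no return statement, so calling it shows nothing. Below is one possible way to wrap the chain, as a sketch only: the name find_dates and the id_col and text_col parameters are hypothetical, and the left join at the end is my addition so that records without any date (such as IDX 108 and 109) stay in the result with NaT in the date columns. It assumes at least one date is found somewhere in the frame.

```python
import pandas as pd

def find_dates(df, id_col="IDX", text_col="Text"):
    """Return df with Date1..DateN columns parsed from text_col."""
    pattern = r"(\d{1,2}[-/]\d{1,2}[-/]\d{2})"
    matches = df.set_index(id_col)[text_col].str.extractall(pattern)[0]
    dates = (
        pd.to_datetime(matches, errors="coerce")
        .dropna()
        .unstack()
        .rename(columns=lambda x: x + 1)  # match 0 -> 1, 1 -> 2, ...
        .add_prefix("Date")
    )
    # Left join on the id column keeps records that have no dates.
    return df.join(dates, on=id_col)

# Hypothetical sample built from three rows of the question's data.
data = pd.DataFrame({
    "IDX": [100, 108, 110],
    "Text": [
        "6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007",
        "Follow up appt. scheduled",
        "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
    ],
})

out = find_dates(data)
print(out.drop(columns="Text"))
```

The result has to be assigned or printed by the caller; the function itself does not modify data.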