I want to split the text at every occurrence of a DateTime and put each piece on a new row. The original data frame looks like this:
Patient_id | Issue
-----------------------------------------------------------------------
1 |12-02-2018 12:15:52 -abc-Patient have headache 20-02-2018 2:15:52 -abc- Previous medication
|had some side effects 20-03-2018 5:30:52 -abc- Patient got cured xyz worked well.
-----------------------------------------------------------------------
2 | 19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and
| have a fever with a body ache 25-02-2018 2:50:52 -cbf-
| Patient got cured by def medicine.
I tried the following, but did not get the expected result:
df = pd.DataFrame(re.split('(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} - .*\s+)',df['Issue'], re.DOTALL),columns = ['Issue'])
Expected result:
Patient_id | Issue
-----------------------------------------------------------------------
1 |12-02-2018 12:15:52 -abc-Patient have headache
| 20-02-2018 2:15:52 -abc- Previous medication had some side effects.
|20-03-2018 5:30:52 -abc-Patient got cured xyz worked well.
-----------------------------------------------------------------------
2 | 19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and have fever with bodyache
|25-02-2018 2:50:52 -cbf-Patient got cured by def medicine.
Then I tried to split the information into separate columns such as number, date, name, and issue:
df = df.Issue.str.split('-',n=3)
df = pd.DataFrame(df.values.tolist(),columns=['Number', 'Date','Name', 'Issue'])
Expected final output:
Patient_id|Name|Date of admission |Issue
------------------------------------------------------------------------------------------
1 |abc |12-02-2018 12:15:52|Patient have headache
------------------------------------------------------------------------------------------
1 |abc |20-02-2018 2:15:52 |Previous medication had some side effects.
-------------------------------------------------------------------------------------------
1 |abc |20-03-2018 5:30:52 |Patient got cured xyz worked well.
-------------------------------------------------------------------------------------------
2 |cbf |19-02-2018 2:50:52 |cbf is allergic def medicine and have to fever with a body ache
--------------------------------------------------------------------------------------------
2 |cbf |25-02-2018 2:50:52 |Patient got cured by def medicine.
Answer 0 (score: 1)
You can try splitting with this regex:
(\d{2}-\d{2}-\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2}(?:(?!\d{2}-\d{2}-\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2})[\S\s])*)
Note that each field of the time stamp can have either 1 or 2 digits, so the pattern uses \d{1,2} there.
https://regex101.com/r/e9kPsd/1
Expanded:
( # (1 start)
\d{2} - \d{2} - \d{4} [ ] \d{1,2} : \d{1,2} : \d{1,2}
(?:
(?! \d{2} - \d{2} - \d{4} [ ] \d{1,2} : \d{1,2} : \d{1,2} )
[\S\s]
)*
) # (1 end)
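As a minimal sketch of how this pattern could be applied from Python, the snippet below uses `re.findall` so that each match is one timestamped entry. The helper name `split_entries` and the whitespace-collapsing step are assumptions added for illustration; they are not part of the original answer.

```python
import re

# Timestamp as in the regex above: dd-mm-yyyy followed by h:m:s with 1 or 2 digits each.
TIMESTAMP = r"\d{2}-\d{2}-\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2}"

# One entry = a timestamp plus every following character that does not start
# another timestamp (the negative lookahead is checked before each character).
ENTRY = re.compile(rf"({TIMESTAMP}(?:(?!{TIMESTAMP})[\S\s])*)")


def split_entries(text):
    """Hypothetical helper: return one string per timestamped entry."""
    # " ".join(m.split()) collapses the line breaks and repeated spaces
    # that come from the wrapped source text.
    return [" ".join(m.split()) for m in ENTRY.findall(text)]


sample = (
    "12-02-2018 12:15:52 -abc-Patient have headache 20-02-2018 2:15:52 -abc- Previous medication\n"
    "had some side effects 20-03-2018 5:30:52 -abc- Patient got cured xyz worked well."
)
entries = split_entries(sample)
for entry in entries:
    print(entry)
```

If the entries need to become rows of a data frame, one option would be to apply the helper to the `Issue` column and then expand the resulting lists into rows.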
Answer 1 (score: 1)
This assumes that the name and the issue text do not contain any digits:
s = df.set_index('Patient_id').Issue
s.str.extractall(r'(?P<Date>\d*-\d*-\d* \d*:\d*:\d*)\s+-(?P<Name>\w*)-(?P<Issue>[\D\s]*)')
This gives:
Date Name Issue
Patient_id match
1 0 12-02-2018 12:15:52 abc Patient have headache
1 20-02-2018 2:15:52 abc Previous medication had some side effects
2 20-03-2018 5:30:52 abc Patient got cured xyz worked well.
2 0 19-02-2018 2:50:52 cbf cbf is allergic def medicine and have a fever...
1 25-02-2018 2:50:52 cbf Patient got cured by def medicine.
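If the issue text may contain digits, the no-digits assumption can be relaxed by stopping the issue at the next timestamp instead of at the next digit. The following is only a sketch of that idea using the standard `re` module. The function name `parse_issues`, the dictionary `raw`, and the output column names are assumptions chosen to mirror the expected final output in the question.

```python
import re

# Timestamp: dd-mm-yyyy followed by h:m:s with 1 or 2 digits each.
TS = r"\d{2}-\d{2}-\d{4} \d{1,2}:\d{1,2}:\d{1,2}"

# Date, then -Name-, then the issue text up to (but excluding) the next timestamp.
RECORD = re.compile(
    rf"(?P<Date>{TS})\s*-(?P<Name>\w+)-\s*(?P<Issue>(?:(?!{TS}).)*)",
    re.DOTALL,  # let "." cross the line breaks inside a wrapped issue
)


def parse_issues(patient_id, text):
    """Hypothetical helper: one dict per timestamped entry in `text`."""
    rows = []
    for m in RECORD.finditer(text):
        rows.append({
            "Patient_id": patient_id,
            "Name": m.group("Name"),
            "Date of admission": m.group("Date"),
            # Collapse line breaks and repeated spaces from the wrapped text.
            "Issue": " ".join(m.group("Issue").split()),
        })
    return rows


# Sample data copied from the question, keyed by Patient_id.
raw = {
    1: "12-02-2018 12:15:52 -abc-Patient have headache 20-02-2018 2:15:52 -abc- Previous medication\n"
       "had some side effects 20-03-2018 5:30:52 -abc- Patient got cured xyz worked well.",
    2: "19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and\n"
       "have a fever with a body ache 25-02-2018 2:50:52 -cbf-\n"
       "Patient got cured by def medicine.",
}
records = [row for pid, text in raw.items() for row in parse_issues(pid, text)]
for row in records:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` to obtain one row per entry with the four columns from the expected final output.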