How do I split a row into new rows at each new datetime entry?

Time: 2019-09-09 19:26:35

Tags: regex python-3.x pandas nlp

I want to split the text wherever a datetime occurs and start a new row at each one. The original DataFrame looks like this:

Patient_id   |        Issue  
 -----------------------------------------------------------------------
 1           |12-02-2018 12:15:52 -abc-Patient have headache 20-02-2018 2:15:52 -abc- Previous medication 
             |had some side effects 20-03-2018 5:30:52 -abc- Patient got cured xyz worked well.
  -----------------------------------------------------------------------              
 2           | 19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and 
             |  have a fever with a body ache 25-02-2018 2:50:52 -cbf-
             |  Patient got cured by def medicine.

I tried the following, but it did not give the expected result:

df = pd.DataFrame(re.split('(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} - .*\s+)',df['Issue'], re.DOTALL),columns = ['Issue'])

Expected result:

Patient_id   |        Issue  
-----------------------------------------------------------------------
 1           |12-02-2018 12:15:52 -abc-Patient have headache 
             | 20-02-2018 2:15:52 -abc- Previous medication had some side effects.
             |20-03-2018 5:30:52 -abc-Patient got cured xyz worked well.
-----------------------------------------------------------------------              
 2            | 19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and have fever with bodyache
              |25-02-2018 2:50:52 -cbf-Patient got cured by def medicine.

After that, I tried to split the information into separate columns such as number, date, name and issue:

df = df.Issue.str.split('-',n=3)

df = pd.DataFrame(df.values.tolist(),columns=['Number', 'Date','Name', 'Issue'])

Expected final output:

Patient_id|Name|Date of admission  |Issue  
------------------------------------------------------------------------------------------
1         |abc |12-02-2018 12:15:52|Patient have headache 
------------------------------------------------------------------------------------------
1         |abc |20-02-2018 2:15:52 |Previous medication had some side effects.
-------------------------------------------------------------------------------------------
1         |abc |20-03-2018 5:30:52 |Patient got cured xyz worked well.
-------------------------------------------------------------------------------------------                
2         |cbf |19-02-2018 2:50:52 |cbf is allergic def medicine and have to fever with a body ache
 --------------------------------------------------------------------------------------------
2         |cbf |25-02-2018 2:50:52 |Patient got cured by def medicine.

2 answers:

Answer 0 (score: 1)

You can try splitting with this regex:

(\d{2}-\d{2}-\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2}(?:(?!\d{2}-\d{2}-\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2})[\S\s])*)

Note that the time fields in the timestamp have a variable width of 1 or 2 digits.

https://regex101.com/r/e9kPsd/1

Expanded:

 (                             # (1 start)
      \d{2} - \d{2} - \d{4} [ ] \d{1,2} : \d{1,2} : \d{1,2} 
      (?:
           (?! \d{2} - \d{2} - \d{4} [ ] \d{1,2} : \d{1,2} : \d{1,2} )
           [\S\s] 
      )*
 )                             # (1 end)
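
As a minimal sketch of how this pattern could be applied in pandas (the answer itself only gives the regex): each match is one timestamp plus all the text up to the next timestamp, so `re.findall` returns one string per entry and `DataFrame.explode` turns each into its own row. The sample frame below is typed in from the question, the names `STAMP` and `ENTRY` are made up for illustration, and `explode` is assumed to be available (pandas 0.25 or later).

```python
import re
import pandas as pd

# Timestamp such as "12-02-2018 12:15:52" or "20-02-2018 2:15:52".
STAMP = r"\d{2}-\d{2}-\d{4} \d{1,2}:\d{1,2}:\d{1,2}"
# One timestamp followed by every character that does not start another timestamp.
ENTRY = re.compile(rf"{STAMP}(?:(?!{STAMP})[\S\s])*")

# Sample data modeled on the question (line breaks inside cells flattened).
df = pd.DataFrame({
    "Patient_id": [1, 2],
    "Issue": [
        "12-02-2018 12:15:52 -abc-Patient have headache 20-02-2018 2:15:52 -abc- "
        "Previous medication had some side effects 20-03-2018 5:30:52 -abc- "
        "Patient got cured xyz worked well.",
        "19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and have a fever "
        "with a body ache 25-02-2018 2:50:52 -cbf- Patient got cured by def medicine.",
    ],
})

# findall yields one string per timestamped entry; explode gives each its own row.
entries = df["Issue"].apply(lambda text: [m.strip() for m in ENTRY.findall(text)])
result = df.assign(Issue=entries).explode("Issue").reset_index(drop=True)
print(result)
```

Using `findall` rather than `re.split` avoids the empty strings that splitting on a capturing group leaves between matches.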

Answer 1 (score: 1)

This assumes that the Name/Issue text does not contain any digits:

s = df.set_index('Patient_id').Issue

s.str.extractall(r'(?P<Date>\d*-\d*-\d* \d*:\d*:\d*)\s+-(?P<Name>\w*)-(?P<Issue>[\D\s]*)')

This gives:

                    Date                Name    Issue
Patient_id  match           
1           0       12-02-2018 12:15:52 abc     Patient have headache
            1       20-02-2018 2:15:52  abc     Previous medication had some side effects
            2       20-03-2018 5:30:52  abc     Patient got cured xyz worked well.
2           0       19-02-2018 2:50:52  cbf     cbf is allergic def medicine and have a fever...
            1       25-02-2018 2:50:52  cbf     Patient got cured by def medicine.
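
To get from this MultiIndex result to the flat table the question asks for, something along the following lines might work. It is only a sketch: the sample frame is typed in from the question, the `Date of admission` column name is taken from the expected output, and the no-digits assumption about the issue text still applies.

```python
import pandas as pd

# Sample data modeled on the question (line breaks inside cells flattened).
df = pd.DataFrame({
    "Patient_id": [1, 2],
    "Issue": [
        "12-02-2018 12:15:52 -abc-Patient have headache 20-02-2018 2:15:52 -abc- "
        "Previous medication had some side effects 20-03-2018 5:30:52 -abc- "
        "Patient got cured xyz worked well.",
        "19-02-2018 2:50:52 -cbf- cbf is allergic def medicine and have a fever "
        "with a body ache 25-02-2018 2:50:52 -cbf- Patient got cured by def medicine.",
    ],
})

# Same pattern as in the answer, written as a raw string.
pattern = r"(?P<Date>\d*-\d*-\d* \d*:\d*:\d*)\s+-(?P<Name>\w*)-(?P<Issue>[\D\s]*)"

out = (
    df.set_index("Patient_id")["Issue"]
      .str.extractall(pattern)
      .reset_index(level="match", drop=True)  # drop the per-patient match counter
      .reset_index()                          # turn Patient_id back into a column
)
out["Issue"] = out["Issue"].str.strip()       # remove spaces left around each issue
out = out.rename(columns={"Date": "Date of admission"})[
    ["Patient_id", "Name", "Date of admission", "Issue"]
]
print(out)
```

The dates are left as strings here; converting them with `pd.to_datetime` would be a separate step.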