Extracting specific patient IDs from email subject lines

Time: 2017-02-21 01:11:13

Tags: r textmatching data-scrubbing

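I want to extract patient IDs from email subject lines. I am working with two data frames: one is the output of a SQL database (containing the email subject lines), and the other holds patient information (hospital name and patient ID).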

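I would like to use the patient IDs to scrub the subject lines in the first data frame and return the hospital associated with each patient. Unfortunately, I cannot provide access to the data.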

## Example Data

Data frame 1 example row:

Column 1 (from_Email): xxxxx@hospital.com 

Column 2 (Time_IN): 1/11/2000 12:00:00

Column 3 (from_Subject): Patient H2445JFLD presented into ER with .... symptoms

Data frame 2 example row:

Column 1 (Hospital Name): Hospital ABC

Column 2 (Patient ID): H2445JFLD 

1 Answer:

Answer 0 (score: 1)

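Since you shared only one row of data, I can't be sure what pattern the email subject line from_Subject follows. If the emails come from an automated system, from_Subject will follow a fixed pattern. Below are three ways to extract Patient_ID from from_Subject.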

library(dplyr)
library(stringr)  # provides word() and str_extract()

# tibble() replaces the deprecated data_frame()
df1 <- tibble(from_Email = "xxxxx@hospital.com",
              Time_IN = "1/11/2000 12:00:00",
              from_Subject = "Patient H2445JFLD presented into ER with .... symptoms")

df2 <- tibble(Hospital_Name = "Hospital ABC",
              Patient_ID = "H2445JFLD")

# The three calls below are alternatives: each one overwrites Patient_ID,
# so keep only the one that fits your subject lines.

# Option 1: extract the 2nd word from the subject line
df1 <- df1 %>% mutate(Patient_ID = stringr::word(from_Subject, 2))
# Option 2: extract the word after "Patient" from the subject line
df1 <- df1 %>% mutate(Patient_ID = stringr::str_extract(from_Subject, '(?<=Patient\\s)\\w+'))
# Option 3: extract a 9-character word made up of A-Z and 0-9 from the subject line
df1 <- df1 %>% mutate(Patient_ID = stringr::str_extract(from_Subject, '\\b[A-Z0-9]{9}\\b'))
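If the subject lines turn out not to follow a fixed pattern, another option is to search each subject for the IDs you already have in the patient table. The following is only a sketch under assumptions: the two extra subject lines are made-up sample data, and the IDs are assumed to be plain alphanumeric strings with no regex metacharacters, so they can be pasted directly into a pattern.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; only the first subject comes from the question
df1 <- tibble(from_Subject = c("Patient H2445JFLD presented into ER with .... symptoms",
                               "Follow-up regarding H2445JFLD discharge",
                               "Weekly staffing update"))
df2 <- tibble(Hospital_Name = "Hospital ABC",
              Patient_ID = "H2445JFLD")

# Build a single alternation pattern from the known IDs: \b(ID1|ID2|...)\b
# Assumes the IDs contain no regex metacharacters.
id_pattern <- paste0("\\b(", paste(df2$Patient_ID, collapse = "|"), ")\\b")

# str_extract() returns the first known ID found in each subject, or NA if none
df1 <- df1 %>% mutate(Patient_ID = str_extract(from_Subject, id_pattern))
```

With a very large patient table the combined pattern can get long, so it may be worth checking performance on a sample first.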

Once you have extracted Patient_ID, all that remains is a simple left join.

# left_join() takes the join column through `by`; it has no `on` argument
left_join(df1, df2, by = "Patient_ID")
# A tibble: 1 × 5
#   from_Email         Time_IN            from_Subject                                           Patient_ID Hospital_Name
#   <chr>              <chr>              <chr>                                                  <chr>      <chr>
# 1 xxxxx@hospital.com 1/11/2000 12:00:00 Patient H2445JFLD presented into ER with .... symptoms H2445JFLD  Hospital ABC
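A left join keeps every email row, so subjects whose extracted ID is missing or not in the patient table come back with NA in Hospital_Name. The sketch below shows one way to list those rows for review. The data frames `emails` and `hospitals` and their contents are hypothetical stand-ins for your real tables.

```r
library(dplyr)

# Hypothetical data: one matching ID, one unknown ID, one failed extraction
emails <- tibble(from_Subject = c("Patient H2445JFLD presented into ER",
                                  "Patient Z0000ZZZZ transferred",
                                  "Weekly staffing update"),
                 Patient_ID = c("H2445JFLD", "Z0000ZZZZ", NA))
hospitals <- tibble(Hospital_Name = "Hospital ABC",
                    Patient_ID = "H2445JFLD")

joined <- left_join(emails, hospitals, by = "Patient_ID")

# Rows where no hospital was found, either because the ID is unknown
# or because no ID could be extracted from the subject line
unmatched <- joined %>% filter(is.na(Hospital_Name))
```

Inspecting `unmatched` is a quick way to judge whether the extraction pattern you chose actually fits your subject lines.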