Extract text according to delimiters, with NA for missing entries

Time: 2019-05-03 09:51:51

Tags: r

I have some text as follows:

 inputString <- "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance:  Good Complications:  None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied  DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"

I want to extract parts of the text into their own columns, according to some text boundaries I have set:

  myWords <- c("Patient Name", "Date of Birth", "Hospital Number", "Date of Procedure", "Endoscopist", "Second Endoscopist", "Trainee", "Referring Physician", "Nurses", "Medications")

Not all of the delimiters are present in the text (but those that are always appear in the same order).

I have a function that should separate them, using the column headings as the start of each boundary:

library(magrittr)  # provides %>%

delim <- myWords
inputStringdf <- data.frame(inputString, stringsAsFactors = FALSE)

# Split on any delimiter; the pieces are assigned to columns by position only
inputStringdf <- inputStringdf %>%
  tidyr::separate(inputString, into = c("added_name", delim),
                  sep = paste(delim, collapse = "|"),
                  extra = "drop", fill = "right")

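However, when nothing is found between two delimiters, or when a delimiter is not present at all, the column is not filled with NA; it is simply filled with the next piece of text found between two delimiters. How can I make sure that each column is filled with the text that belongs to its own delimiter?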
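The shift can be reproduced in base R with a minimal sketch. The shortened record `txt` and the three-element `delims` vector are made up for illustration, and `strsplit` stands in for `tidyr::separate` on the assumption that both hand out the pieces by position:

```r
# Hypothetical shortened record: "Date of Birth" is missing.
txt <- "Patient Name:MRS Comfor Atest Hospital Number:000000"
delims <- c("Patient Name", "Date of Birth", "Hospital Number")

# Splitting on "any delimiter" throws away which delimiter produced each piece.
pieces <- strsplit(txt, paste(delims, collapse = "|"))[[1]]

# Assign the pieces to columns by position, padding on the right with NA.
cols <- c("added_name", delims)
length(pieces) <- length(cols)
res <- setNames(pieces, cols)
res
```

Here the hospital number ends up under "Date of Birth" and "Hospital Number" is NA, which is the behaviour described above.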

1 Answer:

Answer 0: (score: 1)

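Using the input shown in the Note at the end, convert it to DCF format and then read it with read.dcf, which turns the input lines into a character matrix m. Putting each delimiter at the start of its own line makes it a DCF field name, so each value is matched to its field by name rather than by position. See ?read.dcf for more information. No packages are used.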

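# One capture group that matches any of the delimiters
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
# Start a new line before each delimiter so that every field is a "tag:value" line
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
# Blank lines separate the records; the result is a character matrix
m <- read.dcf(textConnection(g))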

Here are the first three columns:

m[, 1:3]
##      Patient Name       Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981"  "000000"       
## [2,] "MRS Comfor Atest" NA            "000000"    
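By default, `m` only has columns for delimiters that occur in at least one record. If a column is wanted for every delimiter, the `fields` argument of `read.dcf` can be used. The following is a sketch under assumptions, not part of the original answer: the two shortened records in `recs` and the four-element `fields` vector are made up for illustration.

```r
# Hypothetical shortened records; the second has no "Date of Birth",
# and "Trainee" appears in neither.
recs <- c(
  "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000",
  "Patient Name:MRS Comfor Atest Hospital Number:000000"
)
fields <- c("Patient Name", "Date of Birth", "Hospital Number", "Trainee")

# Same conversion to DCF as above: one "tag:value" line per delimiter.
pat2 <- sprintf("(%s)", paste(fields, collapse = "|"))
dcf <- gsub(pat2, "\n\\1", paste0(recs, "\n"))

# Ask for every field explicitly; fields absent from a record come back as NA.
con <- textConnection(dcf)
m2 <- read.dcf(con, fields = fields)
close(con)
m2
```

With `fields` supplied, the columns also come back in the order given, so the matrix layout does not depend on which delimiters happen to appear in the data.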

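Note

The input is assumed to have one record per patient; this example has two records. To keep the synthetic input data set simple, the first patient is just repeated, except that the date of birth is omitted from the second record.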

Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))