I have some text like this:
inputString <- "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance: Good Complications: None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"
I want to extract parts of the text into their own columns, based on text boundaries that I define:
myWords <- c("Patient Name", "Date of Birth", "Hospital Number", "Date of Procedure", "Endoscopist", "Second Endoscopist", "Trainee", "Referring Physician", "Nurses", "Medications")
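Not all of the delimiters are present in the text (but those that are present always appear in the same order).
I have a function that is supposed to split the text at these delimiters, using each delimiter as the column heading for the text that follows it: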
library(magrittr)  # provides the %>% pipe used below
delim <- myWords
inputStringdf <- data.frame(inputString, stringsAsFactors = FALSE)
inputStringdf <- inputStringdf %>%
  tidyr::separate(inputString, into = c("added_name", delim),
                  sep = paste(delim, collapse = "|"),
                  extra = "drop", fill = "right")
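However, when no text is found between two delimiters, or when a delimiter is absent, the column is not filled with NA. It is filled with the next piece of text found between two delimiters instead. How can I make sure that each column is filled with the text that belongs to its own delimiter?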
Answer 0 (score: 1)
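Convert the input to DCF format, using the input shown in the note at the end, and then read it with read.dcf, which converts the input lines into a character matrix m. See ?read.dcf for more information. No packages are used.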
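# Alternation of all delimiters, captured so it can be re-inserted as \\1
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
# Append a newline to each record, then start every delimiter on a new line
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
# Parse the "key:value" lines into a character matrix, one row per record
m <- read.dcf(textConnection(g))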
Here are the first three columns:
m[, 1:3]
## Patient Name Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981" "000000"
## [2,] "MRS Comfor Atest" NA "000000"
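One thing to watch with this approach: by default read.dcf only creates columns for fields that occur in at least one record, so a delimiter that never appears in the input (for example "Second Endoscopist") would have no column in m at all. Below is a minimal sketch of one possible way around this, assuming the fields argument of read.dcf is used to request every delimiter explicitly. The names keys, x, dcf, m2 and df, and the short record itself, are made up for illustration and are not part of the original answer.

```r
# Hypothetical delimiters and a short made-up record (not the poster's data)
keys <- c("Patient Name", "Date of Birth", "Hospital Number", "Trainee")
x <- "Patient Name:Jane Doe Hospital Number:123"

# Same transformation as in the answer: put each delimiter on its own line
pat <- sprintf("(%s)", paste(keys, collapse = "|"))
dcf <- gsub(pat, "\n\\1", paste0(x, "\n"))

# Request all delimiters via `fields`, so absent ones still get a column
con <- textConnection(dcf)
m2 <- read.dcf(con, fields = keys)
close(con)

# Convert the character matrix to a data frame for further work
df <- as.data.frame(m2, stringsAsFactors = FALSE)
```

If read.dcf behaves as its documentation describes, m2 should have one column per entry of keys, in that order, with NA under "Date of Birth" and "Trainee" for this record.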
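Note: the input is assumed to have one record per patient, and this example has two records. To keep the synthetic input simple, we just repeat the first patient, except that the date of birth is omitted from the second record.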
Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))