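Background
I have a data frame like the one shown below (made of synthetic data, for those who are interested). It consists of semi-structured text whose sections are separated by headers. The header titles are always the same, but some headers are occasionally missing from a report (those that are present always appear in the same order).
Data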
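My current solution
I created a function that extracts text based on a list of character delimiters (the header names).
At the moment it takes the data frame (x) and the name of the text column (y), together with a start header (stra) and an end header (strb), and finally the name of the column to create (t, which is the start header).
This works, I think. For reference, here is the example data as dput output: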
structure(list(OGDReportWhole = c("Hospital: Random NHS Foundation Trust\nHospital Number: J6044658\nPatient Name: Jargon, Victoria\nGeneral Practitioner: Dr. Martin, Marche\nDate of procedure: 2009-11-11\nEndoscopist: Dr. Sullivan, Shelby\nSecond endoscopist: Dr. al-Basha, Mahfoodha\nMedications: Fentanyl 12.5mcg\nMidazolam 6mg\nInstrument: FG5\nExtent of Exam: GOJ\nIndications: Follow-up ULCER HEALING\nProcedure Performed: Gastroscopy (OGD)\nFindings: No evidence of Barrett's oesophagus, short 2 cn hiatus hernia.,Oesophageal biopsies taken from three levels as requested.,OGD today to assess for ulceration/ongoing bleeding.,Diaphragmatic pinch:40cm .,She has a small hiatus hernia .,We will re-book for 2 weeks, rebanding.,Tiny erosions at the antrum.,Biopsies taken from top of stricture-metal marking clips in situ.,The varices flattened well with air insufflation.,He is on Barrett's Screeling List in October 2017 at St Thomas'.\nHALO 90 done with good effect\nEndoscopic Diagnosis: Post chemo-radiotherapy stricture ",
"Hospital: Random NHS Foundation Trust\nHospital Number: Y6417773\nPatient Name: Powell, Destiny\nGeneral Practitioner: Dr. al-Safi, Lutfiyya\nDate of procedure: 2008-06-15\nEndoscopist: Dr. Kekich, Annabelle\nSecond endoscopist: Dr. Needham, April\nMedications: Fentanyl 125mcg\nMidazolam 7mg\nInstrument: FG6\nExtent of Exam: Pylorus\nIndications: Weight Loss\nProcedure Performed: Gastroscopy (OGD)\nFindings: Duodenum: Duodenitis with a small erosion .,STOMACH: diffuse gastritis with angiodysplasia and punctate bleeding site on greater curve mid body - no obvious ulcer- antrum scar ?,No immediate complications.,Z-line at: 38cm - Bravo placed at 32cm- good positionat check endoscopy.\n\nEndoscopic Diagnosis: Esophageal candidiasis "
)), row.names = 1:2, class = "data.frame")
Here is the function, which I run iteratively:
#' @param x the dataframe
#' @param y the column to extract from
#' @param stra the start of the boundary to extract
#' @param strb the end of the boundary to extract
#' @param t the column name to create
Extractor2 <- function(x, y, stra, strb, t) {
  x <- data.frame(x)
  # Build the new column name from t: keep only letters, digits and commas
  t <- gsub("[^[:alnum:],]", " ", t)
  t <- gsub(" ", "", t, fixed = TRUE)
  # Grab everything from the start header through the end header
  x[, t] <- stringr::str_extract(x[, y], stringr::regex(paste(stra,
    "(.*)", strb, sep = ""), dotall = TRUE))
  # Drop a literal backslash and whatever follows it
  x[, t] <- gsub("\\\\.*", "", x[, t])
  names(x[, t]) <- gsub(".", "", names(x[, t]), fixed = TRUE)
  x[, t] <- gsub(" ", "", x[, t])
  # Strip the headers themselves from the extracted text
  x[, t] <- gsub(stra, "", x[, t], fixed = TRUE)
  if (strb != "") {
    x[, t] <- gsub(strb, "", x[, t], fixed = TRUE)
  }
  x[, t] <- gsub(" ", "", x[, t])
  # ColumnCleanUp() is a helper defined elsewhere (not shown in this post)
  x[, t] <- ColumnCleanUp(x[, t])
  return(x)
}
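To make the core step easier to see in isolation, here is a minimal base-R sketch of "take the text between two headers". The helper name `extract_between` and the short `report` string are made up for illustration and are not part of the original code; the sketch also uses a lazy match and quotes the headers literally, which is a slightly different choice from the greedy `(.*)` used in Extractor2.

```r
# Hypothetical helper (not from the original post): return the text found
# between a start header and an end header, or NA if the pair is not found.
extract_between <- function(text, start, end) {
  # (?s) lets "." match newlines; \\Q...\\E makes the headers match literally
  pattern <- paste0("(?s)\\Q", start, "\\E(.*?)\\Q", end, "\\E")
  hits <- regmatches(text, regexec(pattern, text, perl = TRUE))
  vapply(hits, function(hit) {
    # hit[1] is the whole match, hit[2] is the captured text between headers
    if (length(hit) >= 2) trimws(hit[2]) else NA_character_
  }, character(1))
}

# Shortened, made-up report in the same layout as the example data
report <- paste0("Hospital Number: J6044658\n",
                 "Patient Name: Jargon, Victoria\n",
                 "General Practitioner: Dr. Martin, Marche\n")

extract_between(report, "Hospital Number:", "Patient Name:")
# [1] "J6044658"
```

Because the function is vectorised over `text`, it can be applied directly to a whole column such as `OGDReportWhole`.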
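Problem
I would like the function to accept just a single string (rather than a data frame followed by a column name) and then add it to an empty data frame, keeping the original string as well.
I'm not sure how to convert the function from one that takes a data frame and adds columns to it into one that adds an inputString to an empty data frame. I would like it to produce the same output as the current function.
I'd also welcome general criticism of the function, and any suggestions if there is a better way to do what I'm attempting.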
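One possible shape for such a string-based function is sketched below. It is only an illustration under some assumptions: the name `extract_report` and the `Original` column are invented here, the headers are assumed to be listed in the order they occur, and no ColumnCleanUp-style tidying is applied, so the output will not be identical to what Extractor2 produces. Headers that are missing from the report simply yield NA.

```r
# Hypothetical function (not from the original post): split one report string
# into a one-row data frame, with one column per header plus the original text.
extract_report <- function(input_string, delims) {
  delims <- as.character(delims)  # accept a list or a character vector
  # Position of each header in the string; -1 means the header is absent
  starts <- vapply(delims, function(d) {
    regexpr(d, input_string, fixed = TRUE)[1]
  }, integer(1))
  found <- which(starts > 0)
  values <- rep(NA_character_, length(delims))
  for (k in seq_along(found)) {
    i <- found[k]
    from <- starts[i] + nchar(delims[i])
    # A field runs up to the next header that is present, else to the end
    to <- if (k < length(found)) starts[found[k + 1]] - 1 else nchar(input_string)
    values[i] <- trimws(substr(input_string, from, to))
  }
  # Column names: header text with non-alphanumeric characters removed
  names(values) <- gsub("[^[:alnum:]]", "", delims)
  data.frame(Original = input_string, as.list(values), stringsAsFactors = FALSE)
}

# Made-up example: the third header is absent from the string
delims <- c("Hospital Number:", "Patient Name:", "Date of procedure:")
input_string <- "Hospital Number: J6044658\nPatient Name: Jargon, Victoria\n"
one_row <- extract_report(input_string, delims)
one_row$PatientName
# [1] "Jargon, Victoria"
```

Several reports could then be stacked with something like `do.call(rbind, lapply(reports, extract_report, delims = delims))`, where `reports` is a character vector of report strings.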
**Answer**
OK, thanks to @M-M... I was a bit slow.
The answer is simple. Just create an empty data frame using the list of delimiters and go from there...
# Headers in the order they appear in the report; spelling and case must
# match the report text exactly (the data has "Second endoscopist:").
EndoscTree <- c('Hospital Number:', 'Patient Name:', 'General Practitioner:',
                'Date of procedure:', 'Endoscopist:', 'Second endoscopist:',
                'Medications:', 'Instrument:', 'Extent of Exam:', 'Indications:',
                'Procedure Performed:', 'Findings:', 'Endoscopic Diagnosis:')

# Each pass extracts the text between one header and the next into a new column
for (i in seq_len(length(EndoscTree) - 1)) {
  Mydata <- Extractor2(Mydata, 'OGDReportWhole', EndoscTree[i],
                       EndoscTree[i + 1], EndoscTree[i])
}
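The loop above assumes that Mydata already exists. One way to read "start from an empty data frame" is to seed a one-row data frame with the raw string and then run the same loop over it. The wrapper below is a sketch of that reading: the name `string_to_frame` is invented, and the extractor is passed in as an argument (in practice this would be Extractor2) so that the wrapper itself does not depend on code not shown here. It also makes one extra call with an empty end delimiter, which Extractor2 appears to allow through its `strb != ""` check, so that the last header gets a column as well.

```r
# Hypothetical wrapper (not from the original post): start from a single
# string, seed a one-row data frame, then apply an extractor header by header.
string_to_frame <- function(input_string, delims, extractor) {
  delims <- as.character(delims)
  # The seed holds only the original report text
  out <- data.frame(OGDReportWhole = input_string, stringsAsFactors = FALSE)
  for (i in seq_along(delims)) {
    # The last header has no successor, so use "" as the end delimiter
    end <- if (i < length(delims)) delims[i + 1] else ""
    out <- extractor(out, "OGDReportWhole", delims[i], end, delims[i])
  }
  out
}

# Intended use, assuming Extractor2 and ColumnCleanUp are available:
# Mydata <- string_to_frame(inputString, EndoscTree, Extractor2)
```

To process many reports, the wrapper could be applied to each string and the resulting one-row data frames combined with `rbind`.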