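I have a function that takes some free text and splits it into columns based on a list of words. It works, but I have been told it would be better if it were vectorized.
The function is called Extractor: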
Extractor <- function(x, y, stra, strb, t) {
  x <- data.frame(x)
  t <- gsub("[^[:alnum:],]", " ", t)
  t <- gsub(" ", "", t, fixed = TRUE)
  x[, t] <-
    stringr::str_extract(x[, y],
                         stringr::regex(paste(stra, "(.*)", strb, sep = ""),
                                        dotall = TRUE))
  x[, t] <- gsub("\\\\.*", "", x[, t])
  names(x[, t]) <- gsub(".", "", names(x[, t]), fixed = TRUE)
  x[, t] <- gsub(" ", "", x[, t])
  x[, t] <- gsub(stra, "", x[, t], fixed = TRUE)
  if (strb != "") {
    x[, t] <- gsub(strb, "", x[, t], fixed = TRUE)
  }
  x[, t] <- gsub(" ", "", x[, t])
  x[, t] <- ColumnCleanUp(x, t)
  return(x)
}
ColumnCleanUp <- function(x, y) {
  x <- (data.frame(x))
  x[, y] <- gsub("^\\.\n", "", x[, y])
  x[, y] <- gsub("^:", "", x[, y])
  x[, y] <- gsub(".", "\n", x[, y], fixed = TRUE)
  x[, y] <- gsub("\\s{5}", "", x[, y])
  x[, y] <- gsub("^\\.", "", x[, y])
  x[, y] <- gsub("$\\.", "", x[, y])
  return(x[, y])
}
I use it as follows:
HistolTree<-list("Hospital Number","Patient Name","DOB:","General Practitioner:",
"Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:","")
for(i in 1:(length(HistolTree)-1)) {
Mypath<-Extractor(Mypath,"PathReportWhole",as.character(HistolTree[i]),
as.character(HistolTree[i+1]),as.character(HistolTree[i]))
}
The sample input text is:
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood
DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<-"PathReportWhole"
The expected output is:
structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"),
HospitalNumber = " 233456 ", PatientName = " Jonny Begood",
DOB = " 13/01/77 ", GeneralPractitioner = NA_character_,
Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ",
Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ",
Histology = " These show chronic reflux and other bits n bobs\n ",
Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName",
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails",
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")
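Basically, I call the function repeatedly in a loop (there is only one example row here, but the real data frame has more than 2000 rows). Is apply() a way of applying the function in a vectorized manner? If not, could I have a pointer on how to vectorize this so that I can avoid the loop? I understand that vectorizing a function means applying it to a vector as a whole rather than looping, and that I need to convert the input list into a character vector, but I am stuck from there.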
Answer 0 (score: 0)
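Rather than vectorizing your existing function, I have tried to simplify your various regular expressions somewhat. I can see what you are doing: you have a data.frame with raw pathology data that looks messy, such as: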
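Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely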
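Your approach is a good one: use the headers ("Hospital Number", "Patient Name:", ...) to extract the data ("233456", "Jonny Begood", ...). However, I think there is a simpler way to do it with regular expressions, which is to use the headers as lookbehind and lookahead markers. In the string above, the data for Hospital Number is everything between "Hospital Number" and "Patient Name:", with the surrounding whitespace removed, i.e. "233456". The same principle can be applied to extract each subsequent piece of data. A few lines of code will then put the different pieces into their own columns in the data.frame.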
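As a minimal sketch of the lookbehind/lookahead idea on a single string, the snippet below uses base R only (regexpr and regmatches with perl = TRUE, standing in for stringr so that no packages are needed). The shortened report string and the variable names are made up for illustration.

```r
# Shortened, made-up version of the report text
report <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77"

# (?<=...) is a lookbehind and (?=...) a lookahead: both must match,
# but neither is included in the extracted text.
pattern <- "(?<=Hospital Number).*(?=Patient Name:)"

m <- regexpr(pattern, report, perl = TRUE)   # position of the match
raw_value <- regmatches(report, m)           # text between the two headers
hospital_number <- trimws(raw_value)         # drop surrounding whitespace
print(hospital_number)
```

The headers themselves are not part of the match, so nothing has to be stripped out afterwards apart from whitespace.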
Here is the code to create the test data.frame:
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<-"PathReportWhole"
Then we create a character vector of the headers:
x <- c("Hospital Number", "Patient Name:", "DOB:", "General Practitioner:", "Date of Procedure:", "Clinical Details:", "Macroscopic description:", "Histology:", "Diagnosis:")
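Note that these must exactly match the headers contained in the data, including case and punctuation. Also, we do not need the empty string as the last entry, unlike in your list above.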
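Because the match has to be exact, it can help to check up front which headers actually occur in the text. The following is a rough sketch in base R; the shortened report string and the deliberately mis-cased "Date of procedure:" entry are made up for illustration.

```r
# Made-up, shortened report text
report <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 Date of Procedure: 13/01/99"

# The last header is deliberately wrong (lower-case "p") to show a mismatch
headers <- c("Hospital Number", "Patient Name:", "DOB:", "Date of procedure:")

# fixed = TRUE compares the header literally rather than as a regex
present <- vapply(headers, function(h) grepl(h, report, fixed = TRUE), logical(1))
missing_headers <- headers[!present]
print(missing_headers)
```

Any header reported as missing would produce an NA column in the extraction step, so it is worth fixing the spelling before running it.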
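We can then write a function that takes as arguments the data.frame (df), the name of the data.frame column containing the raw data (colName, to keep the function as general as possible), and the vector of headers (headers).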
library(dplyr)    # glimpse(), used to inspect the result
library(stringr)  # str_extract(), str_replace_all(), str_trim(), and %>%

extractPath <- function(df, colName, headers) {
  # df: data.frame containing raw path data
  # colName: name of column containing data
  # headers: character vector of headers (delimiters in raw path data)
  for (i in seq_along(headers)) {
    # left delimiter
    delimLeft <- headers[i]
    if (i < length(headers)) {
      # right delimiter, only needed when not at the end of headers
      delimRight <- headers[i + 1]
      # regex to match everything between delimiting headers
      regex <- paste0("(?<=", delimLeft, ").*(?=", delimRight, ")")
    } else {
      # regex to match everything to right of final delimiting header
      regex <- paste0("(?<=", delimLeft, ").*$")
    }
    # generate column name for new column
    # use alpha characters only (i.e. ignore colon), and remove spaces
    columnName <- str_extract(delimLeft, "[[:alpha:] ]*") %>% str_replace_all(" ", "")
    # create new column of data, and trim whitespace
    df[[columnName]] <- str_extract(df[[colName]], regex) %>% str_trim()
  }
  # return output data.frame
  df
}
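I am using the tidyverse package ecosystem here, namely dplyr and stringr. The function loops over each header, generates the appropriate regular expression, and then applies it to the whole column to create a new column of data.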
Call the function like this:
out <- extractPath(Mypath, "PathReportWhole", x)
Here is the output for the single-row test data.frame:
> glimpse(out)
Observations: 1
Variables: 10
$ PathReportWhole <fctr> Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and re...
$ HospitalNumber <chr> "233456"
$ PatientName <chr> "Jonny Begood"
$ DOB <chr> "13/01/77"
$ GeneralPractitioner <chr> "Dr De'ath"
$ DateofProcedure <chr> "13/01/99"
$ ClinicalDetails <chr> "Dyaphagia and reflux"
$ Macroscopicdescription <chr> "3 pieces of oesophagus, all good biopsies."
$ Histology <chr> "These show chronic reflux and other bits n bobs."
$ Diagnosis <chr> "Acid reflux likely"
(You may want to tidy the data further, convert the character dates to Date values, and so on.)
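For the dates, one possible tidy-up is sketched below with base R's as.Date. The small data.frame is a made-up stand-in for two of the extracted columns. Note the assumption about two-digit years: with %y, R reads 00 to 68 as 20xx and 69 to 99 as 19xx, so a date of birth such as "13/01/50" would come out as 2050 and need separate handling.

```r
# Made-up stand-in for two of the extracted character columns
out <- data.frame(DOB = "13/01/77",
                  DateofProcedure = "13/01/99",
                  stringsAsFactors = FALSE)

# Day/month/two-digit-year strings to Date objects
out$DOB <- as.Date(out$DOB, format = "%d/%m/%y")
out$DateofProcedure <- as.Date(out$DateofProcedure, format = "%d/%m/%y")

str(out)
```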
I also tested it with a data.frame of several thousand rows, and it ran in about a second.
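The reason this scales is that the loop runs over the handful of headers, not over the rows: each extraction call processes the whole column at once. The sketch below illustrates that with base R only (regexpr and regmatches with perl = TRUE instead of stringr). extract_between is a hypothetical helper, not part of the function above, and the 2000-row data is just a shortened sample row repeated.

```r
# Hypothetical helper: text between `left` and `right` for every element
# of the character vector `text`, with surrounding whitespace trimmed.
extract_between <- function(text, left, right) {
  pattern <- paste0("(?<=", left, ").*(?=", right, ")")
  m <- regexpr(pattern, text, perl = TRUE)   # one match position per element
  out <- rep(NA_character_, length(text))    # NA where the headers are absent
  hit <- !is.na(m) & m != -1
  out[hit] <- regmatches(text, m)            # regmatches returns matches only
  trimws(out)
}

# Made-up test data: one shortened report repeated 2000 times
one_row <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77"
big <- data.frame(PathReportWhole = rep(one_row, 2000),
                  stringsAsFactors = FALSE)

# A single call handles all 2000 rows; no explicit loop over rows is needed
timing <- system.time(
  big$PatientName <- extract_between(big$PathReportWhole, "Patient Name:", "DOB:")
)
print(timing[["elapsed"]])  # elapsed seconds; varies by machine
```

Rows where either header is missing come back as NA rather than being dropped, so the new column always lines up with the original rows.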