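I'm using the following function, which works perfectly, to parse text data and find the percentage of arterial stenosis in patient records.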
library(dplyr)  # provides tibble(), %>% and pull(); tidytext is called via ::

txt <- "Small caliber RCA with 50% proximal and 70% mid stenoses."
coronary_anatomy <- function(x) {
  # Check if sentence
  if (!is.character(x)) {stop("Requires character string", call. = FALSE)}
  # Establish variables
  epicardial <- c("LM", "LAD", "LCX", "RCA")
  mods <- c("proximal", "mid", "distal", "ostial")
  # Split the sentence into word tokens, keeping the original case
  sentence <-
    tibble(line = 1, sentence = x) %>%
    tidytext::unnest_tokens(input = sentence, output = word, to_lower = FALSE) %>%
    pull(word)
  # Identify number/locations of disease
  artery <- sentence[which(sentence %in% epicardial)]
  locs <- grep("\\d+", sentence)
  mlocs <- which(sentence %in% mods)
  # Find the nearest neighbors to identify which modifier goes with which location
  space <- combn(mlocs, length(locs))
  dist <- apply(space, 2, function(x) {sum(abs(locs - x))})
  matched <- space[, which.min(dist)]
  tbl <-
    tibble(
      anatomy = paste(sentence[matched], artery),
      stenosis = as.numeric(sentence[locs])
    )
  # Return
  return(tbl)
}
# Test it out
coronary_anatomy(txt)
Output:
# A tibble: 2 x 2
  anatomy      stenosis
  <chr>           <dbl>
1 proximal RCA       50
2 mid RCA            70
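To make the nearest-neighbor step concrete, here is a rough base R sketch of just that part in isolation. It assumes the token positions I get for the example sentence ("50" at 5, "proximal" at 6, "70" at 8, "mid" at 9); the extra modifier at position 2 is made up purely so that combn() has more than one candidate to choose from.

```r
# Token positions of the percentages in the example sentence.
locs <- c(5, 8)
# Token positions of the modifiers; position 2 is a made-up extra entry.
mlocs <- c(2, 6, 9)

# Every way to pick length(locs) modifier positions, one candidate per column.
space <- combn(mlocs, length(locs))
# Total distance between each candidate and the percentage positions.
dist <- apply(space, 2, function(m) sum(abs(locs - m)))
# The candidate with the smallest total distance is taken as the match.
matched <- space[, which.min(dist)]
print(matched)
```

Note that combn() fails when there are fewer modifiers than percentages, so a record like that would need separate handling.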
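The code works well. But now I've run into a problem applying it at scale. I want to apply this code to a data frame that has an entire column of patient records. A simplified version of the data frame I want to run the function on looks like this.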
# A tibble: 2 x 2
  PatientID Records
  <chr>     <chr>
1 1234      Small caliber RCA with 50% proximal and 70% mid stenoses
2 1235      Small caliber LCX with 40% proximal and 70% mid stenoses
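Now here's the problem. I'd like to somehow run this function over the entire Records column. However, running the function (as shown above) returns a tibble whose size depends on how much information can be parsed.
Does anyone smarter than me have an idea how to run this function on every cell in the column of the data frame that holds the records, and return the results in an organized way (given that each result is a tibble)?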
Answer 0 (score: 0)
If speed isn't an issue, you can use lapply or purrr::map (or even a for loop) to go over each row of your data, save each tibble result in a list, and then combine the list of tibbles into one nice big tibble to work with. For example,

# dplyr and lapply
result_list = lapply(your_data$Records, coronary_anatomy)
names(result_list) = your_data$PatientID
result_tbl = bind_rows(result_list, .id = "PatientID")
result_tbl
# # A tibble: 4 x 3
#   PatientID anatomy      stenosis
#   <chr>     <chr>           <dbl>
# 1 1234      proximal RCA       50
# 2 1234      mid RCA            70
# 3 1235      proximal LCX       40
# 4 1235      mid LCX            70

If you're using dplyr version 1.0 or higher, you can also do this with just group_by and summarize.
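A minimal sketch of what the group_by route might look like, offered as an assumption about the approach rather than as tested production code. To keep it runnable on its own, it uses a hypothetical stand-in parser, parse_record(), that only extracts the percentages; in practice you would pass coronary_anatomy from the question instead. It also assumes one row per PatientID. Since dplyr 1.1.0, returning several rows per group from summarize is deprecated in favor of reframe, so the sketch picks whichever fits the installed version.

```r
library(dplyr)

your_data <- tibble(
  PatientID = c("1234", "1235"),
  Records = c(
    "Small caliber RCA with 50% proximal and 70% mid stenoses",
    "Small caliber LCX with 40% proximal and 70% mid stenoses"
  )
)

# Hypothetical stand-in for coronary_anatomy(): returns one row per
# percentage found in a single record string.
parse_record <- function(x) {
  pct <- regmatches(x, gregexpr("[0-9]+(?=%)", x, perl = TRUE))[[1]]
  tibble(stenosis = as.numeric(pct))
}

grouped <- group_by(your_data, PatientID)

if (utils::packageVersion("dplyr") >= "1.1.0") {
  # reframe() allows any number of rows per group and returns ungrouped data.
  result_tbl <- reframe(grouped, parse_record(Records))
} else {
  # dplyr 1.0.x: summarize() unpacks an unnamed data frame result.
  result_tbl <- summarize(grouped, parse_record(Records), .groups = "drop")
}

print(result_tbl)
```

Each group's tibble is unpacked into columns, and the PatientID key is repeated for every row that the parser returns.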