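I am trying to find and replace some text based on fuzzy matching, as shown below.
Goal
I want to do this for a whole list of find-and-replace terms, but I don't know how to extend my current function to allow that.
Input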
# Input text
df <- data.frame(textcol = c("In this substring would like to find the radiofrequency ablation of this HALO",
                             "I like to do endoscopic submuocsal resection and also radifrequency ablation",
                             "No match here",
                             "No mention of this radifreq7uency ablati0on thing"))
Attempt
##### Lower case the text ##########
df$textcol <- tolower(df$textcol)

# Define the pattern to match and what to replace it with
findAndReplace <- function(matchPattern, rawText, replace) {
  # Locate approximate matches and extract the matched substrings
  positions <- aregexec(matchPattern, rawText, max.distance = 0.1)
  res <- regmatches(rawText, positions)
  res[lengths(res) == 0] <- "XXXX"  # deal with 0 length matches somehow

  #################### Term mapping ####################
  # Replace each matched substring with the replacement term
  Vectorize(gsub)(unlist(res), replace, rawText)
}

matchPatternRFA <- c("radiofrequency ablation")
repRF <- findAndReplace(matchPatternRFA, df$textcol, "RFA")
repRF
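Problem
The above works well for replacing a single term, but what if I also want to replace "endoscopic submucosal resection" with "EMR" and "HALO" with "catheter"?
Ideally I would like to create a list of terms to match, but how would I then specify what each one should be replaced with?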
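For reference, here is a minimal sketch of what `aregexec` and `regmatches` return for a matching and a non-matching string, which is why the attempt above has to deal with zero-length results. The two strings in `txt` are made up for illustration.

```r
# Two illustrative strings (hypothetical): one with a misspelling, one unrelated.
txt <- c("also radifrequency ablation", "no match here")

# max.distance = 0.1 allows edits up to 10% of the pattern length (rounded up).
m <- aregexec("radiofrequency ablation", txt, max.distance = 0.1)
hits <- regmatches(txt, m)

# Number of extracted substrings per input string; elements without a
# match come back as character(0), i.e. length 0.
print(lengths(hits))
```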
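Answer 0: (score: 1)
Define asub to replace approximate matches with a replacement string, and define a list L whose names are the patterns and whose values are the corresponding replacements. Then run Reduce over the names of L to apply the replacements one after another.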
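asub <- function(pattern, replacement, x, fixed = FALSE, ...) {
  # Locate the approximate matches; extra arguments (e.g. max.distance) go to aregexec()
  m <- aregexec(pattern, x, fixed = fixed, ...)
  r <- regmatches(x, m)
  lens <- lengths(r)  # 0 for elements without a match
  # Substitute the matched text only in the elements that matched
  if (all(lens == 0)) return(x) else
    replace(x, lens > 0, mapply(sub, r[lens > 0], replacement, x[lens > 0]))
}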
L <- list("radiofrequency ablation" = "RFA",
          "endoscopic submucosal resection" = "EMR",
          "HALO" = "catheter")
Reduce(function(x, nm) asub(nm, L[[nm]], x), init = df$textcol, names(L))
giving (for the original df$textcol, before lower-casing):
[1] "In this substring would like to find the RFA of this catheter"
[2] "I like to do EMR and also RFA"
[3] "No match here"
[4] "No mention of this RFA thing"
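To make the role of Reduce clearer, here is a minimal sketch of the same folding pattern in isolation. It is an illustration only: it swaps the approximate matcher for exact `gsub(..., fixed = TRUE)` so that the behaviour is easy to follow, and `lookup`, `txt`, and `steps` are hypothetical names that do not appear in the code above.

```r
# Hypothetical lookup list: names are the patterns, values are the replacements.
lookup <- list("radiofrequency ablation" = "RFA", "HALO" = "catheter")
txt <- c("radiofrequency ablation of this HALO", "No match here")

# Reduce() feeds the text vector through one replacement per name.
# accumulate = TRUE keeps the intermediate vector after each step.
steps <- Reduce(
  function(x, nm) gsub(nm, lookup[[nm]], x, fixed = TRUE),
  names(lookup),
  init = txt,
  accumulate = TRUE
)

# The last element holds the text after all replacements have been applied.
result <- steps[[length(steps)]]
print(result)
```

Each step receives the output of the previous one, so the order of the names matters whenever one replacement could create or destroy a match for a later pattern.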
Answer 1: (score: 0)
You can create a lookup table holding the patterns and their replacements:
library(data.table)

dt <-
  data.table(
    textcol = c(
      "In this substring would like to find the radiofrequency ablation of this HALO",
      "I like to do endoscopic submuocsal resection and also radifrequency ablation",
      "No match here",
      "No mention of this radifreq7uency ablati0on thing"
    )
  )
# Lookup table: textcol holds the patterns, textcol2 the replacements
dt_gsub <- data.table(
  textcol = c("submucosal resection",
              "HALO",
              "radiofrequency ablation"),
  textcol2 = c("EMR", "catheter", "RFA")
)
# gsub() is vectorised over the text column, so only the lookup rows need a loop.
# Note: gsub() matches exactly, so misspelled variants are not replaced.
for (k in seq_len(nrow(dt_gsub)))
  dt[, textcol := gsub(dt_gsub$textcol[k], dt_gsub$textcol2[k], textcol)]
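Because gsub matches exactly, the lookup-table loop leaves the misspelled terms untouched. As a rough sketch, the same lookup idea could be combined with approximate matching in base R. `fuzzy_sub` is a hypothetical helper written for this illustration (it is not part of base R or data.table), and the 0.1 distance is simply the value used in the question.

```r
# fuzzy_sub() is a hypothetical helper, not part of base R or data.table.
# It replaces the first approximate match of `pattern` in each element of `x`.
fuzzy_sub <- function(pattern, replacement, x, max.distance = 0.1) {
  m <- aregexec(pattern, x, max.distance = max.distance)
  for (i in seq_along(x)) {
    start <- m[[i]][1]                          # -1 when there is no match
    if (start > 0) {
      len <- attr(m[[i]], "match.length")[1]    # length of the matched text
      x[i] <- paste0(substr(x[i], 1, start - 1),
                     replacement,
                     substr(x[i], start + len, nchar(x[i])))
    }
  }
  x
}

# Lookup table of patterns and replacements (same terms as above).
lookup <- data.frame(
  pattern = c("submucosal resection", "HALO", "radiofrequency ablation"),
  replacement = c("EMR", "catheter", "RFA"),
  stringsAsFactors = FALSE
)

texts <- c(
  "In this substring would like to find the radiofrequency ablation of this HALO",
  "I like to do endoscopic submuocsal resection and also radifrequency ablation",
  "No match here",
  "No mention of this radifreq7uency ablati0on thing"
)

# Apply every lookup row in turn to the whole text vector.
for (k in seq_len(nrow(lookup)))
  texts <- fuzzy_sub(lookup$pattern[k], lookup$replacement[k], texts)

print(texts)
```

The exact boundaries of an approximate match are chosen by the matcher, so it is worth checking the output on real data before relying on it; short patterns such as "HALO" are the most likely to match something unintended.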