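Any help with my question would be greatly appreciated, thank you.
I have a data frame in which the second column holds "selected" words that were extracted (in an earlier step) from the first column. These words are often, but not always, in a different order from the one they had in the original string. I now need the words in wordsDF$subbed to follow the same order in which they appear in wordsDF$original.
I have posted a small subset, with a fourth column (wordsDF$target) that I filled in by hand to show what I am aiming for.
I am trying to use sapply() to create a third column (wordsDF$reord) that contains the words of wordsDF$subbed in the order in which they occur in wordsDF$original. I am confused about how to pass the sapply function along all the words of each wordsDF$original string, given that the strings have different lengths (i.e. different numbers of words). The only approach I can think of is to use the stringr function str_detect to check, from left to right, whether each word of wordsDF$original is present in wordsDF$subbed. If yes, the word is extracted into wordsDF$reord (pasted onto whatever has already been extracted); if no, wordsDF$reord stays unchanged.
My attempt is below, but it only checks and extracts the first word, which is hardcoded. Can anyone show me how to apply the function along each string? Or is there a better way to reorder wordsDF$subbed that does not need wordsDF$reord at all?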
library(stringr)
original = c("heat pump only for 100/150l geyser r410a gas",
"alliance allwh 5_dcpt_0kw heat pump only for 200/25",
"alliance allwinteg 190l integrated heat pump and cylinder r134a gas",
"aquatouch bt10 cp bottle trap 32x40",
"aquatouch pop32lux cp slotted pop up basin waste 32mm",
"aquatouch ci15 cp angle regulating valve only 15x15")
subbed = c("heat pump",
"heat pump",
"and cylinder heat pump",
"bottle trap",
"basin pop up waste",
"valve")
# Build the data frame directly, keeping both columns as character (not factor)
wordsDF = data.frame(original, subbed, stringsAsFactors = FALSE)
wordsDF$reord = character(nrow(wordsDF))
wordsDF$target = c("heat pump","heat pump",
"heat pump and cylinder",
"bottle trap","pop up basin waste",
"valve")
# my attempted solution...
wordsDF$reord = sapply(wordsDF$original, function(x) ifelse(
test = str_detect(wordsDF$subbed, word(wordsDF$original, 1,1)),
yes = paste(wordsDF$reord, str_extract(wordsDF$subbed, word(wordsDF$original, 1,1))),
no = wordsDF$reord))
Thanks in advance!
Answer 0 (score: 2)
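Here is a possible base R solution. It runs mapply over the two split vectors (the words of subbed and the words of original) and returns the matched words in the order in which they occur in original, collapsed back into a single string with paste: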
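# Helper: x = words of subbed, y = words of original.
# match() finds each subbed word's position in original; sort() restores the original order.
Rematch <- function(x, y) paste(y[sort(match(x, y))], collapse = " ")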
mapply(Rematch, strsplit(subbed, "\\s+"), strsplit(original, "\\s+"))
# [1] "heat pump" "heat pump" "heat pump and cylinder" "bottle trap" "pop up basin waste"
# [6] "valve"
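To connect this back to the data frame in the question, here is a minimal self-contained sketch that stores the result in wordsDF$reord and compares it with the hand-made wordsDF$target column. It rebuilds only three rows of the sample data so that it runs on its own. The second helper, keep_in_order, is a hypothetical name introduced here and is not part of the answer above; it is a variant that filters the words of original with %in%, which is close to the "check each word from left to right" idea in the question. Note one difference, assuming repeated words matter for your data: match() returns only the first position of each word, whereas %in% keeps every occurrence found in original.

```r
# Three rows taken from the question's sample data
original <- c("heat pump only for 100/150l geyser r410a gas",
              "alliance allwinteg 190l integrated heat pump and cylinder r134a gas",
              "aquatouch pop32lux cp slotted pop up basin waste 32mm")
subbed <- c("heat pump", "and cylinder heat pump", "basin pop up waste")
target <- c("heat pump", "heat pump and cylinder", "pop up basin waste")
wordsDF <- data.frame(original, subbed, target, stringsAsFactors = FALSE)

# Helper from the answer: x = words of subbed, y = words of original
Rematch <- function(x, y) paste(y[sort(match(x, y))], collapse = " ")

# Hypothetical variant: walk along original and keep the words that occur in subbed
keep_in_order <- function(x, y) paste(y[y %in% x], collapse = " ")

sub_words  <- strsplit(wordsDF$subbed, "\\s+")
orig_words <- strsplit(wordsDF$original, "\\s+")

# mapply pairs up the i-th element of each list, so rows of any length work
wordsDF$reord  <- mapply(Rematch, sub_words, orig_words)
wordsDF$reord2 <- mapply(keep_in_order, sub_words, orig_words)

# Check both results against the hand-made target column
identical(wordsDF$reord, wordsDF$target)
identical(wordsDF$reord2, wordsDF$target)
```

Because mapply walks the two lists in parallel, there is no need to loop over the words of each string by hand, and the varying number of words per row is handled by strsplit.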