R: extract a pattern from strings, vectorised within a data frame

Asked: 2015-11-26 09:28:41

Tags: r string replace

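I have a data frame with a 'description' column that I need to split into 'product' and 'facetblock' columns. I have mocked up some dummy data to illustrate the task. I have already managed to generate the 'product' column (through a process using RWeka::NGramTokenizer).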

The data looks like this:

my_dataframe = data.frame(description = c("ford fiesta blue fast","red toyota japanese very fast","rolls royce phantom black",
                                      "yellow beach buggie with spare wheel","harrier jump jet vertical take off",
                                      "american jeep with seat belt","suzuki motorbike with built in fridge"),
                      product = c("fiesta","red toyota","rolls royce","beach buggie","jump jet","american jeep","motorbike"))

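I am stuck on the next step and would really appreciate any help. I am trying to pull the string in 'product' out of its position in 'description', leaving the remaining words. For the avoidance of doubt, my target my_dataframe$facetblock column would look like this: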

my_dataframe$facetblock = c("ford blue fast", "japanese very fast", "phantom black", "yellow with spare wheel", "harrier vertical take off", "with seat belt", "suzuki with built in fridge")

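I have tried some out-of-the-box approaches from base R and the stringr, stringi and qdap packages (grep, str_extract, stri_extract, mgsub), without success. I have also tried writing my own sapply functions, but no luck yet: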

my_dataframe$facetblock = sapply(mydata, function(x) str_extract(mydata$description[x], mydata$product[x]))

my_dataframe$facetblock = sapply(mydata$description, function(x) grep(mydata$product[x], mydata$description[x], value = TRUE, invert = TRUE))

Does anyone have a solution they could share? Thanks in advance.

1 Answer:

Answer 0 (score: 2)

You can vectorise this using stringi::stri_replace_first_fixed (you can also switch to the last/all variants):

library(stringi) 
with(my_dataframe, stri_replace_first_fixed(description, product, ""))
# [1] "ford  blue fast"              " japanese very fast"          " phantom black"               "yellow  with spare wheel"    
# [5] "harrier  vertical take off"   " with seat belt"              "suzuki  with built in fridge"
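As a small illustration of the first/last/all variants mentioned above, here is a sketch on a made-up string (the input "red car red" is hypothetical and not part of the question's data). It only shows which occurrence each function removes.

```r
library(stringi)

# Hypothetical input in which the pattern occurs twice
x <- "red car red"

first_removed <- stri_replace_first_fixed(x, "red", "")  # drops only the first match
last_removed  <- stri_replace_last_fixed(x, "red", "")   # drops only the last match
all_removed   <- stri_replace_all_fixed(x, "red", "")    # drops every match

first_removed
# [1] " car red"
last_removed
# [1] "red car "
all_removed
# [1] " car "
```

All three treat the pattern as a literal string rather than a regular expression, and all three leave the surrounding spaces in place.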

If you don't like the leading whitespace, you can wrap the call in the new trimws function or in stringi::stri_trim (per @akrun's comment).
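Trimming only removes whitespace at the ends, so the doubled spaces left in the middle of strings such as "ford  blue fast" would remain. A minimal sketch of one way to reach the exact target column from the question is shown here; collapsing the inner whitespace with stri_replace_all_regex is an addition not taken from the answer, and stringsAsFactors = FALSE is set only so the columns are plain character vectors regardless of R version.

```r
library(stringi)

my_dataframe <- data.frame(
  description = c("ford fiesta blue fast", "red toyota japanese very fast",
                  "rolls royce phantom black", "yellow beach buggie with spare wheel",
                  "harrier jump jet vertical take off", "american jeep with seat belt",
                  "suzuki motorbike with built in fridge"),
  product = c("fiesta", "red toyota", "rolls royce", "beach buggie",
              "jump jet", "american jeep", "motorbike"),
  stringsAsFactors = FALSE
)

# Remove the first literal occurrence of each product from its own description
stripped <- with(my_dataframe, stri_replace_first_fixed(description, product, ""))

# Collapse any run of whitespace into a single space, then trim both ends
stripped <- stri_replace_all_regex(stripped, "\\s+", " ")
my_dataframe$facetblock <- stri_trim_both(stripped)

my_dataframe$facetblock
```

With the dummy data above this should reproduce the facetblock values listed in the question.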