How to isolate the word next to a specified word

Time: 2017-08-17 04:23:45

Tags: r string stringr

My data frame contains a variety of strings. See the sample df:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

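I want to isolate the first word and the second-to-last word of each sentence. The second-to-last word will always come right before "payment".

This is what I want the df to look like: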

strings <- c("Average complications and higher payment",
        "Average complications and average payment",
        "Average complications and lower payment",
        "Average mortality and higher payment",
        "Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)

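The case of the resulting strings does not matter.

I was able to write code that gets the first word of a sentence (by splitting on spaces), but I can't figure out how to pull the word to the left (or to the right, for that matter) of a reference word, "payment" in this case.

3 Answers:

Answer 0: (score: 1)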

df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)

df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average

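Explanation of the regex terms:

  • (\\w+) = match one or more word characters and capture them as a group
  • .*? = match anything, non-greedily
  • " payment" = match a space, then the characters payment
  • $ = match the end of the string
  • \\1 = replace the whole match with the contents of the first group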
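The question also asks about the word to the right of a reference word. As a minimal sketch built on the same capture-group idea, the two helpers below return the word before or after a reference word. The names word_before and word_after are hypothetical, not part of the answer, and the sketch assumes the reference word contains only word characters, since it is pasted into the pattern unescaped.

```r
# Hypothetical helpers (not from the original answer).
# Assumption: `ref` contains only word characters, so it needs no regex escaping.

# Word immediately to the left of the first occurrence of `ref`
word_before <- function(x, ref) {
  pattern <- paste0("^.*?\\b(\\w+)\\s+", ref, "\\b.*$")
  hit <- grepl(pattern, x, perl = TRUE)
  # sub() leaves non-matching strings unchanged, so mask them with NA
  ifelse(hit, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
}

# Word immediately to the right of the first occurrence of `ref`
word_after <- function(x, ref) {
  pattern <- paste0("^.*?\\b", ref, "\\s+(\\w+)\\b.*$")
  hit <- grepl(pattern, x, perl = TRUE)
  ifelse(hit, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
}

x <- c("Average complications and higher payment",
       "Better mortality and average payment",
       "no match here")

word_before(x, "payment")
#> [1] "higher"  "average" NA
word_after(x, "and")
#> [1] "higher"  "average" NA
```

Matching is case-sensitive here; passing a lower-cased copy of the strings (tolower(x)) is one way to relax that.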

Answer 1: (score: 1)

We can use extract from tidyr:

library(tidyverse)
df %>%
   extract(strings, into = c("QualityWord", "PaymentWord"),
           "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                   strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average
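For readers without tidyr, the same two-group pattern can be tried in base R. This is only a sketch of an alternative, not the answerer's method: it uses regexec and regmatches, and it assumes every string matches the pattern (a non-matching string would yield an empty result and break the rbind step).

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")

# Group 1: first word. Group 2: the word just before the final word.
pattern <- "^(\\w+).*\\b(\\w+)\\s+\\w+$"

# One character vector per string: full match, group 1, group 2
matches <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# Stack into a matrix (assumes every string matched)
parts <- do.call(rbind, matches)

result <- data.frame(strings,
                     QualityWord = parts[, 2],
                     PaymentWord = parts[, 3],
                     stringsAsFactors = FALSE)
result
```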

Answer 2: (score: 0)

Using the strsplit, head and tail functions:

outDF = do.call(rbind, lapply(df$strings, function(x) {

  # split the string into words at each space
  strObj = unlist(strsplit(x, split = " "))

  # one-row data frame: first word and second-to-last word
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)

}))

outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

Using dplyr and a custom function:

library(dplyr)

customFn = function(x) {
  # split the string into words at each space
  strObj = unlist(strsplit(x, split = " "))
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
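Both versions above build a one-row data frame per string and then combine them. A possible variation, sketched here in base R rather than taken from the answer, splits all strings once and picks the words by position. It assumes every string has at least two words.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")

# Split every string on single spaces in one call
words <- strsplit(strings, " ", fixed = TRUE)

# First word and second-to-last word of each string
QualityWord <- vapply(words, function(w) w[1], character(1))
PaymentWord <- vapply(words, function(w) w[length(w) - 1], character(1))

outDF <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)
outDF
```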