Question

My data frame contains a variety of strings. See the sample df:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

I would like to isolate the first word and the second-to-last word of each sentence. The second-to-last word will always come right before "payment."

This is what my desired df would look like:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)

The resulting strings do not need to be case sensitive.

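I was able to write code to get the first word of each sentence (by splitting on a space), but I cannot figure out how to pull the word to the left (or right, for that matter) of a reference word, "payment" in this case.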

Answer 1

df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)

df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average

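Explanation of the regex terms:

(\\w+) = match one or more word characters, captured as a group
.*? = match anything, non-greedily
" payment" = match a space followed by the characters payment
$ = match the end of the string
\\1 = replace the whole match with the contents of the first group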
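
The same capture-group idea extends to any reference word, not only "payment". Below is a minimal base R sketch. The helper names word_before and word_after are hypothetical (they do not come from the answer above or from any package), and the sketch assumes that words are separated by whitespace and that the reference word contains no regex metacharacters.

```r
# Hypothetical helpers (illustrative names, not part of any package).
# Each returns the word adjacent to `ref`, or NA when `ref` is not found.
word_before <- function(x, ref) {
  # capture the word that precedes `ref`, separated by whitespace
  m <- regmatches(x, regexec(paste0("(\\w+)\\s+", ref, "\\b"), x, perl = TRUE))
  # element 1 is the full match, element 2 is the capture group
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

word_after <- function(x, ref) {
  # capture the word that follows `ref`, separated by whitespace
  m <- regmatches(x, regexec(paste0("\\b", ref, "\\s+(\\w+)"), x, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

strings <- c("Average complications and higher payment",
             "Better mortality and average payment")

word_before(strings, "payment")  # word to the left of "payment"
word_after(strings, "and")       # word to the right of "and"
```

Unlike sub, which returns the input unchanged when the pattern does not match, this sketch yields NA for strings that lack the reference word, which may be easier to spot in a data frame.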

Answer 2

We can use extract from tidyr:

library(tidyverse)
df %>%
   extract(strings, into = c("QualityWord", "PaymentWord"),
           "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                   strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

Answer 3

Using the strsplit, head and tail functions:

outDF = do.call(rbind, lapply(df$strings, function(x) {

  # split the string into words on single spaces
  strObj = unlist(strsplit(x, split = " "))

  # one-row data frame: first word and second-to-last word
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)

}))

outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

Or

Using dplyr and a custom function:

library(dplyr)

customFn = function(x) {
  strObj = unlist(strsplit(x, split = " "))
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
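
As a further variation on the strsplit approach, the two columns can be added directly to the existing data frame without building one small data frame per row. This is only a sketch in base R; it assumes every string contains at least two space-separated words, which holds for the sample data in the question.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# split every string on single spaces; the result is a list of character vectors
words <- strsplit(df$strings, " ", fixed = TRUE)

# first word of each string
df$QualityWord <- vapply(words, function(w) w[1], character(1))

# second-to-last word (assumes each string has at least two words)
df$PaymentWord <- vapply(words, function(w) w[length(w) - 1], character(1))

df
```

vapply is used here instead of sapply so that a string with fewer than two words raises an error rather than silently producing a list column.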

How to isolate the word next to a specified word
