My data frame contains a variety of strings. See the sample df:
strings <- c("Average complications and higher payment",
"Average complications and average payment",
"Average complications and lower payment",
"Average mortality and higher payment",
"Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)
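I would like to isolate the first word and the second-to-last word of each sentence. The second-to-last word will always come right before "payment".
This is what my desired df looks like: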
strings <- c("Average complications and higher payment",
"Average complications and average payment",
"Average complications and lower payment",
"Average mortality and higher payment",
"Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)
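The resulting strings do not need to be case sensitive.
I was able to write code that gets the first word of the sentence (by splitting on the space), but I can't figure out how to pull the word to the left (or to the right, for that matter) of a reference word, "payment" in this case.
Answer 0 (score: 1)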
df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)
df
#> strings QualityWord PaymentWord
#> 1 Average complications and higher payment Average higher
#> 2 Average complications and average payment Average average
#> 3 Average complications and lower payment Average lower
#> 4 Average mortality and higher payment Average higher
#> 5 Better mortality and average payment Better average
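The question also asks about pulling the word to the left or right of an arbitrary reference word. A minimal sketch of that, in base R, follows. The helper names first_group, word_before and word_after are hypothetical (they are not part of the answer above), and the sketch assumes the reference word is a plain word with no regex metacharacters. Unlike sub(), which returns the input unchanged when the pattern does not match, these helpers return NA.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")

# Hypothetical helper: first capture group of `pattern` in each element of x,
# or NA when the pattern does not match.
first_group <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Word immediately to the left of `ref` (assumes `ref` is a plain word).
word_before <- function(x, ref) {
  first_group(x, paste0("\\b(\\w+)\\s+", ref, "\\b"))
}

# Word immediately to the right of `ref` (assumes `ref` is a plain word).
word_after <- function(x, ref) {
  first_group(x, paste0("\\b", ref, "\\s+(\\w+)"))
}

payment_words <- word_before(strings, "payment")
after_and     <- word_after(strings, "and")
no_match      <- word_before("no reference word here", "payment")
```

Here word_before(strings, "payment") gives the same PaymentWord column as the sub() call above, and the NA for a non-matching string makes missing reference words easy to spot.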
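Regex terms explained:
(\\w+) = match a word character one or more times, captured as a group
.*? = match anything, non-greedily
" payment" = match a space, then the characters payment
$ = match the end of the string
\\1 = replace the pattern with the contents of the first group
Answer 1 (score: 1)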
We can use extract from tidyr:
library(tidyverse)
df %>%
  extract(strings, into = c("QualityWord", "PaymentWord"),
          "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
# strings QualityWord PaymentWord
#1 Average complications and higher payment Average higher
#2 Average complications and average payment Average average
#3 Average complications and lower payment Average lower
#4 Average mortality and higher payment Average higher
#5 Better mortality and average payment Better average
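The extracted payment words come out in lowercase, while the desired output in the question capitalizes them. The question says case does not matter, so this step is optional. If the capitalized form is wanted, one possible sketch uses sub() with perl = TRUE, where \\U in the replacement upper-cases the captured text. The function name capitalize_first is hypothetical.

```r
# Hypothetical helper: upper-case the first character of each string.
# With perl = TRUE, "\\U" in the replacement upper-cases what follows it.
capitalize_first <- function(x) {
  sub("^(\\w)", "\\U\\1", x, perl = TRUE)
}

capitalized <- capitalize_first(c("higher", "average", "lower"))
capitalized
#> [1] "Higher"  "Average" "Lower"
```

It can be applied to the extracted column afterwards, for example capitalize_first(df$PaymentWord).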
Answer 2 (score: 0)
Using the strsplit, head, and tail functions:
outDF = do.call(rbind, lapply(df$strings, function(x) {
  # split the string on spaces
  strObj = unlist(strsplit(x, split = " "))
  # one-row data frame: first word and second-to-last word
  data.frame(strings = x, QualityWord = head(strObj, 1), PaymentWord = head(tail(strObj, 2), 1), stringsAsFactors = FALSE)
}))
outDF
# strings QualityWord PaymentWord
#1 Average complications and higher payment Average higher
#2 Average complications and average payment Average average
#3 Average complications and lower payment Average lower
#4 Average mortality and higher payment Average higher
#5 Better mortality and average payment Better average
Or, using dplyr and a custom function:
library(dplyr)  # provides %>%
customFn = function(x) {
  strObj = unlist(strsplit(x, split = " "))
  # return the one-row data frame explicitly
  data.frame(strings = x, QualityWord = head(strObj, 1), PaymentWord = head(tail(strObj, 2), 1), stringsAsFactors = FALSE)
}
df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
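Both variants above build one small data frame per row and then bind them together. Since strsplit() is already vectorized over its input, a possibly simpler sketch splits all strings at once and picks the words by position. It assumes every string has at least two space-separated words; otherwise the second vapply() call will fail.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# Split every string on single spaces; the result is a list of word vectors.
words <- strsplit(df$strings, " ", fixed = TRUE)

# First word of each string.
df$QualityWord <- vapply(words, function(w) w[1], character(1))

# Second-to-last word of each string (assumes at least two words).
df$PaymentWord <- vapply(words, function(w) w[length(w) - 1], character(1))

df
```

This adds the two columns to df in place and avoids the repeated data.frame() and rbind() calls.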