Question

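I have a data.frame with a single column, "Term". Each entry is a string of several words: every term contains at least two words, with no upper limit.

From this column "Term", I want to extract the last word and store it in a new column "Last".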

# load library
library(dplyr)
library(stringi)

# read csv 
df <- read.csv("filename.txt", stringsAsFactors = FALSE)

# show df
head(df)

#              Term
# 1 this is for the
# 2   thank you for
# 3   the following
# 4   the fact that
# 5       the first

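I have written a function LastWord that works well when given a single string. However, when given a vector of strings, it still only uses the first string in the vector. This forces me to use mapply inside mutate to add the column, as shown below.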

LastWord <- function(InputWord) {
    stri_sub(InputWord,stri_locate_last(str=InputWord, fixed=" ")[1,1]+1, stri_length(InputWord))
}

df <- mutate(df, Last=mapply(LastWord, df$Term))

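Using mapply makes the process very slow. I typically need to process around 10 to 15 million rows or terms at a time, which takes several hours.

Can anyone suggest a way to write a LastWord function that works on a vector rather than on a single string?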

Answer 1

You can try:

df$LastWord <- gsub(".* ([^ ]+)$", "\\1", df$Term)
df
#              Term  LastWord
# 1 this is for the       the
# 2   thank you for       for
# 3   the following following
# 4   the fact that      that
# 5       the first     first

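In the gsub call, the leading ".* " consumes everything up to the last space, and the expression in parentheses matches anything that is not a space, at least once ([a-zA-Z]+ would also work in place of [^ ]+), at the end of the string ($). Because it sits between parentheses, that match is captured and can be referenced with \\1. So gsub keeps only what is inside the parentheses as the replacement.

Edit:
As @akrun mentioned in the comments, sub can also be used instead of gsub in this case.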
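
As a small check of the point about sub, here is a minimal sketch on made-up data (the `terms` vector is illustrative, not from the question). The pattern can match at most once per string, so sub and gsub should agree:

```r
# Illustrative sample data mirroring the question's Term column
terms <- c("this is for the", "thank you for", "the following")

pattern <- ".* ([^ ]+)$"

# Replace the whole match with the captured last word
last_gsub <- gsub(pattern, "\\1", terms)
last_sub  <- sub(pattern, "\\1", terms)

# Compare the two results and show one of them
identical(last_gsub, last_sub)
last_sub
```

Both functions are vectorized over their input, so no mapply is needed for either.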

Answer 2

To extract just the last word, you can directly use a vectorized function from stringi, which should be very fast:

library(stringi)
df$LastWord  <- stri_extract_last_words(df$Term)
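
For illustration, here is a rough sketch on a few made-up terms. It also shows one possible way to vectorize the question's own LastWord: the original picks `[1,1]`, i.e. only the first row of the location matrix, so taking the whole first column instead handles every string. The `terms` vector and the name `LastWordVec` are hypothetical, not part of the original post.

```r
library(stringi)

# Illustrative sample data (not from the original post)
terms <- c("this is for the", "thank you for", "the following")

# Vectorized built-in: one last word per element
stri_extract_last_words(terms)

# Hypothetical vectorized variant of the question's LastWord:
# use the whole "start" column of the location matrix, not just row 1
LastWordVec <- function(InputWord) {
  last_space <- stri_locate_last_fixed(InputWord, " ")[, 1]
  # 'to' defaults to the end of each string
  stri_sub(InputWord, last_space + 1L)
}

LastWordVec(terms)
```

For a string with no space at all, no position is found, so this variant would return NA for that element rather than the word itself.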

Now, if you want two new columns, one with all the words except the last and another with the last word, you can use a regular expression such as

stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")
#      [,1]              [,2]          [,3]       
# [1,] "this is for the" "this is for" "the"      
# [2,] "thank you for"   "thank you"   "for"      
# [3,] "the following"   "the"         "following"
# [4,] "the fact that"   "the fact"    "that"     
# [5,] "the first"       "the"         "first"

So what you want is

df[c("ExceptLast", "LastWord")] <-
    stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")[, 2:3]

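(Note that this won't work if an entry of df$Term contains only a single word. In that case you would need to modify the regex, depending on which column you want that word to go in.)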
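
One possible modification, sketched here under the assumption that a lone word should land in LastWord and leave ExceptLast missing, is to make the leading group optional. The pattern and the `terms` and `res` names below are illustrative, not from the original answer.

```r
library(stringi)

# Illustrative input, including a single-word term
terms <- c("this is for the", "the first", "hello")

# The leading group (everything before the final whitespace) is optional,
# so a lone word falls through to the second group.
m <- stri_match_first_regex(terms, "^(?:(.*)\\s)?(\\S+)$")

res <- data.frame(Term = terms,
                  ExceptLast = m[, 2],
                  LastWord = m[, 3],
                  stringsAsFactors = FALSE)
res
```

An unmatched capture group comes back as NA by default, so ExceptLast is NA for the single-word row.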

