Question

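I think the title does not tell the whole story.

Let's say we have 2 lists: one is a list of lookup words, and the other is a list of multi-word strings.

Example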

data.lookup <- c('one', 'two', 'three')

data.real <- c('somewhere one day', 'mysterious elephants', 'two apple-pies', 
'love three corner', 'coffee break', 'three cats')

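Now we want to check whether each element of data.real contains a word from data.lookup.

For example, 'somewhere one day' contains 'one'.

We then store 'one' at the same index as 'somewhere one day'.

Currently, I have this function, which does exactly that.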

checkFromList <- function (data.lookup, data.real) {

    df <- data.frame('sentence' = data.real, 'lookup' = 1:length(data.real))

    for (lookup in data.lookup) {

        iter <- 1 # row counter for the current sentence

        for (sentence in data.real) {
            #If match then append
            if (grepl(lookup, sentence) == TRUE) {

                df[iter,2] <- lookup

            }

            iter <- iter + 1

        }

    }

    iter <- 1 # reset the row counter

    for (word in df[,2]) {

        if (is.element(word, data.lookup) == FALSE ) {

            df[iter, 2] <- 'nan'

        }

        iter <- iter + 1

    }

    return (df)

}

Running this function:

checkFromList(data.lookup, data.real)

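Output:

              sentence lookup
1    somewhere one day    one
2 mysterious elephants    nan
3       two apple-pies    two
4    love three corner  three
5         coffee break    nan
6           three cats  three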

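I know that, performance-wise, this function is not very good (too many for loops).

I am asking for suggestions on how to improve my code. Is there anything that could be written better?

Also, some of you may think a problem will arise if a sentence contains more than 2 lookup words. The data I am working with has only 3 words per sentence, and the chance of a sentence containing more than 2 lookup words is very low.

Thanks for any help and suggestions!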

Answer 1

Using base R:

stack(sapply(data.lookup, function(a) grep(a, data.real, value = TRUE)))

             values    ind
1 somewhere one day    one
2    two apple-pies    two
3 love three corner  three
4        three cats  three
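One caveat with the call above: sapply simplifies its result when it can, so if every lookup word happened to match the same number of sentences (more than one), it would return a matrix instead of a list, while stack is documented for lists and data frames. A minimal sketch, assuming you want the result to stay a named list regardless of the data, is to pass simplify = FALSE (the names matches and result are only illustrative):

```r
data.lookup <- c('one', 'two', 'three')
data.real <- c('somewhere one day', 'mysterious elephants', 'two apple-pies',
               'love three corner', 'coffee break', 'three cats')

# simplify = FALSE makes sapply always return a named list,
# one element per lookup word, whatever the match counts are
matches <- sapply(data.lookup,
                  function(a) grep(a, data.real, value = TRUE),
                  simplify = FALSE)

# stack() turns the named list into a two-column data frame
result <- stack(matches)
result
```

On the example data this gives the same four rows as shown above.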

If you want to keep the NAs:

stack(sapply(data.real, function(a){
    x = sapply(data.lookup, function(b) grepl(b,a))
    if(any(x)){names(which(x))} else {NA}
}))

  values                  ind
1    one    somewhere one day
2   <NA> mysterious elephants
3    two       two apple-pies
4  three    love three corner
5   <NA>         coffee break
6  three           three cats

Answer 2

We can do this with the stri_extract_all_regex function from the stringi package, building the regular expression pattern with paste0:

library(stringi)
stri_extract_all_regex(data.real, paste0(data.lookup, collapse = "|"))


#[[1]]
#[1] "one"

#[[2]]
#[1] NA

#[[3]]
#[1] "two"

#[[4]]
#[1] "three"

#[[5]]
#[1] NA

#[[6]]
#[1] "three"

We can build the expected output as a data frame with:

extract_words <- stri_extract_all_regex(data.real, paste0(data.lookup, 
                 collapse = "|"))
data.frame(sentence = data.real, lookup = unlist(extract_words))


#             sentence                lookup
#1    somewhere one day                   one
#2 mysterious elephants                  <NA>
#3       two apple-pies                   two
#4    love three corner                 three
#5         coffee break                  <NA>
#6           three cats                 three
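Note that unlist(extract_words) only lines up with data.real when every sentence has at most one match. If a sentence contained several lookup words, the unlisted vector would be longer than data.real. A minimal base R sketch of one way to handle that case, assuming you are happy to join multiple matches into a single comma-separated string (the sentence 'one plus two' is made up for illustration):

```r
data.lookup <- c('one', 'two', 'three')
data.real <- c('somewhere one day', 'coffee break', 'one plus two')

# One combined pattern: "one|two|three"
pattern <- paste0(data.lookup, collapse = "|")

# gregexpr finds every match; regmatches returns a list with one
# character vector per sentence (empty when nothing matched)
all_matches <- regmatches(data.real, gregexpr(pattern, data.real))

# Collapse each sentence's matches into one string, or NA if there are none
lookup <- vapply(all_matches, function(x) {
    if (length(x) > 0) paste(x, collapse = ", ") else NA_character_
}, character(1))

df <- data.frame(sentence = data.real, lookup = lookup)
df
```

Because vapply always returns exactly one value per sentence, the lookup column keeps the same length as data.real.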

We can also use str_extract or str_extract_all from stringr:

library(stringr)
str_extract(data.real, paste0(data.lookup, collapse = "|"))

#[1] "one"   NA      "two"   "three" NA      "three"
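All of the patterns above match substrings, so 'one' would also be found inside a word such as 'someone'. If whole-word matches are what you need, a minimal sketch, assuming word boundaries (\\b) fit your data and using only base R so that it runs without extra packages, could look like this (the sentence 'someone called' is made up for illustration):

```r
data.lookup <- c('one', 'two', 'three')
data.real <- c('somewhere one day', 'someone called', 'three cats')

# Wrap the alternatives in word boundaries: "\\b(one|two|three)\\b"
pattern <- paste0("\\b(", paste(data.lookup, collapse = "|"), ")\\b")

# regexpr returns the position of the first match per sentence, or -1
m <- regexpr(pattern, data.real, perl = TRUE)

# Start with NA everywhere, then fill in the sentences that matched
lookup <- rep(NA_character_, length(data.real))
lookup[m != -1] <- regmatches(data.real, m)

df <- data.frame(sentence = data.real, lookup = lookup)
df
```

Here 'someone called' gets NA because 'one' only appears as part of a longer word. The same bounded pattern could also be passed to str_extract or stri_extract_all_regex.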
