Question

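I am trying to string-match the values in a data frame column against the elements of a character vector. If there is a match, I want the matching element of the vector to be returned. I am calling a function inside dplyr::mutate to try to achieve this.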

I have a data frame called keywords that looks like this:

+-----------------------+-------------+---------------+
|      Page.Title       | Event.Label | Unique.Events |
+-----------------------+-------------+---------------+
| Awesome Sale in Spain | pool        |           123 |
| Spain Holidays        | pool        |            34 |
| Edinburgh Castles     | sea-view    |            45 |
| London Houses         | help-to-buy |            56 |
| Cars in Greece        | beach       |            82 |
+-----------------------+-------------+---------------+

I have a vector called locations that looks like this:

c('Edinburgh', 'London', 'Spain')

I created a function called location_finder as follows:

location_finder <- function(locations, col_name) {
  for (i in locations) {
    if (str_detect(col_name, i)) {
      return(i)
    } else {
      return ('Other')
    }
  }
}

My code is:

require(dplyr)
require(magrittr)
require(stringr)

df_working <- rowwise(keywords) %>%
  mutate(Location=location_finder(locations,Page.Title))

My expected output is:

+-----------------------+-------------+---------------+-----------+
|      Page.Title       | Event.Label | Unique.Events | Location  |
+-----------------------+-------------+---------------+-----------+
| Awesome Sale in Spain | pool        |           123 | Spain     |
| Spain Holidays        | pool        |            34 | Spain     |
| Edinburgh Castles     | sea-view    |            45 | Edinburgh |
| London Houses         | help-to-buy |            56 | London    |
| Cars in Greece        | beach       |            82 | Other     |
+-----------------------+-------------+---------------+-----------+

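My result only ever matches 'Edinburgh'; every other row just gets 'Other'. Presumably this is because 'Edinburgh' is the first element in the vector. Any help would be greatly appreciated.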

Answer 1

You can rewrite the function using grepl and then extract the matches from the vector of cities, like this:

string <- "Awesome Sale in Spain"
cities <- c('Edinburgh', 'London', 'Spain')
cities[sapply(cities, grepl, string)]

If a string matches more than one city, this solution returns all of the matching cities.
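
To make the multiple-match behaviour concrete, here is a small sketch. The input string is made up for illustration and does not come from the question's data.

```r
# Hypothetical title that mentions two of the cities
string <- "Flights from London to Spain"
cities <- c("Edinburgh", "London", "Spain")

# grepl() is called once per city; the logical result indexes the cities vector
matched <- cities[sapply(cities, grepl, string)]
print(matched)
```

Both "London" and "Spain" come back here, so the result can be longer than one element for a single title.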

Edit:

Here is a complete example using a data frame:

df <- data.frame(Page.Title = c("Awesome Sale in Spain", "Spain Holidays", "Edinburgh Castles", "London Houses", "Cars in Greece"),
                 Event.Label = c("pool", "pool", "sea-view", "help-to-buy", "beach"))

cities <- c('Edinburgh', 'London', 'Spain')

df$cities <- sapply(df$Page.Title, function(title) {
  city <- cities[sapply(cities, grepl, title)]
})

Edit 2:

If you want the matching to ignore case, just use:

city <- cities[sapply(cities, grepl, title, ignore.case = TRUE)]
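
If you need exactly one value per row, with 'Other' as the fallback the question asks for, the same idea can be wrapped in a small helper. This is only a sketch: first_city() and its default argument are hypothetical names, and fixed = TRUE assumes the city names should be treated as literal text rather than regular expressions.

```r
cities <- c("Edinburgh", "London", "Spain")
titles <- c("Awesome Sale in Spain", "Spain Holidays", "Edinburgh Castles",
            "London Houses", "Cars in Greece")

# Hypothetical helper: first matching city, or `default` when nothing matches
first_city <- function(title, cities, default = "Other") {
  hits <- cities[vapply(cities, grepl, logical(1), x = title, fixed = TRUE)]
  if (length(hits) > 0) hits[1] else default
}

# vapply() guarantees a plain character vector, one element per title
location <- vapply(titles, first_city, character(1),
                   cities = cities, USE.NAMES = FALSE)
print(location)
```

Because the result is an ordinary character vector, it can be assigned directly to a data frame column instead of producing a list column.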

Answer 2

Another answer, which should be faster on larger data sets:

location_finder <- function(text, keywords, case_insensitive = FALSE, unique_pattern = TRUE) {
  lapply(text, function(t) {
    out <- stringi::stri_extract_all_regex(
      str = t,
      pattern = paste0("\\b",
                       keywords,
                       "\\b"), #Use word boundaries
      omit_no_match = FALSE,
      simplify = FALSE,
      opts_regex = stringi::stri_opts_regex(
        case_insensitive = case_insensitive
      )
    )
    out[is.na(out)] <- NULL
    if (unique_pattern) {
      return(unique(unlist(out)))
    } else {
      return(unlist(out))
    }
  })
}
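
The word boundaries (\\b) around each keyword are what stop a keyword from matching inside a longer word. A minimal base R sketch of the difference follows; the titles and the keyword are invented for illustration and are not part of the question's data.

```r
# Hypothetical inputs: "York" also occurs inside "Yorkshire"
titles  <- c("Hotels in York", "Yorkshire Dales walks")
keyword <- "York"

# Plain substring matching finds the keyword in both titles
substring_hit <- grepl(keyword, titles, fixed = TRUE)

# With word boundaries only the stand-alone word matches
whole_word_hit <- grepl(paste0("\\b", keyword, "\\b"), titles, perl = TRUE)

print(substring_hit)
print(whole_word_hit)
```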

You should be able to use it like this:

library(dplyr)
library(magrittr)
library(stringi)
df <- data.frame(Page.Title = c("Awesome Sale in Spain", "Spain Holidays", "Edinburgh Castles", "London Houses", "Cars in Greece"),
                 Event.Label = c("pool", "pool", "sea-view", "help-to-buy", "beach"))
locations <- c('Edinburgh', 'London', 'Spain')

df_working <- df %>%
  mutate(Location = location_finder(text = Page.Title, keywords = locations))

# If you don't like the NULL in the new column
df_working$Location[sapply(df_working$Location, is.null)] <- "Other"
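
Note that Location is a list column here, because a title can match several keywords. If a plain character column is easier to work with, the list can be flattened. The sketch below uses a hand-written list, named matches purely for illustration, shaped like the output of location_finder above, and joins multiple matches with a comma.

```r
# Hypothetical list shaped like the function's output (NULL = no match)
matches <- list("Spain", "Spain", "Edinburgh", "London", NULL,
                c("London", "Spain"))

# One string per element: "Other" for no match, comma-separated otherwise
flat <- vapply(
  matches,
  function(m) if (is.null(m)) "Other" else paste(m, collapse = ", "),
  character(1)
)
print(flat)
```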

If you are interested, a parallelised version of this is available in my own package. If the rest of the package is of no use to you, just grab the source code.

Answer 3

We can also use the strsplit and which functions:

# split each page title into words
# (df$Page.Title and locations are the objects defined in the question)
vals <- strsplit(as.character(df$Page.Title), ' ')

# keep the first word that appears in the locations vector (NA if none does)
vals <- sapply(vals, function(x) x[which(x %in% locations)][1])

# create the new column and set missing values to 'Other'
df$new_col <- ifelse(is.na(vals), 'Other', vals)

Answer 4

You can try:

library(stringr)

keywords$Location <- sapply(keywords$Page.Title, function(x) na.omit(str_extract(x, locations))[1])

keywords$Location[is.na(keywords$Location)]<-"Other"
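
Since the question asks specifically about dplyr::mutate, here is a sketch of a fully vectorised variant that needs neither rowwise nor sapply. It assumes the location names contain no regex metacharacters, and it rebuilds the keywords data frame from the table in the question. With a single alternation pattern, the match that appears earliest in the title wins, not the one that comes first in locations.

```r
library(dplyr)
library(stringr)

# Data rebuilt from the table in the question
keywords <- data.frame(
  Page.Title = c("Awesome Sale in Spain", "Spain Holidays",
                 "Edinburgh Castles", "London Houses", "Cars in Greece"),
  Event.Label = c("pool", "pool", "sea-view", "help-to-buy", "beach"),
  Unique.Events = c(123, 34, 45, 56, 82),
  stringsAsFactors = FALSE
)
locations <- c("Edinburgh", "London", "Spain")

# Collapse the vector into one pattern: "Edinburgh|London|Spain"
pattern <- str_c(locations, collapse = "|")

# str_extract() returns NA when nothing matches; coalesce() fills in "Other"
df_working <- keywords %>%
  mutate(Location = coalesce(str_extract(Page.Title, pattern), "Other"))

print(df_working)
```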

How to do string matching in R with dplyr mutate using a vector of strings
