Extract only the matched values instead of the whole string

Time: 2016-06-14 20:10:35

Tags: r

Below is a sample of the tweets I extracted. They are stored in a data frame as 'text'.

(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo SJD $312 nonstop on @AmericanAir for summer travel. airfare
(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.

I use grep to pull out the strings that match the values I supply. Here is the code:

toMatch <- c("Los Angeles", "New York")
matches <- unique(grep(paste(toMatch,collapse="|"), 
                    text, value=TRUE))

If there is any match, this returns the entire row to me.

I only want output like the following:

 (row 1) Los Angeles Los Angeles
 (row 2) New York

Also, is there a way to output the cities in separate cells of the same row?

2 Answers:

Answer 0: (score: 2)

You can try str_extract_all from the stringr package. It returns a list with one character vector of matches per input string:

text = c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel. #airfare",
         "(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.")

stringr::str_extract_all(text, paste(toMatch, collapse = "|"))
# [[1]]
# [1] "Los Angeles" "Los Angeles"
#
# [[2]]
# [1] "New York"
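
If adding a package is not an option, roughly the same result can be sketched in base R with gregexpr and regmatches. This is an illustrative alternative, not part of the original answer. The name base_matches is made up for the example, and the first tweet is shortened.

```r
toMatch <- c("Los Angeles", "New York")
text <- c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD",
          "(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.")

# Build one alternation pattern: "Los Angeles|New York"
pattern <- paste(toMatch, collapse = "|")

# gregexpr() locates every match in each string;
# regmatches() then pulls out the matched substrings.
# The result is a list with one character vector per element of text.
base_matches <- regmatches(text, gregexpr(pattern, text))
print(base_matches)
```

As with the stringr call, strings without any match yield an empty character vector, so the list always has the same length as text.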

Answer 1: (score: 2)

This again uses str_extract_all, as in Psidom's answer. But if you need to list the row number next to each set of matches, you can try this:

toMatch <- c("Los Angeles", "New York")
text = c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD",
         "(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.",
         "(row 3) SOME JUNK HERE",
         "(row 4) RT @airfarewatchdog: Los Angeles Los Angeles LAX to New York"
)

a <- unlist(sapply(1:length(text), function(i) {
  res <- paste(unlist(stringr::str_extract_all(text[i], paste(toMatch, collapse = "|"))), collapse = ' ')
  if (res != "") paste('(row ',i,') ', res, "\n", sep = "")
  else NULL
}))

cat(a, sep = "")  # sep = "" avoids a leading space on every line after the first
# (row 1) Los Angeles Los Angeles
# (row 2) New York
# (row 4) Los Angeles Los Angeles New York

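To get the result as a data frame with each match in its own column, the following works. Note that this is a general approach that handles any number of matches per row: the final data frame automatically has enough columns to hold the row with the most matches.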

a <- sapply(1:length(text), function(i) {
  res <- c(i, unlist(stringr::str_extract_all(text[i], paste(toMatch, collapse = "|"))))
  if (length(res) > 1) res else NULL
})
a <- plyr::ldply(a, rbind)
a[] <- lapply(a, as.character)
a[is.na(a)] <- ""
names(a)[1] <- "row"
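
The same wide layout can also be sketched without plyr, using only base R. This is a hedged variant rather than the answerer's method. It assumes at least one element of text contains a match, and the names found, keep, width, padded and wide are chosen only for this example.

```r
toMatch <- c("Los Angeles", "New York")
text <- c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD",
          "(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.",
          "(row 3) SOME JUNK HERE",
          "(row 4) RT @airfarewatchdog: Los Angeles Los Angeles LAX to New York")

pattern <- paste(toMatch, collapse = "|")
found <- regmatches(text, gregexpr(pattern, text))

keep  <- lengths(found) > 0   # rows with at least one match
width <- max(lengths(found))  # columns needed for the longest row

# Pad each kept row with "" so every row has exactly `width` entries,
# then stack the rows into a character matrix.
padded <- do.call(rbind, lapply(found[keep], function(m) {
  c(m, rep("", width - length(m)))
}))

# One column for the original row index, then one column per match.
wide <- data.frame(row = which(keep), padded, stringsAsFactors = FALSE)
print(wide)
```

Here row 3 is dropped because it has no match, and shorter rows are filled with empty strings, mirroring what the ldply version does after replacing NA with "".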