Here is a sample of the tweets I extracted. They are stored as a data frame in 'text':
(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo SJD $312 nonstop on @AmericanAir for summer travel. airfare
(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.
Below is the grep call I use to extract the strings that match the values I supply. Here is the code:
toMatch <- c("Los Angeles", "New York")
matches <- unique(grep(paste(toMatch,collapse="|"),
text, value=TRUE))
If there is any match, this returns the entire line.
I only want output like this:
(row 1) Los Angeles Los Angeles
(row 2) New York
Also, is there a way to output the cities in separate cells of the same row?
Answer 0: (score: 2)
You can try str_extract_all from the stringr package:
text = c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel. #airfare",
"(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.")
stringr::str_extract_all(text, paste(toMatch, collapse = "|"))
[[1]]
[1] "Los Angeles" "Los Angeles"
[[2]]
[1] "New York"
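If installing stringr is not an option, the same extraction can be sketched in base R with gregexpr and regmatches. This is only a minimal sketch under assumptions: the sample strings are shortened copies of the tweets in the question, and the names found and collapsed are made up for illustration. It also collapses each row's matches into one string, which is close to the output the question asks for.

```r
# Sample data copied from the question (shortened for readability).
text <- c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo",
          "(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City")
toMatch <- c("Los Angeles", "New York")

# gregexpr() locates every match; regmatches() pulls out the matched text.
# The result is a list with one character vector per element of text.
found <- regmatches(text, gregexpr(paste(toMatch, collapse = "|"), text))

# Collapse each row's matches into a single space-separated string.
# Rows without any match become "".
collapsed <- vapply(found, paste, character(1), collapse = " ")
print(collapsed)
```

Here found has the same list shape as the str_extract_all result, so either one can be fed into the same follow-up code.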
Answer 1: (score: 2)
Again using str_extract_all, as in this answer by Psidom. But if you need to list the row number of each match, you could try this...
toMatch <- c("Los Angeles", "New York")
text = c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD",
"(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.",
"(row 3) SOME JUNK HERE",
"(row 4) RT @airfarewatchdog: Los Angeles Los Angeles LAX to New York"
)
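a <- unlist(sapply(seq_along(text), function(i) {
# Collapse all matches found in row i into one space-separated string
res <- paste(unlist(stringr::str_extract_all(text[i], paste(toMatch, collapse = "|"))), collapse = ' ')
# Keep only rows with at least one match, prefixed with the row number
if (res != "") paste('(row ',i,') ', res, "\n", sep = "")
else NULL
}))
cat(a, sep = "")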
# (row 1) Los Angeles Los Angeles
# (row 2) New York
# (row 4) Los Angeles Los Angeles New York
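To get the result as a data frame, with each match in a separate column, this works (note that this is a general approach that handles any number of matches per row; the final data frame will automatically have enough columns to hold the row with the most matches):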
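# lapply() always returns a list, which is what ldply() expects
a <- lapply(seq_along(text), function(i) {
# Row number followed by every match found in that row
res <- c(i, unlist(stringr::str_extract_all(text[i], paste(toMatch, collapse = "|"))))
if (length(res) > 1 ) {res
} else NULL
})
# Bind the vectors row-wise; shorter rows are padded with NA
a <- plyr::ldply(a, rbind)
a[] <- lapply(a, as.character)
# Replace the NA padding with empty strings
a[is.na(a)] <- ""
names(a)[1] <- "row"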
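For comparison, here is a rough base R sketch of the same idea that does not depend on plyr or stringr. It is not the answer's method, just one possible alternative: it assumes at least one row contains a match, the sample strings are shortened copies of the ones above, and the names found, keep, width, padded and result are made up for illustration.

```r
# Sample data based on the example above (shortened).
text <- c("(row 1) RT @airfarewatchdog: Los Angeles Los Angeles LAX to Cabo",
          "(row 2) RT @TheFlightDeal: Airfare Deal: [AA] New York - Mexico City",
          "(row 3) SOME JUNK HERE",
          "(row 4) RT @airfarewatchdog: Los Angeles Los Angeles LAX to New York")
toMatch <- c("Los Angeles", "New York")

# One character vector of matches per row (character(0) when nothing matches).
found <- regmatches(text, gregexpr(paste(toMatch, collapse = "|"), text))

keep  <- lengths(found) > 0   # rows that have at least one match
width <- max(lengths(found))  # number of columns needed for the longest row

# Pad every kept row with "" up to the same width, then stack the rows.
padded <- do.call(rbind, lapply(found[keep], function(m) {
  c(m, rep("", width - length(m)))
}))

# First column is the original row number; the matrix columns are
# auto-named X1, X2, ... by data.frame().
result <- data.frame(row = which(keep), padded, stringsAsFactors = FALSE)
print(result)
```

Padding with "" up front avoids the separate NA replacement step, at the cost of computing the maximum width by hand.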