Question

I have a data frame column containing page paths (let's call the data frame A):

pagePath
/text/other_text/123-string1-4571/text.html
/text/other_text/string2/15-some_other_txet.html
/text/other_text/25189-string3/45112-text.html
/text/other_text/text/string4/5418874-some_other_txet.html
/text/other_text/string5/text/some_other_txet-4157/text.html
/text/other_text/123-text-4571/text.html
/text/other_text/125-text-471/text.html

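I also have a column of strings in another data frame, which we can call B. The two data frames are different and do not have the same number of rows.

Here is a sample of the column in data frame B: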

names
string1
string11
string4
string3
string2
string10
string5
string100

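What I want to do is check whether each page path in A contains one of the strings from the other data frame (B).

I am having trouble because the two data frames differ in length and the rows are not in any matching order.

Expected output

I would like to get a result like this: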

pagePath                                                      names     exist
/text/other_text/123-string1-4571/text.html                   string1   TRUE
/text/other_text/string2/15-some_other_txet.html              string2   TRUE
/text/other_text/25189-string3/45112-text.html                string3   TRUE
/text/other_text/text/string4/5418874-some_other_txet.html    string4   TRUE
/text/other_text/string5/text/some_other_txet-4157/text.html  string5   TRUE
/text/other_text/123-text-4571/text.html                      NA        FALSE
/text/other_text/125-text-471/text.html                       NA        FALSE

If my question needs further clarification, please let me know.

Answer 1

We can use grepl() to generate the exist column:

# Collapse B$names into one string with "|" 
onestring <- paste(B$names, collapse = "|") 

# Generate new column
A$exist <- grepl(onestring, A$pagePath)
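
The grepl() call only yields the exist column. A minimal sketch of how the matching name could be recovered as well, using base R's regexpr() and regmatches(); the data frames A and B here are rebuilt by hand from the question's sample, and the sketch assumes the names contain no regex metacharacters:

```r
# Sample data rebuilt from the question (an assumption for illustration)
A <- data.frame(pagePath = c(
  "/text/other_text/123-string1-4571/text.html",
  "/text/other_text/string2/15-some_other_txet.html",
  "/text/other_text/25189-string3/45112-text.html",
  "/text/other_text/text/string4/5418874-some_other_txet.html",
  "/text/other_text/string5/text/some_other_txet-4157/text.html",
  "/text/other_text/123-text-4571/text.html",
  "/text/other_text/125-text-471/text.html"
), stringsAsFactors = FALSE)
B <- data.frame(names = c("string1", "string11", "string4", "string3",
                          "string2", "string10", "string5", "string100"),
                stringsAsFactors = FALSE)

# Same alternation pattern as above
onestring <- paste(B$names, collapse = "|")

# TRUE where any name occurs in the path
A$exist <- grepl(onestring, A$pagePath)

# regexpr() locates the first match per path; regmatches() returns the
# matched text for the paths that matched and drops the rest
m <- regexpr(onestring, A$pagePath)
A$names <- NA_character_
A$names[A$exist] <- regmatches(A$pagePath, m)
```

Paths without a match keep NA in names, which lines up with the expected output in the question.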

Answer 2

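Not ideal, because it uses a for loop:

names <- rep(NA, length(A$pagePath))
exist <- rep(FALSE, length(A$pagePath))

for (name in B$names) {
  # Rows of A whose path contains this name
  hits <- grep(name, A$pagePath)
  names[hits] <- name
  exist[hits] <- TRUE
}

# Attach the results to A
A$names <- names
A$exist <- exist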
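
A loop-free variant of the same idea, as a rough sketch. It assumes A and B look like the question's sample (rebuilt by hand here) and that each name should be matched as a literal substring, hence fixed = TRUE. When several names match one path, this version keeps the first one in B's order, whereas the loop keeps the last:

```r
# Sample data rebuilt from the question (an assumption for illustration)
A <- data.frame(pagePath = c(
  "/text/other_text/123-string1-4571/text.html",
  "/text/other_text/string2/15-some_other_txet.html",
  "/text/other_text/25189-string3/45112-text.html",
  "/text/other_text/text/string4/5418874-some_other_txet.html",
  "/text/other_text/string5/text/some_other_txet-4157/text.html",
  "/text/other_text/123-text-4571/text.html",
  "/text/other_text/125-text-471/text.html"
), stringsAsFactors = FALSE)
B <- data.frame(names = c("string1", "string11", "string4", "string3",
                          "string2", "string10", "string5", "string100"),
                stringsAsFactors = FALSE)

# Logical matrix: one row per path, one column per name
hits <- sapply(B$names, grepl, x = A$pagePath, fixed = TRUE)

# A path "exists" if at least one name matched it
A$exist <- rowSums(hits) > 0

# First matching name per row, or NA when nothing matched
A$names <- apply(hits, 1, function(r) {
  if (any(r)) names(which(r))[1] else NA_character_
})
```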

Answer 3

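We can use str_extract_all from the stringr package. Rows with no match come back as character(0) instead of NA, so we have to change those afterwards:

library(stringr)

df$names <- as.character(str_extract_all(df$pagePath, "string[0-9]+"))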
df$exist <- df$names %in% df1$names
df[df=="character(0)"] <- NA
df
#                                                 pagePath       names   exist
#1                  /text/other_text/123-string1-4571/text.html string1  TRUE
#2             /text/other_text/string2/15-some_other_txet.html string2  TRUE
#3               /text/other_text/25189-string3/45112-text.html string3  TRUE
#4   /text/other_text/text/string4/5418874-some_other_txet.html string4  TRUE
#5 /text/other_text/string5/text/some_other_txet-4157/text.html string5  TRUE
#6                     /text/other_text/123-text-4571/text.html    <NA> FALSE
#7                      /text/other_text/125-text-471/text.html    <NA> FALSE

Data

dput(df)
structure(list(pagePath = structure(c(1L, 5L, 4L, 7L, 6L, 2L, 3L), .Label = c("/text/other_text/123-string1-4571/text.html", "/text/other_text/123-text-4571/text.html", "/text/other_text/125-text-471/text.html", "/text/other_text/25189-string3/45112-text.html", "/text/other_text/string2/15-some_other_txet.html", "/text/other_text/string5/text/some_other_txet-4157/text.html", "/text/other_text/text/string4/5418874-some_other_txet.html"), class = "factor")), .Names = "pagePath", class = "data.frame", row.names = c(NA, -7L))

dput(df1)
structure(list(names = structure(c(1L, 4L, 7L, 6L, 5L, 2L, 8L, 3L), .Label = c("string1", "string10", "string100", "string11", "string2", "string3", "string4", "string5"), class = "factor")), .Names = "names", class = "data.frame", row.names = c(NA, -8L))

Answer 4

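Here is one way using apply. It assumes df already holds the paths in its first column and the candidate name for each row in its second column, and it tests each row's path against that row's name:

# x[1] is the row's pagePath, x[2] is the row's name
df$exist <- apply(df, 1, function(x) grepl(x[2], x[1]))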
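
A row-wise check can also be sketched with mapply(), which pairs each name with its own path without converting the data frame to a matrix. The small df below is made up for illustration, with the column layout assumed to be pagePath then names; rows whose name is NA are skipped and left as FALSE, which avoids relying on how grepl() treats an NA pattern:

```r
# Hypothetical three-row data frame with paths and candidate names side by side
df <- data.frame(
  pagePath = c("/text/other_text/123-string1-4571/text.html",
               "/text/other_text/123-text-4571/text.html",
               "/text/other_text/125-text-471/text.html"),
  names = c("string1", NA, "string2"),
  stringsAsFactors = FALSE
)

# Only test rows that actually have a name
ok <- !is.na(df$names)

df$exist <- FALSE
# Pair each name with its own path; fixed = TRUE matches the name literally
df$exist[ok] <- mapply(grepl, df$names[ok], df$pagePath[ok],
                       MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
```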

Comparing a dataframe column with another dataframe column
