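I am trying to use a list of strings to extract information from two or more columns (two are shown below) and create another column containing whichever string from the list was found. The columns should be searched in a fixed order that specifies which one to look in first. An example and the desired output are below; I hope they make clear what I am looking for.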
A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)
list <- c("NYU","FIT","UCLA","CA","UT","USC")
A B
1 This contains NYU NYU
2 This has NYU UT
3 This has XT USC
4 This has FIT FIT
5 Something something UNH UNA
6 I got into UCLA UCLA
7 Hello XT CA
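I would like the code to search for the strings in the list, looking at column A first; if no string is found there, it should look at column B; and if there is still no match, it should return a null value (shown as <NA> below). Given the list above, the desired output looks like this.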
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA <NA>
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
Answer 0 (score: 4)
You can turn the list into a regular expression and then apply R's regexpr function:
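# Build one alternation pattern from the vector:
# expr is "NYU|FIT|UCLA|CA|UT|USC", i.e. NYU or FIT or UCLA ...
expr <- paste0(list, collapse = "|")
# Start with NA so that rows without any match stay NA
data[, "C"] <- NA_character_
# Process the columns in reverse order (B, then A) so a match in A overwrites one in B
cols <- rev(setdiff(names(data), "C"))
for (col in cols) {
  index <- regexpr(expr, data[, col])
  data[, "C"] <- ifelse(index != -1,
                        substr(data[, col], index, index + attr(index, "match.length") - 1),
                        data[, "C"])
}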
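One caveat with a plain alternation pattern is that it matches substrings, so a key such as UT would also be found inside a longer word. The following is a minimal sketch of a whole-word variant, assuming the keys contain no regex metacharacters; `first_key` and `keys` are hypothetical names introduced here, not part of the original answer.

```r
# Sample data from the question, kept as plain character columns
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
data <- data.frame(A, B, stringsAsFactors = FALSE)
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Hypothetical helper: first whole-word key found in each string, NA if none
first_key <- function(x, keys) {
  pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")
  hit <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns one string per element that actually matched
  out[hit != -1] <- regmatches(x, hit)
  out
}

from_A <- first_key(data$A, keys)
from_B <- first_key(data$B, keys)
# Prefer the match from A and fall back to B
data$C <- ifelse(is.na(from_A), from_B, from_A)
```

Because each column is searched separately, adding a third column only needs one more `first_key` call and one more fallback step.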
Hope this helps
Gottavianoni
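Answer 1 (score: 0)
Load the tokenizers package with library(tokenizers).
Merge the two columns
and create a new column, newC, holding the merged A and B.
Then loop over the merged column to extract the matching values into a vector, which you can bind to the existing data frame.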
data$newC <- paste(data$A, data$B, sep = " " )
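The extraction loop itself is not shown in this answer. As a rough sketch of what such a loop could look like, the version below uses base R's `strsplit` instead of the tokenizers package, which is a substitution made here and not the answerer's code; `keys` and `C` are illustrative names. Since A is pasted before B, taking the first matching word gives column A priority.

```r
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
data <- data.frame(A, B, stringsAsFactors = FALSE)
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Merge the two columns, A first
data$newC <- paste(data$A, data$B, sep = " ")

# Assumed loop: keep the first space-separated word that appears in keys
C <- character(nrow(data))
for (i in seq_len(nrow(data))) {
  words <- strsplit(data$newC[i], " ", fixed = TRUE)[[1]]
  hits <- words[words %in% keys]
  C[i] <- if (length(hits) > 0) hits[1] else NA_character_
}

# Bind the extracted vector to the existing data frame
data$C <- C
```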
Hope it helps. I get the output shown above, as you expected.
Answer 2 (score: 0)
Another approach could be
#common between column A & vector l
C_tempA <- sapply(df$A, function(x) intersect(strsplit(as.character(x), split = " ")[[1]], l))
#common between column B & vector l
C_tempB <- sapply(df$B, function(x) intersect(as.character(x), l))
#column C calculation
df$C <- ifelse(C_tempA=="character(0)", C_tempB, C_tempA)
df$C[df$C=="character(0)"] <- NA
#final dataframe
df
The output is:
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA NA
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
Sample data:
df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT",
"I got into UCLA", "Something something UNH", "This contains NYU",
"This has FIT", "This has NYU", "This has XT"), class = "factor"),
B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA",
"FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A",
"B"), row.names = c(NA, -7L), class = "data.frame")
l <- c("NYU","FIT","UCLA","CA","UT","USC")
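The comparison against the string "character(0)" in the code above appears to depend on how R coerces empty list elements, and it can leave C as a list column. A minimal sketch of the same intersect idea that returns a plain character column is given below, assuming words are separated by single spaces; `first_common` is a hypothetical helper name, and the data frame is rebuilt with character columns to keep the example self-contained.

```r
df <- data.frame(
  A = c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
        "Something something UNH", "I got into UCLA", "Hello XT"),
  B = c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA"),
  stringsAsFactors = FALSE
)
l <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Hypothetical helper: first word of each string that also occurs in keys
first_common <- function(x, keys) {
  vapply(as.character(x), function(s) {
    common <- intersect(strsplit(s, " ", fixed = TRUE)[[1]], keys)
    if (length(common) > 0) common[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

C_tempA <- first_common(df$A, l)
C_tempB <- first_common(df$B, l)
# Column A takes priority; column B is the fallback
df$C <- ifelse(is.na(C_tempA), C_tempB, C_tempA)
```

Using `vapply` with `character(1)` guarantees one string (or NA) per row, so no list-to-string comparison is needed.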