Extracting strings from multiple columns using a list in R

Asked: 2018-02-09 15:21:49

Tags: r string text-extraction

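I am trying to extract information from two or more columns (two are shown below) using a list of strings, and to create another column that holds whichever string from the list is found, with the order of the columns deciding which one is searched first. A sample and the desired output are below; I hope they make clear what I am looking for.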

A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT", 
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)

list <- c("NYU","FIT","UCLA","CA","UT","USC")

                        A    B
1       This contains NYU  NYU
2            This has NYU   UT
3             This has XT  USC
4            This has FIT  FIT
5 Something something UNH  UNA
6         I got into UCLA UCLA
7                Hello XT   CA 

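I want the code to search for the strings in the list, looking at column A first; if no string is found there, it should look at column B, and if nothing is found there either, give null (shown as <NA> below). Given the list above, the desired output looks like this.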

                        A    B    C
1       This contains NYU  NYU  NYU
2            This has NYU   UT  NYU
3             This has XT  USC  USC
4            This has FIT  FIT  FIT
5 Something something UNH  UNA <NA>
6         I got into UCLA UCLA UCLA
7                Hello XT   CA   CA

3 Answers:

Answer 0 (score: 4)

You can turn the list into a regular expression and then apply R's regexpr function:

expr <- paste0(list,collapse = "|")
# expr = "NYU|FIT|UCLA|CA|UT|USC" -> Reg expr means NYU or FIT or ......

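# Start with NA so that rows without any match stay missing
data[,"C"] <- NA_character_
# Reverse the column order so that A is processed last and overrides B
cols <- rev(setdiff(names(data), "C"))

for(col in cols) {
 index <- regexpr(expr,data[,col])
 # Where a match exists, extract it; otherwise keep the current value of C
 data[,"C"] <- ifelse(index != -1,substr(data[,col],index,index + attr(index,"match.length")-1),data[,"C"])
}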
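
One caveat with a plain alternation such as "NYU|FIT|...": it also matches inside longer words (for example "CA" inside "CAT"). The following is a minimal sketch of the same idea with word boundaries added, not part of the original answer. The names `keywords` and `first_match` are hypothetical, and the data frame is rebuilt with stringsAsFactors = FALSE so the block runs on its own.

```r
# Sample data from the question, kept as character rather than factor
data <- data.frame(
  A = c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
        "Something something UNH", "I got into UCLA", "Hello XT"),
  B = c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA"),
  stringsAsFactors = FALSE
)
keywords <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Hypothetical helper: first keyword found in each element of x, NA if none
first_match <- function(x, keywords) {
  # \\b restricts matches to whole words
  pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
  hit <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  found <- !is.na(hit) & hit != -1
  # regmatches() returns only the substrings of the elements that matched
  out[found] <- regmatches(x, hit)
  out
}

from_A <- first_match(data$A, keywords)
from_B <- first_match(data$B, keywords)
# Prefer the match from A; fall back to B only where A had none
data$C <- ifelse(is.na(from_A), from_B, from_A)
data
```

Computing one vector per column and combining them at the end keeps the column priority explicit; with more than two columns, the same fallback step could be repeated from the lowest-priority column upward.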

Hope this helps.

Gottavianoni

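Answer 1 (score: 0)

Use library(tokenizers) from the tokenizers package.

Merge the two columns and create a new column from the combined A and B:

data$newC <- paste(data$A, data$B, sep = " " )

Then, using a loop, extract the values into a vector, which you can then bind to your existing data frame.

Hope it helps. With this I get the output you expected, as shown above.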
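
The extraction step can be sketched roughly as follows. This is not the answerer's code: it uses base R's strsplit in place of the tokenizers package, and the name `keywords` is a hypothetical stand-in for the question's list. Because the words of A come before the words of B in the merged string, taking the first hit gives A priority.

```r
# Sample data from the question, kept as character rather than factor
data <- data.frame(
  A = c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
        "Something something UNH", "I got into UCLA", "Hello XT"),
  B = c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA"),
  stringsAsFactors = FALSE
)
keywords <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Merge the columns so that words from A precede words from B
data$newC <- paste(data$A, data$B, sep = " ")

# Split each merged string on spaces and keep the first word found in keywords
data$C <- vapply(strsplit(data$newC, " ", fixed = TRUE), function(words) {
  hits <- words[words %in% keywords]
  if (length(hits) > 0) hits[1] else NA_character_
}, character(1))
data
```

Splitting on single spaces assumes the text has no punctuation attached to the keywords; a proper tokenizer would handle that case better.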


Answer 2 (score: 0)

Another approach could be

#common between column A & vector l
C_tempA <- sapply(df$A, function(x) intersect(strsplit(as.character(x), split = " ")[[1]], l))
#common between column B & vector l
C_tempB <- sapply(df$B, function(x) intersect(as.character(x), l))

#column C calculation
df$C <- ifelse(C_tempA=="character(0)", C_tempB, C_tempA)
df$C[df$C=="character(0)"] <- NA

#final dataframe
df

The output is:

                        A    B    C
1       This contains NYU  NYU  NYU
2            This has NYU   UT  NYU
3             This has XT  USC  USC
4            This has FIT  FIT  FIT
5 Something something UNH  UNA   NA
6         I got into UCLA UCLA UCLA
7                Hello XT   CA   CA

Sample data:

df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT", 
"I got into UCLA", "Something something UNH", "This contains NYU", 
"This has FIT", "This has NYU", "This has XT"), class = "factor"), 
    B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA", 
    "FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A", 
"B"), row.names = c(NA, -7L), class = "data.frame")

l <- c("NYU","FIT","UCLA","CA","UT","USC")
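
The code in this answer relies on comparing list elements with the string "character(0)" and leaves C as a list column. A possible variant, offered only as a sketch, tests for empty results with lengths() and returns a plain character column. The helper `first_or_na` is hypothetical, and the sample data is rebuilt here as character columns so the block runs on its own.

```r
# Same sample data as above, with character columns instead of factors
df <- data.frame(
  A = c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
        "Something something UNH", "I got into UCLA", "Hello XT"),
  B = c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA"),
  stringsAsFactors = FALSE
)
l <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Keywords found among the space-separated words of A, and in B
in_A <- lapply(strsplit(df$A, " ", fixed = TRUE), intersect, l)
in_B <- lapply(df$B, intersect, l)

# Hypothetical helper: first element of each list entry, NA when it is empty
first_or_na <- function(x) {
  vapply(x, function(v) if (length(v) > 0) v[1] else NA_character_, character(1))
}

# Use the hit from A when there is one, otherwise the hit from B (or NA)
df$C <- ifelse(lengths(in_A) > 0, first_or_na(in_A), first_or_na(in_B))
df
```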