R: How to classify strings according to a set of regular expressions

Time: 2019-05-27 19:17:27

Tags: r regex string stringr

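I have a vector of text strings describing the sources of images in a book, but they are not in an entirely consistent format; they were captured from the LaTeX source and written to a text file.

I need to process these and collect information about source attribution and permission status. Example strings are shown below: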

strings <- '
GBD 2016 Alcohol Collaborators (2018)
\citet {DeBoldFriedman:2015}, permission required
Author graphic, using various public domain images
\url {http://www.histogames.com/HTML/inventaire/periodes-historiques/prehistoire.php}
\url {https://commons.wikimedia.org/wiki/File:Egyptmotionseries.jpg}
\url {http://commons.wikimedia.org}, public domain.
\citet {Plot:1685}, author image collection
Author image collection
 From \citet {Priestley:1765}, author image collection
 Author image collection
 Courtesy Library of Congress
\citet {Langren:1644}, image courtesy of the Koninklijke Bibliotheek van Belgi\"e 
Public domain
Author graphic
Author graphic
Author graphic
Author image collection
Wikipedia, https://commons.wikimedia.org/wiki/File:Tablas\_alfonsies.jpg, public domain
'

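A main task is to create a variable, status, that indicates the permission status by finding matches of simple regular expressions in the strings. The categories can be defined by the following patterns, all of which should be matched against the text of a string without regard to case.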

AUpat <- "author (image|graphic|collection)"
PDpat <- "public domain"
REQpat <- "permission required"
LIBpat <- "courtesy|by permission"

# or as a list:
pats <- list(
  'AU' = "author (image|graphic|collection)",
  'PD' = "public domain",
  'REQ'= "permission required",
  'LIB'= "courtesy|by permission"
)

Pseudocode for what I want to do (it does not work):

status <- rep("", length(strings))
for(i in seq_along(strings)) {
  if      (grep(AUpat, strings[i],  ignore.case=TRUE)) status[i] <- "AU"
  else if (grep(PDpat, strings[i],  ignore.case=TRUE)) status[i] <- "PD"
  else if (grep(REQpat, strings[i], ignore.case=TRUE)) status[i] <- "REQ"
  else if (grep(LIBpat, strings[i], ignore.case=TRUE)) status[i] <- "LIB"
}

This is ugly, and on top of that it produces an error:

Error in if (grep(AUpat, strings[i])) status[i] <- "AU" else if (grep(PDpat,  : 
  argument is of length zero

Is there a better way to approach this task, perhaps using stringr or other tidyverse tools?

1 answer:

Answer 0 (score: 1)

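From your patterns, it looks like a string can actually belong to zero or more categories, not just one. If so, the first step is to determine which categories (zero or more) each string belongs to.

Your strings isn't enough to really test this, so I'll generate some simple data and patterns: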

strings <- c("something", "something else", "nothing", "nothing here either",
             "something or nothing", "interesting",
             "something interesting", "nothing interesting")

pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")

(Naming the patterns is a convenience for when the patterns themselves become unwieldy as labels.) Start by generating a logical matrix:

m <- sapply(pats, grepl, strings, ignore.case = TRUE)
m
#         p1    p2    p3
# [1,]  TRUE FALSE FALSE
# [2,]  TRUE FALSE FALSE
# [3,] FALSE  TRUE FALSE
# [4,] FALSE  TRUE FALSE
# [5,]  TRUE  TRUE FALSE
# [6,] FALSE FALSE  TRUE
# [7,]  TRUE FALSE  TRUE
# [8,] FALSE  TRUE  TRUE
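
As an aside on the error in the question, here is a minimal sketch of why grepl fits this job where grep inside if() does not. The two variable names (hits_grep, hits_grepl) are made up for illustration; the pattern and string are taken from the question.

```r
# grep() returns the *indices* of matching elements; with no match that is
# integer(0), and if (integer(0)) fails with "argument is of length zero".
hits_grep <- grep("public domain", "Author graphic", ignore.case = TRUE)
length(hits_grep)   # 0

# grepl() always returns one TRUE/FALSE per input string, so it is safe in
# if() for a single string and vectorises cleanly over many strings.
hits_grepl <- grepl("public domain", "Author graphic", ignore.case = TRUE)
hits_grepl          # FALSE
```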

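If exactly one category must be applied to each string, then, assuming the patterns are listed in order of priority, you can do the following: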

apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 2

names(pats)[ apply(m, 1, function(a) head(c(which(a), NA), n = 1)) ]
# [1] "p1" "p1" "p2" "p2" "p1" "p3" "p1" "p2"
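
For illustration, the same first-match-wins idea can be applied to the patterns from the question. This is only a sketch: q_strings is a hand-picked subset of the question's examples (those without LaTeX backslashes), and the q_ names are hypothetical, chosen so they do not overwrite strings, pats, or m above.

```r
# A few of the question's example strings (subset chosen for illustration)
q_strings <- c("GBD 2016 Alcohol Collaborators (2018)",
               "Author graphic, using various public domain images",
               "Public domain",
               "Courtesy Library of Congress",
               "Author image collection")

# The question's patterns, in priority order
q_pats <- c(AU  = "author (image|graphic|collection)",
            PD  = "public domain",
            REQ = "permission required",
            LIB = "courtesy|by permission")

# One row per string, one column per pattern
q_m <- sapply(q_pats, grepl, q_strings, ignore.case = TRUE)

# The first matching pattern wins; NA when nothing matches
status <- names(q_pats)[apply(q_m, 1, function(a) head(c(which(a), NA), n = 1))]
status
# status is NA, "AU", "PD", "LIB", "AU"
```

The second string matches both AU and PD; it is labelled AU only because AU comes first in q_pats.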

Note: this produces NA when a string matches none of the patterns, as demonstrated here:

m[8,] <- FALSE
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1]  1  1  2  2  1  3  1 NA

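Keep this in mind in your assumptions. (I'll keep this change to m for now.)

If you need to preserve every category for each string, then how you do so depends on how you intend to store the classification. A start might be: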
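
If NA is inconvenient downstream, one option is to replace it with an explicit fallback label. A small sketch, where first_idx simply repeats the indices printed above and "none" is an arbitrary label chosen here:

```r
# First-match indices as printed above (string 8 has no match)
first_idx <- c(1, 1, 2, 2, 1, 3, 1, NA)

# Map indices to pattern names; NA indices give NA labels
labels <- c("p1", "p2", "p3")[first_idx]

# Replace NA with a fallback label ("none" is a made-up choice)
labels[is.na(labels)] <- "none"
labels
```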

str(m2 <- apply(m, 1, function(r) names(which(r))))
# List of 8
#  $ : chr "p1"
#  $ : chr "p1"
#  $ : chr "p2"
#  $ : chr "p2"
#  $ : chr [1:2] "p1" "p2"
#  $ : chr "p3"
#  $ : chr [1:2] "p1" "p3"
#  $ : chr(0) 
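
One caveat worth sketching: apply simplifies its result when every row yields the same number of matches, so you may get a character vector (or matrix) instead of a list. Assuming that matters for your downstream code, lapply over row indices always returns a list. The names m_same, simplified, and as_list are hypothetical.

```r
# A matrix in which every row matches exactly one pattern
m_same <- matrix(c(TRUE, FALSE,
                   FALSE, TRUE),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("p1", "p2")))

# apply() simplifies to a character vector here, not a list
simplified <- apply(m_same, 1, function(r) names(which(r)))
is.list(simplified)   # FALSE

# lapply() over row indices returns a list regardless of match counts
as_list <- lapply(seq_len(nrow(m_same)), function(i) names(which(m_same[i, ])))
is.list(as_list)      # TRUE
```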

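This is a direct storage: each element of the list corresponds to an element of strings and contains zero or more pattern names. If this is meant for human consumption, you may want to convert it to comma-separated sets: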

sapply(m2, paste, collapse = ",")
# [1] "p1"    "p1"    "p2"    "p2"    "p1,p2" "p3"    "p1,p3" ""     

Or, for more formal database-style storage, how about:

stringids <- seq_len(length(strings)) # perhaps you have something better?
d <- data.frame(
  stringid = rep(stringids, times = lengths(m2)),
  ptnmatch = unlist(m2),
  stringsAsFactors = FALSE
)
d
#   stringid ptnmatch
# 1        1       p1
# 2        2       p1
# 3        3       p2
# 4        4       p2
# 5        5       p1
# 6        5       p2
# 7        6       p3
# 8        7       p1
# 9        7       p3

Note that string 8 (which I modified to have no categories) is not included; this is by design. One can always force it back in with:

misses <- setdiff(stringids, unique(d$stringid))
misses
# [1] 8

d <- rbind(d, data.frame(stringid = misses, ptnmatch = rep(NA_character_, length(misses)),
                         stringsAsFactors = FALSE))
d
#    stringid ptnmatch
# 1         1       p1
# 2         2       p1
# 3         3       p2
# 4         4       p2
# 5         5       p1
# 6         5       p2
# 7         6       p3
# 8         7       p1
# 9         7       p3
# 10        8     <NA>
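
An alternative sketch, in case you would rather keep unmatched strings from the start: fill the empty list elements with NA before unlisting, so each string contributes at least one row. Here matches is a hand-written stand-in for the list of per-string pattern names shown earlier, and d_all is a hypothetical name.

```r
# Per-string pattern names; the last string matches nothing
matches <- list("p1", "p1", "p2", "p2", c("p1", "p2"), "p3",
                c("p1", "p3"), character(0))

# Give unmatched strings a single NA entry so they survive unlist()
matches[lengths(matches) == 0] <- NA_character_

# Long format: one row per (string, pattern) pair, NA for no match
d_all <- data.frame(
  stringid = rep(seq_along(matches), times = lengths(matches)),
  ptnmatch = unlist(matches),
  stringsAsFactors = FALSE
)
nrow(d_all)   # 10
```

This keeps the rows in string order, so no separate step is needed to append the misses.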