Question

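I have a vector of text strings describing the sources of the images in a book, but they are not in a fully consistent format; they were captured from the LaTeX source and written to a text file.

I need to process these strings to collect information on source attribution and permission status. Some example strings are: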

# A raw string (R >= 4.0.0) keeps the LaTeX backslashes literal;
# in an ordinary quoted string, \c and \u would be invalid escapes.
strings <- r"(
GBD 2016 Alcohol Collaborators (2018)
\citet {DeBoldFriedman:2015}, permission required
Author graphic, using various public domain images
\url {http://www.histogames.com/HTML/inventaire/periodes-historiques/prehistoire.php}
\url {https://commons.wikimedia.org/wiki/File:Egyptmotionseries.jpg}
\url {http://commons.wikimedia.org}, public domain.
\citet {Plot:1685}, author image collection
Author image collection
 From \citet {Priestley:1765}, author image collection
 Author image collection
 Courtesy Library of Congress
\citet {Langren:1644}, image courtesy of the Koninklijke Bibliotheek van Belgi\"e 
Public domain
Author graphic
Author graphic
Author graphic
Author image collection
Wikipedia, https://commons.wikimedia.org/wiki/File:Tablas\_alfonsies.jpg, public domain
)"
# Split into one string per line, trim whitespace, and drop empty lines
strings <- trimws(strsplit(strings, "\n", fixed = TRUE)[[1]])
strings <- strings[nzchar(strings)]

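A main task is to create a variable, status, that indicates the permission status by matching simple regular expressions against each string. The categories can be defined by the following patterns, all of which should match the text of the strings regardless of case.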

AUpat <- "author (image|graphic|collection)"
PDpat <- "public domain"
REQpat <- "permission required"
LIBpat <- "courtesy|by permission"

# or as a list:
pats <- list(
  'AU' = "author (image|graphic|collection)",
  'PD' = "public domain",
  'REQ'= "permission required",
  'LIB'= "courtesy|by permission"
)

Pseudocode for what I want to do (it does not work):

status <- rep("", length(strings))
for(i in seq_along(strings)) {
  if      (grep(AUpat, strings[i],  ignore.case=TRUE)) status[i] <- "AU"
  else if (grep(PDpat, strings[i],  ignore.case=TRUE)) status[i] <- "PD"
  else if (grep(REQpat, strings[i], ignore.case=TRUE)) status[i] <- "REQ"
  else if (grep(LIBpat, strings[i], ignore.case=TRUE)) status[i] <- "LIB"
}

This is ugly, and besides that, it produces an error:

Error in if (grep(AUpat, strings[i])) status[i] <- "AU" else if (grep(PDpat,  : 
  argument is of length zero

Is there a better way to approach this task, perhaps using stringr or other tidyverse tools?

Answer 1

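From your patterns, it looks like a string can actually belong to zero or more categories, not just one. If so, the first step is to determine which categories (zero or more) each string belongs to.

Your strings is not enough to really test this, so I'll generate some simple data and patterns: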

strings <- c("something", "something else", "nothing", "nothing here either",
             "something or nothing", "interesting",
             "something interesting", "nothing interesting")

pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")

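(Naming the patterns is convenient once the patterns themselves become unwieldy as labels.) Start by generating a logical matrix, with one row per string and one column per pattern: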

m <- sapply(pats, grepl, strings, ignore.case = TRUE)
m
#         p1    p2    p3
# [1,]  TRUE FALSE FALSE
# [2,]  TRUE FALSE FALSE
# [3,] FALSE  TRUE FALSE
# [4,] FALSE  TRUE FALSE
# [5,]  TRUE  TRUE FALSE
# [6,] FALSE FALSE  TRUE
# [7,]  TRUE FALSE  TRUE
# [8,] FALSE  TRUE  TRUE
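
As a quick check on the "zero or more" point, the row sums of that matrix give the number of categories each string matches. This is only a small sketch that rebuilds the same example data; n_matches is a name introduced here for illustration.

```r
strings <- c("something", "something else", "nothing", "nothing here either",
             "something or nothing", "interesting",
             "something interesting", "nothing interesting")
pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")

m <- sapply(pats, grepl, strings, ignore.case = TRUE)

# TRUE counts as 1, so each row sum is the number of patterns that string matches
n_matches <- rowSums(m)
n_matches

# Strings that fall into more than one category
strings[n_matches > 1]
```

Here three of the eight strings match two patterns, which is why a single-label classification needs a priority rule.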

If only one category may be applied to each string, then, assuming the patterns are listed in priority order, you can do the following:

apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 2

names(pats)[ apply(m, 1, function(a) head(c(which(a), NA), n = 1)) ]
# [1] "p1" "p1" "p2" "p2" "p1" "p3" "p1" "p2"

Note: when a string matches none of the patterns, this produces NA, as shown here:

m[8,] <- FALSE
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1]  1  1  2  2  1  3  1 NA

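Keep this in mind in your assumptions. (I'll keep this change to m for now.)

If you need to keep every category for each string, then how you do it depends on how you intend to store the classification. A start might be: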
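
A minimal sketch of turning those indices into labels while handling the no-match case. It rebuilds the example data, and the "none" label is a placeholder chosen here, not something from the question. The fallback index must be NA rather than 0: indexing with NA keeps the position as NA, whereas indexing with 0 silently drops it and leaves the result shorter than strings.

```r
strings <- c("something", "something else", "nothing", "nothing here either",
             "something or nothing", "interesting",
             "something interesting", "nothing interesting")
pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")

m <- sapply(pats, grepl, strings, ignore.case = TRUE)
m[8, ] <- FALSE   # pretend string 8 matches nothing, as above

# Index of the first matching pattern per row, or NA when nothing matches
first_idx <- apply(m, 1, function(a) head(c(which(a), NA), n = 1))

# NA indices yield NA labels, so status stays aligned with strings
status <- names(pats)[first_idx]
status[is.na(status)] <- "none"   # placeholder label (an assumption)
status
```

The result has one label per string, so it can be stored alongside strings directly.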

str(m2 <- apply(m, 1, function(r) names(which(r))))
# List of 8
#  $ : chr "p1"
#  $ : chr "p1"
#  $ : chr "p2"
#  $ : chr "p2"
#  $ : chr [1:2] "p1" "p2"
#  $ : chr "p3"
#  $ : chr [1:2] "p1" "p3"
#  $ : chr(0)

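This is direct storage: each element of the list corresponds to an element of strings and contains zero or more pattern names. If this is for human consumption, you may want to convert it to a comma-separated collection: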

sapply(m2, paste, collapse = ",")
# [1] "p1"    "p1"    "p2"    "p2"    "p1,p2" "p3"    "p1,p3" ""

Or, for more formal database-style storage, how about:

stringids <- seq_len(length(strings)) # perhaps you have something better?
d <- data.frame(
  stringid = rep(stringids, times = lengths(m2)),
  ptnmatch = unlist(m2),
  stringsAsFactors = FALSE
)
d
#   stringid ptnmatch
# 1        1       p1
# 2        2       p1
# 3        3       p2
# 4        4       p2
# 5        5       p1
# 6        5       p2
# 7        6       p3
# 8        7       p1
# 9        7       p3

Note that string 8 (which I modified to have no categories) is not included; this is by design. It can always be forced back in with:

misses <- setdiff(stringids, unique(d$stringid))
misses
# [1] 8

d <- rbind(d, data.frame(stringid = misses, ptnmatch = rep(NA, length(misses))),
           stringsAsFactors = FALSE)
d
#    stringid ptnmatch
# 1         1       p1
# 2         2       p1
# 3         3       p2
# 4         4       p2
# 5         5       p1
# 6         5       p2
# 7         6       p3
# 8         7       p1
# 9         7       p3
# 10        8     <NA>
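
For completeness, here is a sketch of the same approach applied to the patterns from the question, assuming first-match priority in the order AU, PD, REQ, LIB. Only a handful of the question's strings are used, typed as an ordinary character vector with doubled backslashes. The loop in the question fails because grep() returns integer(0) when nothing matches, and if() needs a single TRUE or FALSE; grepl() returns a logical value instead.

```r
pats <- c(AU  = "author (image|graphic|collection)",
          PD  = "public domain",
          REQ = "permission required",
          LIB = "courtesy|by permission")

# A few strings copied from the question (backslashes escaped)
strings <- c("GBD 2016 Alcohol Collaborators (2018)",
             "\\citet {DeBoldFriedman:2015}, permission required",
             "Author graphic, using various public domain images",
             "\\url {http://commons.wikimedia.org}, public domain.",
             "\\citet {Plot:1685}, author image collection",
             "Courtesy Library of Congress")

# One row per string, one column per category
m <- sapply(pats, grepl, strings, ignore.case = TRUE)

# First matching category in priority order; NA when no pattern matches
first_idx <- apply(m, 1, function(a) head(c(which(a), NA), n = 1))
status <- names(pats)[first_idx]
status
# status is: NA, "REQ", "AU", "PD", "AU", "LIB"
```

The third string matches both AU and PD, so the order of pats decides which label it gets; the first string matches nothing and stays NA.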

R: How to classify strings according to a set of regular expressions
