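I have a character vector of text strings describing the sources of images in a book. They are not in a fully consistent format, because they were captured from the LaTeX source and written to a text file.
I need to process these strings and collect information about source attribution and permission status. Some example strings: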
# Raw string (R >= 4.0.0) so the LaTeX backslashes need no escaping
strings <- r"(
GBD 2016 Alcohol Collaborators (2018)
\citet {DeBoldFriedman:2015}, permission required
Author graphic, using various public domain images
\url {http://www.histogames.com/HTML/inventaire/periodes-historiques/prehistoire.php}
\url {https://commons.wikimedia.org/wiki/File:Egyptmotionseries.jpg}
\url {http://commons.wikimedia.org}, public domain.
\citet {Plot:1685}, author image collection
Author image collection
From \citet {Priestley:1765}, author image collection
Author image collection
Courtesy Library of Congress
\citet {Langren:1644}, image courtesy of the Koninklijke Bibliotheek van Belgi\"e
Public domain
Author graphic
Author graphic
Author graphic
Author image collection
Wikipedia, https://commons.wikimedia.org/wiki/File:Tablas\_alfonsies.jpg, public domain
)"
# Split into one element per line and drop the empty leading element
strings <- strsplit(strings, "\n")[[1]]
strings <- strings[nzchar(strings)]
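One main task is to create a variable, status, that indicates the permission status by matching simple regular expressions against the strings. The categories can be defined by the following patterns, all of which should match the text of a string regardless of case.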
AUpat <- "author (image|graphic|collection)"
PDpat <- "public domain"
REQpat <- "permission required"
LIBpat <- "courtesy|by permission"
# or as a list:
pats <- list(
'AU' = "author (image|graphic|collection)",
'PD' = "public domain",
'REQ'= "permission required",
'LIB'= "courtesy|by permission"
)
Pseudocode for what I want (it does not work):
status <- rep("", length(strings))
for(i in seq_along(strings)) {
if (grep(AUpat, strings[i], ignore.case=TRUE)) status[i] <- "AU"
else if (grep(PDpat, strings[i], ignore.case=TRUE)) status[i] <- "PD"
else if (grep(REQpat, strings[i], ignore.case=TRUE)) status[i] <- "REQ"
else if (grep(LIBpat, strings[i], ignore.case=TRUE)) status[i] <- "LIB"
}
This is ugly, and it also produces an error:
Error in if (grep(AUpat, strings[i])) status[i] <- "AU" else if (grep(PDpat, :
argument is of length zero
Is there a better way to carry out this task, perhaps using stringr or other tidyverse tools?
Answer 0 (score: 1)
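From your patterns, it looks as though a string can belong to zero or more categories, not just one. If so, the first step is to determine which categories (zero or more) each string has.
Your strings is not enough to really test this, so I'll generate some simple data and patterns: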
strings <- c("something", "something else", "nothing", "nothing here either",
"something or nothing", "interesting",
"something interesting", "nothing interesting")
pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")
(Naming the patterns is convenient once the patterns themselves become awkward to use as labels.) Start by generating a logical matrix:
m <- sapply(pats, grepl, strings, ignore.case = TRUE)
m
# p1 p2 p3
# [1,] TRUE FALSE FALSE
# [2,] TRUE FALSE FALSE
# [3,] FALSE TRUE FALSE
# [4,] FALSE TRUE FALSE
# [5,] TRUE TRUE FALSE
# [6,] FALSE FALSE TRUE
# [7,] TRUE FALSE TRUE
# [8,] FALSE TRUE TRUE
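The matrix above relies on grepl rather than grep, which is also why the loop in the question fails. A minimal sketch of the difference, using a made-up one-element input (x and no_match are illustrative names, not from the question):

```r
x <- "Public domain"

# grep() returns the indices of matching elements, so no match gives integer(0)
no_match <- grep("permission required", x, ignore.case = TRUE)
length(no_match)
# [1] 0

# if (no_match) ... would stop with "argument is of length zero".
# grepl() always returns one TRUE/FALSE per input element instead.
grepl("permission required", x, ignore.case = TRUE)
# [1] FALSE
grepl("public domain", x, ignore.case = TRUE)
# [1] TRUE
```

Because grepl gives one logical value per string, it can be used safely inside if, or vectorised over all strings at once as in the matrix above.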
If you must apply only one category to each string, and assuming the patterns are listed in priority order, you can do the following:
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 2
names(pats)[ apply(m, 1, function(a) head(c(which(a), NA), n = 1)) ]
# [1] "p1" "p1" "p2" "p2" "p1" "p3" "p1" "p2"
Note: this produces NA when a string contains none of the patterns, as demonstrated here:
m[8,] <- FALSE
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 NA
Keep this in mind in your assumptions. (I'll keep this change to m for now.)
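As a rough sketch of how the single-category approach might map onto the question's own patterns, the following uses five of the example strings, retyped here as a character vector with doubled backslashes. The priority order AU, PD, REQ, LIB is simply the order in which the question lists the patterns, which is an assumption about what is wanted; first is an illustrative name.

```r
# A few strings from the question (backslashes doubled so R reads them literally)
strings <- c(
  "GBD 2016 Alcohol Collaborators (2018)",
  "\\citet {DeBoldFriedman:2015}, permission required",
  "Author graphic, using various public domain images",
  "\\url {http://commons.wikimedia.org}, public domain.",
  "Courtesy Library of Congress"
)

# Patterns from the question, in assumed priority order
pats <- c(AU  = "author (image|graphic|collection)",
          PD  = "public domain",
          REQ = "permission required",
          LIB = "courtesy|by permission")

# One row per string, one column per pattern
m <- sapply(pats, grepl, strings, ignore.case = TRUE)

# Index of the first matching pattern in each row, or NA if none matches
first <- apply(m, 1, function(a) head(c(which(a), NA), n = 1))
status <- names(pats)[first]
status
# [1] NA    "REQ" "AU"  "PD"  "LIB"
```

The third string matches both AU and PD and is labelled AU only because AU comes first in pats; reordering pats changes which category wins.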
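If you need to keep all of the categories for each string, then how you proceed depends on how you intend to store the classification. A start might be: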
str(m2 <- apply(m, 1, function(r) names(which(r))))
# List of 8
# $ : chr "p1"
# $ : chr "p1"
# $ : chr "p2"
# $ : chr "p2"
# $ : chr [1:2] "p1" "p2"
# $ : chr "p3"
# $ : chr [1:2] "p1" "p3"
# $ : chr(0)
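This is a direct form of storage: each element of the list corresponds to one element of strings and contains zero or more pattern names. If this is for human consumption, you may want to convert it to a comma-separated set: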
sapply(m2, paste, collapse = ",")
# [1] "p1" "p1" "p2" "p2" "p1,p2" "p3" "p1,p3" ""
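One caveat with building the list via apply: when every row happens to return the same number of names, apply simplifies the result to a vector or matrix instead of a list. A small sketch with a made-up two-row matrix (m_one, simplified and as_list are illustrative names) shows the effect and an lapply variant that always returns a list:

```r
# Hypothetical matrix in which every row matches exactly one pattern
m_one <- cbind(p1 = c(TRUE, FALSE), p2 = c(FALSE, TRUE))

# apply() simplifies equal-length results, so this is a character vector
simplified <- apply(m_one, 1, function(r) names(which(r)))
is.list(simplified)
# [1] FALSE

# Looping over row indices with lapply() always yields a list
as_list <- lapply(seq_len(nrow(m_one)), function(i) names(which(m_one[i, ])))
is.list(as_list)
# [1] TRUE
```

With the data in this answer the rows have differing numbers of matches, so apply does return a list there; the lapply form is only a safeguard for inputs where that may not hold.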
For more formal, database-style storage, how about:
stringids <- seq_len(length(strings)) # perhaps you have something better?
d <- data.frame(
stringid = rep(stringids, times = lengths(m2)),
ptnmatch = unlist(m2),
stringsAsFactors = FALSE
)
d
# stringid ptnmatch
# 1 1 p1
# 2 2 p1
# 3 3 p2
# 4 4 p2
# 5 5 p1
# 6 5 p2
# 7 6 p3
# 8 7 p1
# 9 7 p3
Note that string 8 (which I modified to have no categories) is not included; this is by design. It can always be forced back in with:
misses <- setdiff(stringids, unique(d$stringid))
misses
# [1] 8
d <- rbind(d, data.frame(stringid = misses, ptnmatch = rep(NA, length(misses))),
stringsAsFactors = FALSE)
d
# stringid ptnmatch
# 1 1 p1
# 2 2 p1
# 3 3 p2
# 4 4 p2
# 5 5 p1
# 6 5 p2
# 7 6 p3
# 8 7 p1
# 9 7 p3
# 10 8 <NA>
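An alternative way to bring back the unmatched strings, offered only as a sketch, is a left join of the full set of ids against the long table using base R's merge. The data frame is rebuilt here from the output shown above so the snippet runs on its own; full is an illustrative name.

```r
stringids <- 1:8

# Long table as printed above, before string 8 was added back
d <- data.frame(
  stringid = rep(1:7, times = c(1, 1, 1, 1, 2, 1, 2)),
  ptnmatch = c("p1", "p1", "p2", "p2", "p1", "p2", "p3", "p1", "p3"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every id on the left, filling unmatched rows with NA
full <- merge(data.frame(stringid = stringids), d, by = "stringid", all.x = TRUE)
nrow(full)
# [1] 10
full[full$stringid == 8, "ptnmatch"]
# [1] NA
```

This avoids computing misses by hand, at the cost of merge sorting the result by stringid rather than preserving the original row order.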