I have a data frame containing file paths and names in the following format:
files_list <- c(
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt",
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy.txt",
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
"C:/User/Name/Folder/Subfolder1/fileB.txt",
"C:/User/Name/Folder/file.C.txt",
"C:/User/Name/Folder/file-D.txt",
"C:/User/Name/Folder/file",
"C:/User/Name/Folder/file Z.txt",
"C:/User/Name/Folder/file - backup.txt"
)
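Each file has a parent folder and a name. The names may contain one or more periods "." and/or dashes "-". In addition, some have a "Copy" marker, a numeric suffix, and/or a file extension. I would like to transform the data into something like this: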
[1] "Sub-subfolder file txt"
[2] "Sub-subfolder file Copy txt"
[3] "Sub-subfolder file 1 txt"
[4] "Sub-subfolder file Copy 2 txt"
[5] "Subfolder1 fileB txt"
[6] "Folder file.C txt"
[7] "Folder file-D txt"
[8] "Folder file"
[9] "Folder file Z txt"
[10] "Folder file - backup txt"
This is the code that I thought should do the job:
sub(
"(^.:/)([^/.]+/)*([^/.]+/)([^/]+)(\\s-\\sCopy)?(\\s\\(([0-9]+)\\))?(\\.([^.]+))?$",
"\\3 \\4 \\5 \\7 \\9",
files_list
)
But this is what I get:
[1] "Sub-subfolder/ file.txt "
[2] "Sub-subfolder/ file - Copy.txt "
[3] "Sub-subfolder/ file (1).txt "
[4] "Sub-subfolder/ file - Copy (2).txt "
[5] "Subfolder1/ fileB.txt "
[6] "Folder/ file.C.txt "
[7] "Folder/ file-D.txt "
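I can deal with the slash "/" and the extra spaces, but the "Copy" marker, the numeric suffix and the file extension are not being separated as I expected.
Any suggestions on how to identify the "Copy" marker, the numeric suffix and the file extension? Or should I just identify the parent folder in one line of code and then split the rest in another?
(Ultimately, I am turning these text strings into a data frame in which the folder, file name, copy marker and extension are separate columns. I'm sure I could use tidyr::separate to do that, but that also requires understanding regular expressions, and I want to learn how to use () and back-references.)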
Answer 0 (score: 2)
This might help:
library(tools)
as.data.frame(cbind(dirname(files_list), file_path_sans_ext(basename(files_list)), file_ext(files_list)))
# V1 V2 V3
#1 C:/User/Name/Folder/Subfolder1/Sub-subfolder file txt
#2 C:/User/Name/Folder/Subfolder1/Sub-subfolder file - Copy txt
#3 C:/User/Name/Folder/Subfolder1/Sub-subfolder file (1) txt
#4 C:/User/Name/Folder/Subfolder1/Sub-subfolder file - Copy (2) txt
#5 C:/User/Name/Folder/Subfolder1 fileB txt
#6 C:/User/Name/Folder file.C txt
#7 C:/User/Name/Folder file-D txt
#8 C:/User/Name/Folder file
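If only the immediate parent folder is wanted rather than the full directory, one option is to wrap dirname() in basename(). The following is a minimal sketch rather than a definitive solution; the column names folder, name and ext and the shortened files_list are assumptions made for illustration.

```r
library(tools)  # ships with R; provides file_ext() and file_path_sans_ext()

# Shortened sample input (a subset of the paths in the question)
files_list <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file"
)

parts <- data.frame(
  folder = basename(dirname(files_list)),            # last directory only
  name   = file_path_sans_ext(basename(files_list)), # file name without extension
  ext    = file_ext(files_list),                     # "" when there is no extension
  stringsAsFactors = FALSE
)
parts
```

This still leaves the "Copy" marker and the number inside the name column, so those would need a further step.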
Answer 1 (score: 1)
I'm still not sure whether you need the result as strings. If so, something like this:
gsub("[/().]| - "," ",sub(".*?([^/]+/[^/]+$)","\\1",files_list))
[1] "Sub-subfolder file txt"
[2] "Sub-subfolder file Copy txt"
[3] "Sub-subfolder file 1 txt"
[4] "Sub-subfolder file Copy 2 txt"
[5] "Subfolder1 fileB txt"
[6] "Folder file C txt"
[7] "Folder file-D txt"
[8] "Folder file"
If you need just a single pattern, then:
pattern="[^/]+(?=/[^/]+$)|\\w+(?=[ ).-])|\\w+$"
regmatches(files_list,gregexpr(pattern,files_list,perl = TRUE))
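Note that regmatches() with gregexpr() returns a list with one character vector of tokens per path. To get back to one string per file, the tokens can be pasted together. This is only a sketch on two of the sample paths; the names tokens and collapsed are made up for illustration.

```r
# Two of the sample paths from the question
files_list <- c(
  "C:/User/Name/Folder/Subfolder1/fileB.txt",
  "C:/User/Name/Folder/file-D.txt"
)

pattern <- "[^/]+(?=/[^/]+$)|\\w+(?=[ ).-])|\\w+$"

# One character vector of matched tokens per path
tokens <- regmatches(files_list, gregexpr(pattern, files_list, perl = TRUE))

# Join each vector of tokens into a single space-separated string
collapsed <- vapply(tokens, paste, character(1), collapse = " ")
collapsed
# [1] "Subfolder1 fileB txt" "Folder file D txt"
```

As the second result shows, this pattern treats the dash in "file-D" as a separator, so the dash itself is dropped.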
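Answer 2 (score: 0)
Apologies if this is not the best way to go about it. I have realized that my question was incomplete, so I want to make it more complete and also share the solution I came up with.
I wanted the code to handle every possible name structure:
I used this code to generate sample file names/paths covering all combinations of folder, name, copy marker, number and extension: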
files.df <- expand.grid(
c("C:/"),
c("", "F1/", "F1/F2/"),
c("folder/"),
c("file"),
c("", " space", "-dash", " - spacedash", ".period", ".firstperiod.secondperiod"),
c("", 1, " 1", 10, " 10"),
c("", " - Copy"),
c("", " (1)", " (10)"),
c("", ".999", ".aaa"),
stringsAsFactors = FALSE
)
# Concatenate the columns of each row into a single path string
x <- do.call(paste0, files.df)
After a lot of trial and error with regex101 (thanks, @Onyambu!), I put together the following ridiculous regular expression, which actually works:
sum(grepl(
"^.:/(([^/]+)(?=/)/?)*(?<=/)(([^/](?! - Copy| \\([0-9]+\\)|\\.[^/\\.]+$))+.)( - )?((?<= - )Copy(?= \\([0-9]+\\)(?=\\.[^/\\.]+$|$)|\\.[^/\\.]+$|$))?( \\()?((?<= \\()([0-9]+)\\)(?=\\.[^/\\.]+$|$))?\\.?((?<=\\.)([^/\\.]+))?$",
x,
perl = TRUE
))
[1] 1620
length(x)
[1] 1620
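Unfortunately, this regular expression contains 10 capture groups and I can only back-reference 9 of them (and #10 is the file extension). So I will go with @RHertel's more elegant solution. But if anyone finds a way to reduce the number of capture groups, please let me know!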
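One possible way to cut down the number of capture groups is to use non-capturing groups (?:...) for everything that does not need a back-reference, and to make the file-name group lazy ([^/]+?) so that it does not swallow the " - Copy", " (n)" and extension parts. In the pattern from the question the greedy ([^/]+) consumes the whole file name because all the groups after it are optional. The sketch below is an assumption-laden illustration, not a drop-in replacement: it was worked through only against the ten sample paths from the question, not against the 1620 generated ones, and the names pattern, raw and result are invented here.

```r
files_list <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy.txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/Subfolder1/fileB.txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file-D.txt",
  "C:/User/Name/Folder/file",
  "C:/User/Name/Folder/file Z.txt",
  "C:/User/Name/Folder/file - backup.txt"
)

# Capture groups: 1 = parent folder, 2 = name, 3 = "Copy", 4 = number, 5 = extension.
# Everything else is wrapped in (?:...) so it does not use up a back-reference.
pattern <- paste0(
  "^.:/(?:[^/]+/)*",         # drive and any leading folders (not captured)
  "([^/]+)/",                # 1: immediate parent folder, slash left outside
  "([^/]+?)",                # 2: file name, lazy so the optional parts can match
  "(?: - (Copy))?",          # 3: optional copy marker
  "(?: \\(([0-9]+)\\))?",    # 4: optional number in parentheses
  "(?:\\.([^./]+))?$"        # 5: optional extension
)

raw <- sub(pattern, "\\1 \\2 \\3 \\4 \\5", files_list, perl = TRUE)

# Empty groups leave extra spaces behind; trim and squeeze them
result <- gsub(" +", " ", trimws(raw))
result
```

With only five groups, every part stays within the \\1 to \\9 range that sub() can back-reference.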