Extracting and matching sets from a list of file names

Time: 2019-11-05 15:10:11

Tags: r sorting filtering

I have a dataset of more than 4,000 images. To work out the code, I moved a small subset of them into a separate folder.

The files look like this:

folder

[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"

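I can't strip off the part of the name before -ch#, because that information is important. What I want to do is filter this list of images and return only the sets that have all four ch values (ch1-4), for example r01c02f10p01.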

I originally thought this could be approached along these lines:

ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")

and then applying these lists through the file.remove function, similar to the following:

final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5") 
file.remove(folder,final2) 

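However, creating a new variable for each ch value splits the files apart. I'm not sure how to use those variables to actually tell whether a single image has all four ch values, so that I can filter my images meaningfully. I'm a bit stuck, because the other sources I've looked at deal with problems that don't quite match this one.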

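Previously, I was able to remove all images with ch5 from an image set like this. I thought that might help with keeping only the images that have ch1-ch4, but I'm not sure how to proceed.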

##Create folder variable which has all image files 
folder <- list.files(getwd())

##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5") 

##Remove final2 from folder
file.remove(folder,final2) 

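Summary: I want to go from an arbitrary assortment of files with incomplete ch values (i.e. possibly only ch1 and ch2, or ch3 and ch4, or all of ch1, ch2, ch3 and ch4) to an assortment that contains only complete sets (four files, with ch1, ch2, ch3 and ch4).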

1 answer:

Answer 0: (score: 1)

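Starting from a vector of file names, as you would get from list.files or similar, you can create a data frame of the file names and use a regex to extract the alphanumeric part at the beginning and the digit that follows "-ch". Then check, for each group, that every element of the expected set occurs among that group's ch values (I put the expected set in ch_set, but you may need to do this some other way).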
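To make the extraction step concrete, here is a minimal base R sketch of what the regex pulls out of each name. It is only an illustration: it uses sub() rather than the tidyr call in this answer, the names pattern, group and ch are placeholders, and it assumes every file name ends in a single-digit channel followed by .tiff, as in the listing in the question.

```r
# Two file names taken from the listing in the question
files <- c("r01c01f01p01-ch3.tiff", "r01c02f10p01-ch1.tiff")

# Assumed layout: <alphanumeric group>-ch<one digit>.tiff
pattern <- "^([[:alnum:]]+)-ch([0-9])\\.tiff$"

# First capture: the part before "-ch", identifying the image set
group <- sub(pattern, "\\1", files)

# Second capture: the channel digit, converted from character to numeric
ch <- as.numeric(sub(pattern, "\\2", files))

group
#> [1] "r01c01f01p01" "r01c02f10p01"
ch
#> [1] 3 1
```

A name that does not match the assumed pattern is returned unchanged by sub(), so on a real folder it may be worth checking with grepl(pattern, files) first.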

The code below gives you a data frame of the complete groups, from which you can then pull the matching file names:

# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")

library(dplyr)

ch_set <- 1:4

files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>% 
  filter(all(ch_set %in% ch))

files_to_keep
#> # A tibble: 8 x 3
#> # Groups:   group [2]
#>   filename              group           ch
#>   <chr>                 <chr>        <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01     1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01     2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01     3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01     4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01     1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01     2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01     3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01     4

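To get just the matching file names, pull out the filename column:

files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"

One thing to note: without the mutate line, where I convert ch to numeric, you would be comparing character versions of those numbers with regular numeric versions, and %in% coerces them to matching types. That didn't seem entirely safe if you need to scale this up, so I convert them so the types match explicitly.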
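The coercion point can be seen in isolation with a small sketch. The zero-padded values are a hypothetical case that does not occur in the file names above; they are included only to show one way the implicit conversion could go wrong. The variable names are placeholders.

```r
ch_set <- 1:4

# What the regex capture yields before as.numeric(): character digits
ch_chr <- c("1", "2", "3", "4")

# %in% is built on match(), which coerces both sides to a common type,
# so the integer set still matches the character values
implicit_ok <- all(ch_set %in% ch_chr)
implicit_ok
#> [1] TRUE

# Hypothetical zero-padded captures: 1 is compared as "1", which is not "01"
padded <- c("01", "02", "03", "04")
implicit_padded <- all(ch_set %in% padded)
implicit_padded
#> [1] FALSE

# Converting first makes the comparison numeric on both sides
explicit_padded <- all(ch_set %in% as.numeric(padded))
explicit_padded
#> [1] TRUE
```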