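Batch renaming of files and folders is a frequently asked question, but after some searching I don't think any of the existing questions matches mine.
Background: we send biological samples to a service provider, which returns files with unique names plus a text-format table listing each file name and the sample it came from: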
head samples.txt
fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz S1726_A_3_t A 3 t L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz S1731_A_GFP_c A GFP c L2354_A_GFP_c 163 5
L2377_Track-3893_R1.fastq.gz S1754_B_7_c B 7 c L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz S1739_B_GFP_t B GFP t L2362_B_GFP_t 163 6
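For illustration, a table like this can be read with Python's csv module. This is only a sketch, assuming the file is tab-delimited and that Sample_name holds space-separated parts (as in "B 7 t"); the helper name read_samples and the inline sample text are made up for the example.

```python
import csv
import io

# Two rows copied from the listing above, written with explicit tabs.
# Assumption: the real samples.txt is tab-delimited like this.
SAMPLE_TEXT = (
    "fq_file\tSample_ID\tSample_name\tLibrary_ID\tFC_Number\tTrack_Lanes_Pos\n"
    "L2369_Track-3885_R1.fastq.gz\tS1746_B_7_t\tB 7 t\tL2369_B_7_t\t163\t6\n"
    "L2349_Track-3865_R1.fastq.gz\tS1726_A_3_t\tA 3 t\tL2349_A_3_t\t163\t5\n"
)


def read_samples(handle):
    """Return one dict per data row, keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_samples(io.StringIO(SAMPLE_TEXT))
for row in rows:
    print(row["fq_file"], "->", row["Sample_name"])
```

With the real file, the same function would be called on an open file handle, for example `with open("samples.txt", newline="") as handle: rows = read_samples(handle)`.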
Directory structure (34 directories):
L2369_Track-3885_
    accepted_hits.bam
    deletions.bed
    junctions.bed
    logs
    accepted_hits.bam.bai
    insertions.bed
    left_kept_reads.info
L2349_Track-3865_
    accepted_hits.bam
    deletions.bed
    junctions.bed
    logs
    accepted_hits.bam.bai
    insertions.bed
    left_kept_reads.info
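Goal: because the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and the folders to the corresponding sample name, with its parts reordered in a more suitable way. The result should look like this: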
7_t_B
    7_t_B.bam
    deletions.bed
    junctions.bed
    logs
    7_t_B.bam.bai
    insertions.bed
    left_kept_reads.info
3_t_A
    3_t_A.bam
    deletions.bed
    junctions.bed
    logs
    accepted_hits.bam.bai
    insertions.bed
    left_kept_reads.info
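The renaming rule implied by these listings can be written down as two small functions. This is a sketch only: folder_for and new_name_for are hypothetical names, and the rule assumes that the folder is the fastq file name with the trailing R1.fastq.gz removed (as in the directory listing) and that the sample name has exactly three space-separated parts.

```python
def folder_for(fq_file, suffix="R1.fastq.gz"):
    """Folder name as in the directory listing: fastq name minus the suffix."""
    if not fq_file.endswith(suffix):
        raise ValueError("unexpected file name: %r" % fq_file)
    return fq_file[: -len(suffix)]


def new_name_for(sample_name):
    """Reorder 'B 7 t' into '7_t_B': parts two and three first, then part one."""
    parts = sample_name.split()
    if len(parts) != 3:
        raise ValueError("expected three parts, got %r" % sample_name)
    first, second, third = parts
    return "_".join((second, third, first))


print(folder_for("L2369_Track-3885_R1.fastq.gz"))
print(new_name_for("B 7 t"))
```

Raising on unexpected input is a deliberate choice here: a row that does not fit the assumed pattern stops the run instead of producing a wrong folder name.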
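I have hacked together a solution with bash and Python (I'm a newbie), but it feels over-engineered. My question is whether I have missed a simpler or more elegant way. Solutions can be in Python, bash, or R, and possibly awk, since I'm trying to learn it. Being a relative beginner does tend to make things more complicated than they need to be.
Here is my solution:
A wrapper puts everything in place and gives an idea of the workflow: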
#! /bin/bash
# select columns of interest and write them to a file - basenames
tail -n +2 samples.txt | cut -d$'\t' -f1,3 > BAMfilames.txt
# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py
# finally do the renaming
./renameBam.sh
# and the folders too
./renameBamFolder.sh
renameBamFiles.py:
#! /usr/bin/env python
import re
# Read in the data sample file and create a bash file that will rename the tophat output
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
#
# Set the input file name
# (The program must be run from within the directory
# that contains this data file)
InFileName = 'BAMfilames.txt'
### Rename BAM files
# Open the input file for reading
InFile = open(InFileName, 'r')
# Open the output file for writing
OutFileName = 'renameBam.sh'
OutFile = open(OutFileName, 'w')  # 'w' overwrites; you can append instead with 'a'
OutFile.write("#! /bin/bash" + "\n")
OutFile.write(" " + "\n")
# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line = Line.strip('\n')
    ## Separate the line into a list of its tab-delimited components
    ElementList = Line.split('\t')
    # separate the folder string from the experimental name
    fileroot = ElementList[1]
    fileroot = fileroot.split()
    # create variable names using regex
    folderName = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName = folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command = "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)
    print(command)
    OutFile.write(command + "\n")
# After the loop is completed, close the files
InFile.close()
OutFile.close()
### Rename folders
# Open the input file for reading
InFile = open(InFileName, 'r')
# Open the output file for writing
OutFileName = 'renameBamFolder.sh'
OutFile = open(OutFileName, 'w')
OutFile.write("#! /bin/bash" + "\n")
OutFile.write(" " + "\n")
# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line = Line.strip('\n')
    ## Separate the line into a list of its tab-delimited components
    ElementList = Line.split('\t')
    # separate the folder string from the experimental name
    fileroot = ElementList[1]
    fileroot = fileroot.split()
    # create variable names using regex
    folderName = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName = folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command = "mv %s %s" % (folderName, fileName)
    print(command)
    OutFile.write(command + "\n")
# After the loop is completed, close the files
InFile.close()
OutFile.close()
renameBam.sh, created by the Python script above:
#! /bin/bash
for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)
renameBamFolder.sh is very similar:
mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B
Since I'm learning, I feel that some examples of different approaches to this problem would be very useful.
Answer 0: (score: 2)
A simple way in bash:
find . -type d -print |
while IFS= read -r oldPath; do
    parent=$(dirname "$oldPath")
    old=$(basename "$oldPath")
    # index() does a literal prefix match, so "." (the start directory) and
    # other regex metacharacters in a name cannot match every row
    new=$(awk -v old="$old" 'index($1, old) == 1 {print $4"_"$5"_"$3}' samples.txt)
    if [ -n "$new" ]; then
        newPath="${parent}/${new}"
        echo mv "$oldPath" "$newPath"
        echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
    fi
done
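After initial testing, remove the "echo" so that it actually executes the "mv".
If all the target directories are at one level, as @triplee's answer implies, it's even simpler. Just cd to their parent directory and run: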
awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
    echo mv "$old" "$new"
    echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done
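In one of your expected outputs you renamed the ".bai" file and in the other you didn't, and you didn't say whether you want that. If you do want to rename it, just add
echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"
to whichever of the solutions above you prefer.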
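The same "print first, rename later" idea can be sketched in Python. This is an illustration rather than a drop-in replacement: plan_renames and apply_plan are hypothetical names, the mapping from old folder name to new name is assumed to have been built from samples.txt already, and the demo runs in a temporary directory.

```python
import os
import tempfile


def plan_renames(root, mapping):
    """List (source, target) pairs: .bam/.bam.bai files first, then the folder."""
    plan = []
    for old, new in sorted(mapping.items()):
        folder = os.path.join(root, old)
        if not os.path.isdir(folder):
            continue  # folder not present, nothing to do
        for ext in (".bam", ".bam.bai"):
            src = os.path.join(folder, "accepted_hits" + ext)
            if os.path.exists(src):
                plan.append((src, os.path.join(folder, new + ext)))
        plan.append((folder, os.path.join(root, new)))
    return plan


def apply_plan(plan, dry_run=True):
    """Print every step; only touch the disk when dry_run is False."""
    for src, dst in plan:
        print("mv", src, dst)
        if not dry_run:
            os.rename(src, dst)


# Demo in a throwaway directory so nothing real is renamed.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "L2369_Track-3885_"))
for name in ("accepted_hits.bam", "accepted_hits.bam.bai", "deletions.bed"):
    open(os.path.join(demo_root, "L2369_Track-3885_", name), "w").close()

plan = plan_renames(demo_root, {"L2369_Track-3885_": "7_t_B"})
apply_plan(plan)                 # dry run: prints only
apply_plan(plan, dry_run=False)  # performs the renames
```

Unlike the bash loops, this sketch renames the files inside a folder before renaming the folder itself, so every source path in the plan is valid at the moment it is used.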
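Answer 1: (score: 0)
Of course, you can do this in Python alone, and it can yield a small, readable script.
First: read the samples.txt file and create a mapping from the existing file prefix to the desired prefix. The file is not formatted for use with Python's csv reader module, because the column separator also appears inside one of the data columns.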
mapping = {}
with open("samples.txt") as samples:
    # throw away headers
    samples.readline()
    for line in samples:
        # separate the columns, splitting on whitespace
        # (either space sequences or tabs)
        fields = line.split()
        # skip blank or malformed lines:
        if len(fields) < 6:
            continue
        # whitespace splitting breaks the sample name ("B 7 t") into three fields,
        # so Sample_name, Library_ID and FC_Number below actually hold its three parts
        fq_file, sample_id, Sample_name, Library_ID, FC_Number, track_lanes_pos, *other = fields
        # the [:-2] part throws away the "R1" suffix, as in the example above
        file_prefix = fq_file.split(".")[0][:-2]
        target_id = "_".join((Library_ID, FC_Number, Sample_name))
        mapping[file_prefix] = target_id
Then go through the directory names and, inside each one that is in the mapping, rename the files containing ".bam".
import os
for entry in os.listdir("."):
    if entry in mapping:
        dir_prefix = "./" + entry + "/"
        for file_entry in os.listdir(dir_prefix):
            if ".bam" in file_entry:
                # keep everything after the first ".bam" (for example ".bai")
                parts = file_entry.split(".bam")
                parts[0] = mapping[entry]
                new_name = ".bam".join(parts)
                os.rename(dir_prefix + file_entry, dir_prefix + new_name)
        # rename the directory itself once its files are done
        os.rename(entry, mapping[entry])
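The suffix handling in that loop can be checked in isolation before touching any files. A small sketch, with bam_target_name as a hypothetical helper; unlike the `".bam" in file_entry` test, it only changes names that start with accepted_hits.bam, which is an assumption about which files should be renamed.

```python
def bam_target_name(file_name, new_stem, old_stem="accepted_hits"):
    """Swap the stem of accepted_hits.bam[.bai], leaving other names untouched."""
    marker = old_stem + ".bam"
    if file_name == marker or file_name.startswith(marker + "."):
        return new_stem + file_name[len(old_stem):]
    return file_name


for name in ("accepted_hits.bam", "accepted_hits.bam.bai", "deletions.bed"):
    print(name, "->", bam_target_name(name, "7_t_B"))
```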
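Answer 2: (score: 0)
It seems you can simply read the required fields from the index file in a simple while loop. The structure of the file isn't obvious, so I'm assuming it is whitespace-delimited and that Sample_Id is really four fields (the composite sample_id, then the three components of the name). Perhaps you have a tab-delimited file with internal spaces in the Sample_Id field? Anyway, this should be easy to adapt if my assumption is wrong.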
# Skip the annoying field names (the header is the first line)
tail -n +2 samples.txt |
while read -r fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done
If the output looks like what you want, take out the echo.
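The uncertainty about the file layout can also be handled in code. The sketch below, with the hypothetical helper name_parts, accepts either layout: if a line contains tabs, the third column is taken as the whole name and split on spaces; otherwise tokens three to five are used. Both example lines are constructed for the illustration.

```python
def name_parts(line):
    """Return (fq_file, [c, a, b]) for either layout of one data line."""
    line = line.rstrip("\n")
    if "\t" in line:
        # Tab-delimited: the third column holds the whole name, e.g. "B 7 t"
        columns = line.split("\t")
        return columns[0], columns[2].split()
    # Whitespace-delimited: the name occupies tokens three to five
    tokens = line.split()
    return tokens[0], tokens[2:5]


tab_line = "L2369_Track-3885_R1.fastq.gz\tS1746_B_7_t\tB 7 t\tL2369_B_7_t\t163\t6\n"
space_line = "L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6\n"

for example in (tab_line, space_line):
    fq, (c, a, b) = name_parts(example)
    print(fq, "->", "_".join((a, b, c)))
```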
Answer 3: (score: 0)
Here is one way using a shell script. Run it like this:
script.sh /path/to/samples.txt /path/to/data
Contents of script.sh:
# add directory names to an array
while IFS= read -r -d '' dir; do
    dirs+=("$dir")
done < <(find "$2"/* -type d -print0)
# process the sample list
# (this assumes every column, including the three parts of the
# sample name, is separated by a tab)
while IFS=$'\t' read -r -a list; do
    for i in "${dirs[@]}"; do
        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then
            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            # only change name if there's a bam file
            if [ -f "$i/accepted_hits.bam" ]; then
                mv "$i" "$new"
                mv "$new/accepted_hits.bam" "$new/$tag.bam"
            fi
        fi
    done
done < <(tail -n +2 "$1")
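Answer 4: (score: 0)
Although it isn't exactly what you asked for (just a thought): you might consider an alternate "view" of the file system, using the term "view" in the sense of a database view. You can do this with a "file system in user space" (FUSE). Many existing utilities could be used for this, but I don't know of one that works generally on any set of files and is aimed specifically at renaming or reorganizing. As a concrete example of how it might be used, though, pytagsfs creates a virtual (FUSE) file system based on rules you define, making the directory structure of the files appear however you want. (Maybe that would work for you too, but pytagsfs is really intended for media files.) You could then operate on that (virtual) file system with whatever programs normally access the data. Or, to make the virtual directory structure permanent (if pytagsfs has no option for that), simply copy the virtual file system into another directory outside the virtual file system.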