Batch-renaming files and folders (base names) using an "index" file

Time: 2013-02-20 13:06:41

Tags: python r bash awk

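Batch renaming of files and folders is a frequently asked question, but after some searching I don't think any of the existing questions matches my case.

Background: we send biological samples to a service provider, which returns files with unique names plus a plain-text table that lists each file name and the sample it was produced from: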

head samples.txt
fq_file Sample_ID   Sample_name Library_ID  FC_Number   Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz    S1746_B_7_t B 7 t   L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz    S1726_A_3_t A 3 t   L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz    S1731_A_GFP_c   A GFP c L2354_A_GFP_c   163 5
L2377_Track-3893_R1.fastq.gz    S1754_B_7_c B 7 c   L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz    S1739_B_GFP_t   B GFP t L2362_B_GFP_t   163 6
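
For illustration only, here is a minimal Python sketch of reading the two columns that matter from such a table. It assumes the file is tab-separated, with the spaces in "B 7 t" belonging to the Sample_name column; `read_samples` and the inline demo data are hypothetical names made up for this example.

```python
import csv
import io


def read_samples(handle):
    """Return (fq_file, Sample_name) pairs from a tab-separated table."""
    # Assumption: columns are separated by tabs and the first line is a header.
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["fq_file"], row["Sample_name"]) for row in reader]


# Inline stand-in for samples.txt, so the sketch runs without the real file.
demo = io.StringIO(
    "fq_file\tSample_ID\tSample_name\tLibrary_ID\tFC_Number\tTrack_Lanes_Pos\n"
    "L2369_Track-3885_R1.fastq.gz\tS1746_B_7_t\tB 7 t\tL2369_B_7_t\t163\t6\n"
)
pairs = read_samples(demo)
print(pairs)
```

With the real file, the same function could be called as `read_samples(open("samples.txt", newline=""))`.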

Directory structure (34 directories):

L2369_Track-3885_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
L2349_Track-3865_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info

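Goal: the file names are meaningless and hard to interpret, so I want to rename the files ending in .bam (keeping the suffix) and the folders after the corresponding sample name, with its parts reordered in a more suitable way. The result should look like this: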

7_t_B
   7_t_B.bam      
   deletions.bed   
   junctions.bed         
   logs
   7_t_B.bam.bai  
   insertions.bed  
   left_kept_reads.info
3_t_A
   3_t_A.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
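
The naming rule itself can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a finished tool: it assumes a directory is named like its fq_file entry with the trailing "R1.fastq.gz" removed (as in the listing above), and that the new name is the Sample_name with its first part moved to the end. `old_dir_name` and `new_name` are hypothetical helper names.

```python
def old_dir_name(fq_file):
    """Derive the directory name from an fq_file entry (assumed rule)."""
    suffix = "R1.fastq.gz"
    if not fq_file.endswith(suffix):
        raise ValueError("unexpected file name: " + fq_file)
    return fq_file[: -len(suffix)]


def new_name(sample_name):
    """Move the first part of the sample name to the end: 'B 7 t' -> '7_t_B'."""
    first, *rest = sample_name.split()
    return "_".join(rest + [first])


print(old_dir_name("L2369_Track-3885_R1.fastq.gz"), "->", new_name("B 7 t"))
```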

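I have hacked together a solution with bash and Python (I'm a newbie), but it feels over-engineered. My question is whether I have missed a simpler or more elegant way. Solutions can be in Python, bash, or R, and possibly awk, since I am trying to learn it. Being a relative beginner does tend to make things more complicated than they need to be.

Here is my solution:

A wrapper puts everything in place and gives an idea of the workflow: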

#! /bin/bash

# select columns of interest and write them to a file - basenames
# (overwrite with > so that re-runs do not accumulate duplicate lines)
tail -n +2 samples.txt |  cut -d$'\t' -f1,3 > BAMfilames.txt 

# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py

# finally do the renaming
./renameBam.sh

# and the folders too
./renameBamFolder.sh

renameBamFiles.py:

#! /usr/bin/env python
import re

# Read in the data sample file and create a bash file that will rename the tophat output 
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
# 

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
InFileName = 'BAMfilames.txt'


### Rename BAM files

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBam.sh'

OutFile=open(OutFileName,'w') # You can append instead with 'a'

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)

    print(command)
    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()


### Rename folders

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBamFolder.sh'

OutFile=open(OutFileName,'w') 

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "mv %s %s" % (folderName, fileName)

    print(command)

    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()

renameBam.sh, created by the Python script above:

#! /bin/bash

for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)

renameBamFolder.sh is very similar:

mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B

Since I am learning, I feel that some examples of different approaches to this would be very useful.

5 answers:

Answer 0: (score: 2)

A simple way to do it in bash:

find . -mindepth 1 -type d -print |
while IFS= read -r oldPath; do

   parent=$(dirname "$oldPath")
   old=$(basename "$oldPath")
   new=$(awk -v old="$old" '$1~"^"old{print $4"_"$5"_"$3}' samples.txt)

   if [ -n "$new" ]; then
      newPath="${parent}/${new}"
      echo mv "$oldPath" "$newPath"
      echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
   fi
done

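After an initial test, remove the "echo" so that the "mv" commands are actually executed.

If all the target directories are at one level, as @triplee's answer implies, it is even simpler. Just cd to their parent directory and run: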

awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
   echo mv "$old" "$new"
   echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done

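In one of your expected outputs you renamed the ".bai" file and in the other you did not, and you did not say whether you want that. If you do want to rename it, just add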

echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"

to whichever of the solutions above you prefer.
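
The same dry-run idea carries over to Python. The following is only a sketch and not part of the shell solutions above: `plan_renames` and `apply_plan` are hypothetical helper names, and the code assumes each old directory may contain accepted_hits.bam and accepted_hits.bam.bai. Unlike the shell loop, it renames the files first and the directory last, so every source path is still valid when it is used.

```python
import tempfile
from pathlib import Path


def plan_renames(root, mapping):
    """Build (source, destination) pairs: files first, then the directory."""
    plan = []
    for old, new in mapping.items():
        folder = Path(root) / old
        if not folder.is_dir():
            continue  # not one of the sample directories
        for suffix in (".bam", ".bam.bai"):
            src = folder / ("accepted_hits" + suffix)
            if src.exists():
                plan.append((src, folder / (new + suffix)))
        plan.append((folder, Path(root) / new))
    return plan


def apply_plan(plan, dry_run=True):
    """Print every rename; only perform it when dry_run is False."""
    for src, dst in plan:
        print("mv", src, dst)
        if not dry_run:
            src.rename(dst)


# Demo on a throwaway directory tree instead of real data.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "L2369_Track-3885_"
    folder.mkdir()
    (folder / "accepted_hits.bam").touch()
    plan = plan_renames(tmp, {"L2369_Track-3885_": "7_t_B"})
    apply_plan(plan, dry_run=True)
    untouched = folder.is_dir()  # the dry run must not move anything
    apply_plan(plan, dry_run=False)
    renamed = (Path(tmp) / "7_t_B" / "7_t_B.bam").is_file()
```

Keeping the plan as plain data makes it easy to inspect or log before anything is moved.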

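Answer 1: (score: 0)

Of course, you can do this in Python alone, and it makes for a small, readable script.

First, read the samples.txt file and build a mapping from the existing file prefixes to the desired prefixes. The file is not formatted in a way that suits Python's csv reader module, because the column separator also appears inside the last data column.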

mapping = {}
with open("samples.txt") as samples:
   # throw away the header line
   samples.readline()
   for line in samples:
       # split the columns on runs of whitespace (spaces or tabs)
       fields = line.split()
       # skip blank or malformed lines
       if len(fields) < 6: 
           continue
       # whitespace splitting breaks the sample name ("B 7 t") into three
       # fields, so fields 3-5 hold its parts rather than the header's columns
       fq_file, sample_id, name_a, name_b, name_c, *other = fields
       # the [:-2] part throws away the "R1" suffix, as in the example above
       file_prefix = fq_file.split(".")[0][:-2]
       target_id = "_".join((name_b, name_c, name_a))
       mapping[file_prefix] = target_id

Then go through the directory names and, inside each matching directory, rename the files whose names contain ".bam".

import os
for entry in os.listdir("."):
     if entry in mapping:
         dir_prefix = "./" + entry + "/"
         for file_entry in os.listdir(dir_prefix):
              if ".bam" in file_entry:
                   parts = file_entry.split(".bam")
                   parts[0] = mapping[entry]
                   new_name = ".bam".join(parts)

                   os.rename(dir_prefix + file_entry, dir_prefix + new_name)
         os.rename(entry, mapping[entry])
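
To see why replacing only the part before ".bam" keeps suffixes such as ".bam.bai" intact, the renaming step can be isolated into a small function. This is a sketch using `str.partition` rather than the split/join above; `rename_bam` is a hypothetical name.

```python
def rename_bam(file_name, new_stem):
    """Replace everything before the first '.bam' with new_stem."""
    stem, sep, rest = file_name.partition(".bam")
    if not sep:
        return file_name  # no ".bam" in the name: leave it alone
    return new_stem + sep + rest


for name in ("accepted_hits.bam", "accepted_hits.bam.bai", "deletions.bed"):
    print(name, "->", rename_bam(name, "7_t_B"))
```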

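Answer 2: (score: 0)

It seems you could simply read the required fields from the index file in a plain while loop. The structure of the file is not obvious, so I am assuming it is whitespace-separated and that Sample_Id is really four fields (the complex sample_id, followed by the three components of the name). Perhaps you have a tab-delimited file with internal spaces in the Sample_Id field? Either way, this should be easy to adapt if my assumptions are wrong.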

# Skip the annoying field names (the header is line 1, so start at line 2)
tail -n +2 samples.txt |
while read -r fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done

If the output looks like what you want, take out the echo.

Answer 3: (score: 0)

Here's one way using a shell script. Run it like:

script.sh /path/to/samples.txt /path/to/data

Contents of script.sh:

# add directory names to an array
while IFS= read -r -d '' dir; do

    dirs+=("$dir")

done < <(find "$2"/* -type d -print0)


# process the sample list
while IFS=$'\t' read -r -a list; do

    for i in "${dirs[@]}"; do

        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then

            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            bam="$new/accepted_hits.bam"

            # only change name if there's a bam file
            if [ -f "$i/accepted_hits.bam" ]; then

                mv "$i" "$new"
                mv "$bam" "$new/$tag.bam"
            fi
        fi
    done

done < <(tail -n +2 "$1")

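Answer 4: (score: 0)

Although it isn't exactly what you asked for (just a thought): you might consider an alternate "view" of your file system, using the term "view" in the sense of a database view. You could do this via a "file system in user space", FUSE. Many existing utilities could be used for this, but I don't know of one that works generally on any set of files and is meant specifically for renaming or reorganizing. As a concrete example of how it could be used, though, pytagsfs creates a virtual (FUSE) file system based on rules you define, making the directory structure of the files appear however you want. (Maybe that would work for you too, but pytagsfs is really meant for media files.) You could then simply operate on that (virtual) file system with whatever programs normally access the data. Or, to make the virtual directory structure permanent (if pytagsfs has no option for that), just copy the virtual file system into another directory outside the virtual file system.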
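
A much lighter way to get a similar "view" without FUSE is a directory of symbolic links. To be clear, this swaps in a different technique from pytagsfs and is only a sketch: `build_view` is a hypothetical name, the mapping from old to new names is assumed to exist already, and the files inside each directory keep their original names.

```python
import tempfile
from pathlib import Path


def build_view(data_root, view_root, mapping):
    """Create view_root/<new name> symlinks pointing at data_root/<old name>."""
    view_root = Path(view_root)
    view_root.mkdir(parents=True, exist_ok=True)
    created = []
    for old, new in mapping.items():
        target = Path(data_root) / old
        if target.is_dir():
            link = view_root / new
            link.symlink_to(target.resolve(), target_is_directory=True)
            created.append(link)
    return created


# Demo on a throwaway tree; the original directory is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    (data / "L2369_Track-3885_").mkdir(parents=True)
    (data / "L2369_Track-3885_" / "accepted_hits.bam").touch()
    links = build_view(data, Path(tmp) / "view", {"L2369_Track-3885_": "7_t_B"})
    visible = (Path(tmp) / "view" / "7_t_B" / "accepted_hits.bam").is_file()
    original_kept = (data / "L2369_Track-3885_").is_dir()
```

Because nothing is moved, the provider's original names stay available for cross-checking against samples.txt.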