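My goal is to concatenate the files in a folder based on a string in the middle of the file name, ideally using Python or bash. To simplify the problem, here is an example:
I would like to concatenate based on the value after the first dash but before the second (e.g. X128 or X1324), so that I am left with (in this example) two additional files containing the concatenated contents of the individual files:
Any help would be greatly appreciated.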
Answer 0 (score: 1)
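For simple string manipulation I prefer to avoid regular expressions; I think str.split() is enough here. Likewise, for simple file name matching, the fnmatch library provides all the functionality needed.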
import fnmatch
import os
from itertools import groupby

path = '/full/path/to/files/'
ext = ".fastq"
files = fnmatch.filter(os.listdir(path), '*' + ext)

def by(fname): return fname.split('-')[1]  # e.g. X128

# You said:
#   I would like to concatenate based on the value after the first dash
#   but before the second (e.g. X128 or X1324)
# If you want to keep both parts together, uncomment the following:
# def by(fname): return '-'.join(fname.split('-')[:2])  # e.g. P16C-X128

# groupby only groups adjacent items, so sort by the same key first
for k, g in groupby(sorted(files, key=by), key=by):
    dst = str(k) + '-Concat' + ext
    with open(os.path.join(path, dst), 'w') as dstf:
        for fname in g:
            with open(os.path.join(path, fname), 'r') as srcf:
                dstf.write(srcf.read())
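To make the filtering and grouping step easier to follow, here is a minimal sketch that runs on an in-memory list instead of a real directory. The file names are hypothetical (only the P16C-X128 style prefix is taken from the comments above), and group_names is a helper name introduced purely for illustration.

```python
import fnmatch
from itertools import groupby

def by(fname):
    # Second dash-separated field of the name, e.g. 'X128'
    return fname.split('-')[1]

def group_names(names, ext='.fastq'):
    # Keep only names with the wanted extension, then group them by key.
    # groupby merges adjacent items only, hence the sort on the same key.
    wanted = fnmatch.filter(names, '*' + ext)
    return {k: list(g) for k, g in groupby(sorted(wanted, key=by), key=by)}

# Hypothetical directory listing, for illustration only
names = [
    'P16C-X128-22-H3.fastq',
    'P16C-X1324-14-H1.fastq',
    'P16C-X128-23-H4.fastq',
    'notes.txt',
]
groups = group_names(names)
for key, members in groups.items():
    print(key, members)
```

Filtering first matters here: a name without a dash, such as notes.txt, would make by() raise an IndexError.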
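Instead of reading the files with Python, you can delegate the concatenation to the operating system. You would typically use a bash command like this: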
cat *-X128-*.fastq > X128.fastq
Using the subprocess library:
import subprocess

# Reuses path, ext, files, by and groupby from the snippet above
for k, g in groupby(sorted(files, key=by), key=by):
    dst = str(k) + '-Concat' + ext
    with open(os.path.join(path, dst), 'w') as dstf:
        command = ['cat']  # +++
        for fname in g:
            command.append(os.path.join(path, fname))  # +++
        subprocess.run(command, stdout=dstf)  # +++
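If cat is not available (for example on Windows), a similar effect can probably be had in pure Python without loading each file fully into memory. The sketch below swaps in shutil.copyfileobj, which is not part of the answer above; concat_files and the demo file names are hypothetical.

```python
import os
import shutil
import tempfile

def concat_files(sources, dst):
    # Stream every source file into dst in chunks, in the given order
    with open(dst, 'wb') as dstf:
        for src in sources:
            with open(src, 'rb') as srcf:
                shutil.copyfileobj(srcf, dstf)

# Small demo in a throwaway directory (file names are made up)
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, 'P16C-X128-22-H3.fastq')
    second = os.path.join(tmp, 'P16C-X128-23-H4.fastq')
    with open(first, 'wb') as f:
        f.write(b'first\n')
    with open(second, 'wb') as f:
        f.write(b'second\n')
    out = os.path.join(tmp, 'X128-Concat.out')
    concat_files([first, second], out)
    with open(out, 'rb') as f:
        result = f.read()
print(result)
```

Binary mode is used so the bytes are copied unchanged, much as cat would do.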
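Also, for a batch job like this, you should consider putting the concatenated files in a separate directory, which is easy to do by changing the dst file name.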
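One way to do that, sketched under assumptions: build dst inside a subdirectory of path. The helper concat_destination and the directory name 'concat' are hypothetical. A practical reason to separate the outputs is that a file such as X128-Concat.fastq also matches '*.fastq', so a second run over the same directory would pick it up as an input.

```python
import os
import tempfile

def concat_destination(path, key, ext, subdir='concat'):
    # 'subdir' is an assumed name; the directory is created if missing
    out_dir = os.path.join(path, subdir)
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, str(key) + '-Concat' + ext)

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    dst = concat_destination(tmp, 'X128', '.fastq')
    created = os.path.isdir(os.path.dirname(dst))
    name = os.path.basename(dst)
    parent = os.path.basename(os.path.dirname(dst))
print(parent, name)
```

In the loops above, the returned path would take the place of os.path.join(path, dst) when opening the destination file.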
Answer 1 (score: 0)
You can use open to read and write (create) files, os.listdir to get all files (and directories) in a given directory, and re to match file names as needed.
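Use a dictionary to store the contents keyed by file name prefix (the first two hyphen-separated values of the name, i.e. everything before the second hyphen) and concatenate the contents together.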
import os
import re

contents = {}
file_extension = "fastq"

# Get all files and directories that are in current working directory
for file_name in os.listdir('./'):
    # Use '.' so it doesn't match directories
    if file_name.endswith('.' + file_extension):
        # Match the first 2 hyphen-separated values from file name
        prefix_match = re.match(r"^([^-]+-[^-]+)", file_name)
        if prefix_match is None:
            continue  # skip names that contain no hyphen
        file_prefix = prefix_match.group(1)
        # Read the file and concatenate contents with previous contents
        contents[file_prefix] = contents.get(file_prefix, '')
        with open(file_name, 'r') as the_file:
            contents[file_prefix] += the_file.read() + '\n'

# Create new file for each file id and write contents to it
for file_prefix in contents:
    file_contents = contents[file_prefix]
    with open(file_prefix + '-Concat.' + file_extension, 'w') as the_file:
        the_file.write(file_contents)
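The prefix matching can be tried in isolation, without touching any files. This is a minimal sketch with hypothetical file names; prefix_of and group_by_prefix are names introduced here for illustration. Since re.match returns None for a name that contains no hyphen, the sketch skips such names.

```python
import re

# First two hyphen-separated values, e.g. 'P16C-X128'
PREFIX_RE = re.compile(r'^([^-]+-[^-]+)')

def prefix_of(file_name):
    match = PREFIX_RE.match(file_name)
    return match.group(1) if match else None

def group_by_prefix(file_names):
    # Map each prefix to the list of names that share it
    groups = {}
    for name in file_names:
        prefix = prefix_of(name)
        if prefix is None:
            continue  # no hyphen in the name, nothing to group on
        groups.setdefault(prefix, []).append(name)
    return groups

# Hypothetical names, for illustration only
names = [
    'P16C-X128-22-H3.fastq',
    'P16C-X128-23-H4.fastq',
    'P16C-X1324-14-H1.fastq',
    'readme.fastq',
]
groups = group_by_prefix(names)
print(groups)
```

Note that this key keeps both leading fields together (P16C-X128), whereas the question asks for the second field alone; which one is appropriate depends on whether the first field varies between files.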