Question

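I am trying to convert a 'fastq' file to a tab-delimited file using python3. Here is the input (lines 1-4 are one record, which I need to print in tab-delimited format). I am trying to read each record into a list object: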

@SEQ_ID
GATTTGGGGTT
+
!''*((((***
@SEQ_ID
GATTTGGGGTT
+
!''*((((***

Using:

data = open('sample3.fq')
fq_record = data.read().replace('@', ',@').split(',')
for item in fq_record:
        print(item.replace('\n', '\t').split('\t'))

The output is:

['']
['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '']
['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '', '']

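I get an empty line at the start of the output, and I don't understand why. I know this can be done in many other ways, but I need to figure out the reason because I am learning python. Thanks.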

Answer 1

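When you replace @ with ,@, you add a comma at the start of the string (because it starts with @). Then, when you split on the comma, there is nothing before the first comma, so the split gives you an empty string as its first element. What happens is basically this: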

>>> print(',x'.split(','))
['', 'x']

If you know your data always starts with @, you can skip the empty record in the loop. Just use for item in fq_record[1:].
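A minimal sketch of the effect and of two ways to drop the empty first element. It uses an in-memory string in place of the file, and the names `text`, `pieces`, `records`, and `rows` are illustrative rather than taken from the question:

```python
# Stand-in for the contents of the FASTQ file (two records).
text = "@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n"

# The string starts with '@', so after the replacement it starts with ','.
pieces = text.replace('@', ',@').split(',')
print(repr(pieces[0]))  # the empty string that sits before the first comma

# Option 1: slice off the first element, as suggested above.
records = pieces[1:]

# Option 2: keep only non-empty pieces; this also works when the text
# does not start with '@'.
records_filtered = [piece for piece in pieces if piece]

# Build one tab-separated row per record. strip() removes the trailing
# newline so that splitlines() yields exactly the four fields.
rows = ['\t'.join(record.strip().splitlines()) for record in records]
for row in rows:
    print(row)
```

Both options give the same two records here; the filter is simply less dependent on how the input begins.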

Answer 2

You can also go line by line instead of replacing everything at once:

import io

# in-memory stand-in for the opened fastq file
fobj = io.StringIO("""@SEQ_ID
GATTTGGGGTT
+
!''*((((***
@SEQ_ID
GATTTGGGGTT
+
!''*((((***""")

data = []
entry = []
for raw_line in fobj:
    line = raw_line.strip()
    if line.startswith('@'):
        if entry:
            data.append(entry)
        entry = []
    entry.append(line)
data.append(entry)

data looks like this:

[['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
 ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"]]
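Since the goal in the question is tab-delimited output, each inner list can then be joined with a tab. A short sketch, assuming `data` has the shape shown above (the literal list below stands in for it, and `lines` is an illustrative name):

```python
# Stand-in for the `data` list built by the loop above.
data = [
    ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
    ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
]

# Join the four fields of each record with a tab character.
lines = ['\t'.join(entry) for entry in data]
print('\n'.join(lines))
```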

Answer 3

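Thanks everyone for the answers. As a beginner, my main problem was the blank line that appeared with .split(','), which I now understand conceptually. So my first useful program in python is: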

# this script converts a .fastq file into .fasta format

import sys 
# Usage statement:
print('\nUsage: fq2fasta.py input-file output-file\n=========================================\n\n')

# define a function for fasta formatting
def format_fasta(name, sequence):
    fasta_string = '>' + name + "\n" + sequence + '\n'
    return fasta_string

# open the file for reading
data = open(sys.argv[1])
# open the file for writing
fasta = open(sys.argv[2], 'wt')
# feed all fastq records into a list
fq_records = data.read().replace('@', ',@').split(',')

# iterate through list objects
for item in fq_records[1:]: # this is to avoid the first line which is created as blank by .split() function
    line = item.replace('\n', '\t').split('\t')
    name = line[0][1:]  # drop the leading '@' so the header is '>SEQ_ID', not '>@SEQ_ID'
    sequence = line[1]      
    fasta.write(format_fasta(name, sequence))
fasta.close()
data.close()

The other suggestions in the answers will become clearer to me as I learn more. Thanks again.
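One caveat with splitting on '@' or ',': both characters can legitimately appear in a FASTQ quality line, so real data may be cut in the wrong place. A hedged sketch of an alternative that relies only on each record occupying exactly four lines, as in the sample input (wrapped multi-line sequences are not handled). The function name `fastq_to_fasta` and the in-memory files are illustrative:

```python
import io
from itertools import islice


def fastq_to_fasta(fastq_file, fasta_file):
    """Copy the name and sequence of each 4-line FASTQ record to FASTA."""
    count = 0
    while True:
        # Take the next four lines: header, sequence, '+', quality.
        record = [line.rstrip('\n') for line in islice(fastq_file, 4)]
        if len(record) < 4:
            break  # end of input (or a trailing partial record)
        header, sequence = record[0], record[1]
        # Drop the leading '@' and write a '>' FASTA header instead.
        fasta_file.write('>' + header[1:] + '\n' + sequence + '\n')
        count += 1
    return count


# In-memory stand-ins for the input and output files.
# The second quality line contains '@' and ',' on purpose.
source = io.StringIO(
    "@SEQ_1\nGATTTGGGGTT\n+\n!''*((((***\n"
    "@SEQ_2\nACGTACGTACG\n+\n@,'*((((***\n"
)
target = io.StringIO()
n_records = fastq_to_fasta(source, target)
print(n_records)
print(target.getvalue())
```

With real files, the same function could be called inside `with open(...)` blocks so that both files are closed automatically.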

.split() creates an empty line in python3
