Question

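At the risk of losing reputation, I don't know what else to do. My file doesn't show any hidden characters, and I've tried every .replace and .strip I can think of. My file is UTF-8 encoded and I'm using Python 3.6.1. I have a file in the following format: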

 >header1
 AAAAAAAA
 TTTTTTTT
 CCCCCCCC
 GGGGGGGG

 >header2
 CCCCCC
 TTTTTT
 GGGGGG
 AAAAAA

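I'm trying to remove the newlines from the ends of the lines so that each record's sequence becomes one continuous string. (The real file is thousands of lines long.) My code is redundant because I threw in everything I could think of to remove newlines: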

 fref = open(ref)
 for line in fref:
     sequence = 0
     header = 0
     if line.startswith('>'):
          header = ''.join(line.splitlines())
          print(header)
     else:
          sequence = line.strip("\n").strip("\r")
          sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
          print(len(sequence))

The output is:

 >header1
 8
 8
 8
 8
 >header2
 6
 6
 6
 6

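But if I go in manually and delete the line endings so that each sequence is one continuous string, it is reported as a single string of the full length.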

Expected output:

 >header1
 32
 >header2
 24

Thanks in advance for your help, Dennis

Answer 1

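There are several ways to parse this kind of input. In all cases, I'd suggest keeping the open and print side effects outside the function, so that you can unit test it and convince yourself that it behaves correctly.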

You can iterate over each line and handle the blank-line and end-of-file cases separately. Here, I use yield statements to return the values:

def parse(infile):
    for line in infile:
        if line.startswith(">"):
            total = 0
            yield line.strip()
        elif not line.strip():
            yield total
        else:
            total += len(line.strip())
    if line.strip():
        yield total

def test_parse():
    with open("input.txt") as infile:
        assert list(parse(infile)) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

Alternatively, you can handle blank lines and end of file together. Here, I use an output list to which I append the headers and totals:

def parse(infile):
    output = []
    while True:
        line = infile.readline()
        if line.startswith(">"):
            total = 0
            header = line.strip()
        elif line and line.strip():
            total += len(line.strip())
        else:
            output.append(header)
            output.append(total)
            if not line:
                break

    return output

def test_parse():
    with open("input.txt") as infile:
        assert parse(infile) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

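Alternatively, you can split the whole input file into blocks separated by blank lines and parse each block on its own. Here, I write the results to an output stream; in production, you could pass the sys.stdout stream, for example: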

import re
def parse(infile, outfile):
    content = infile.read()
    for block in re.split(r"\r?\n\r?\n", content):
        header, *lines = re.split(r"\s+", block)
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(
            header=header,
            total=total,
        ))

from io import StringIO
def test_parse():
    with open("input.txt") as infile:
        outfile = StringIO()
        parse(infile, outfile)
        outfile.seek(0)
        assert outfile.readlines() == [
            ">header1\n",
            "32\n",
            ">header2\n",
            "24\n",
        ]
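
For the production case, the stream version might be wired up roughly as follows. This is only a sketch: `parse` is repeated from the third approach so the snippet runs on its own, and `main` and its `path` argument are hypothetical names introduced for illustration.

```python
import re
import sys


def parse(infile, outfile):
    # Same logic as the third approach: split the text on blank lines,
    # then treat the first token of each block as the header.
    content = infile.read()
    for block in re.split(r"\r?\n\r?\n", content):
        header, *lines = re.split(r"\s+", block)
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(header=header, total=total))


def main(path):
    # Hypothetical entry point: the only place that touches the
    # filesystem and the terminal.
    with open(path) as infile:
        parse(infile, sys.stdout)
```

Because `parse` only needs objects with `read()` and `write()`, the same function accepts a real file plus `sys.stdout` in production and in-memory streams in tests.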

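Note that, for brevity, my tests use open("input.txt"), but in practice I'd recommend passing a StringIO(...) instance instead: it makes the input under test easier to see, avoids hitting the filesystem, and makes the tests faster.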
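
As a rough sketch of that suggestion, the test below feeds the generator version of `parse` an in-memory `StringIO` instead of a file. The function is repeated from the first approach so the snippet runs on its own; the sample text and the name `SAMPLE` are made up for illustration.

```python
from io import StringIO

# Hypothetical sample input, visible right next to the assertion.
SAMPLE = (
    ">header1\n"
    "AAAAAAAA\n"
    "TTTTTTTT\n"
    "\n"
    ">header2\n"
    "CCCCCC\n"
    "GGGGGG\n"
)


def parse(infile):
    # Generator version from the first approach.
    for line in infile:
        if line.startswith(">"):
            total = 0
            yield line.strip()
        elif not line.strip():
            yield total
        else:
            total += len(line.strip())
    if line.strip():
        yield total


def test_parse():
    infile = StringIO(SAMPLE)  # behaves like an open text file
    assert list(parse(infile)) == [">header1", 16, ">header2", 12]


test_parse()
```

The input now sits beside the expected values, so a failing assertion can be checked by eye without opening a separate file.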

Answer 2

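As I understand your question, you want something like this. Note how the sequence is built up over several iterations of the loop, since you want to combine multiple lines.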

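with open(ref) as f:
    sequence = ""  # sequence accumulated for the current record
    header = None
    for line in f:
        if line.startswith('>'):
            if header:
                print(header)         # print the previous header
                print(len(sequence))  # print the previous sequence length
            sequence = ""             # reset sequence
            header = line.strip()     # store header without its newline
        else:
            sequence += line.strip()  # append line to sequence
    if header:
        print(header)                 # no header follows the last record,
        print(len(sequence))          # so print it once the loop ends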
