Question

I am trying to edit the format of a file that currently looks like this:

    >Cluster 0
    L07510
    >Cluster 1
    AF480591
    AY457083
    >Cluster 2
    M88154
    >Cluster 3
    CP000924
    L09161
    >Cluster 4
    AY742307
    >Cluster 5
    L09163
    L09162
    >Cluster 6
    AF321086
    >Cluster 7
    DQ666175
    >Cluster 8
    DQ288691

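I'd like to write something in Python that goes through each line, stops at a line that says ">Cluster x" (where x is a number), and then adds that number to every line that follows it. Then, when it reaches a new ">Cluster x", it starts again with the new value of x.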

So it would look like this:

    >Cluster 0
    0 L07510
    >Cluster 1
    1 AF480591
    1 AY457083
    >Cluster 2
    2 M88154
    >Cluster 3
    3 CP000924
    3 L09161
    >Cluster 4
    4 AY742307
    >Cluster 5
    5 L09163
    5 L09162
    >Cluster 6
    6 AF321086
    >Cluster 7
    7 DQ666175
    >Cluster 8
    8 DQ288691

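I was thinking I could use a regex to search for ">Cluster x" (would the regular expression be something like this? ('\>Cluster \d+')) and then have the program add whatever \d+ matched to every line that follows the matching line. I'm just not sure how to write this. Any help would be greatly appreciated!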

Answer 1

Tested:

#!/usr/bin/env python
# On a POSIX-compliant system, if this script is marked as executable, the
# shebang line above (it must be the first line of the file) makes the
# Python interpreter run this file rather than the shell

# We need the sys module to read arguments from the terminal
import sys

# Prepare a variable for the cluster number to be used within the loop
cluster = ''

# Open the input file; the default mode is 'r', read-only, which is a safe
# default, and the with block closes the file once the loop is done
with open(sys.argv[1]) as infile:
    # loop through all lines in the file, using a generator expression that
    # strips surrounding whitespace (including the newline) from each line
    for line in (line.strip() for line in infile):
        if line.startswith('>'):
            # str.split() splits on whitespace by default
            # we want the cluster number at index 1
            cluster = line.split()[1]

            # output this line to stdout unmodified
            print(line)

        else:
            # output any other line modified by adding the cluster number
            print(cluster + ' ' + line)

Usage

$ python cluster_format.py input.txt > output.txt

Answer 2

Oh, I love parsing.

Here's the deal:

current_cluster = ""
new_lines = ""

# assuming all this text is in a string variable called lines
# (splitlines() avoids a spurious empty last line when the text ends in "\n")
for line in lines.splitlines():
    if line.startswith(">Cluster"):
        # 9 characters is ">Cluster "
        current_cluster = line[9:].strip()
    else:
        # otherwise, just take the line itself and prepend the current cluster
        line = "{} {}".format(current_cluster, line)

    new_lines += "{}\n".format(line)
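
For the regex route the question asks about, here is a minimal sketch rather than a definitive implementation. It assumes header lines look exactly like ">Cluster <number>", possibly with surrounding whitespace. The names CLUSTER_RE and prefix_with_cluster are made up for this example. Note that ">" has no special meaning in a regex, so it does not need the backslash, and the number needs a capture group so it can be read back from the match.

```python
import re

# Assumed header format: ">Cluster <number>", with optional surrounding whitespace.
# The parentheses capture the number so it can be reused on the following lines.
CLUSTER_RE = re.compile(r"^\s*>Cluster\s+(\d+)\s*$")


def prefix_with_cluster(text):
    """Return text with each data line prefixed by the latest cluster number."""
    current = None
    out = []
    for line in text.splitlines():
        match = CLUSTER_RE.match(line)
        if match:
            # Header line: remember its number and keep the line unchanged.
            current = match.group(1)
            out.append(line)
        elif line.strip() and current is not None:
            # Data line: prepend the most recent cluster number.
            out.append("{} {}".format(current, line.strip()))
        else:
            # Blank lines, or data before any header, pass through untouched.
            out.append(line)
    return "\n".join(out)


sample = ">Cluster 0\nL07510\n>Cluster 1\nAF480591\nAY457083"
print(prefix_with_cluster(sample))
```

Compared with slicing at a fixed offset, the capture group keeps working if the spacing around the number varies, at the cost of silently treating any line that does not match the pattern as data.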

Match a line and append to the lines below it
