I'm trying to edit the format of a file. Right now it looks like this:
>Cluster 0
L07510
>Cluster 1
AF480591
AY457083
>Cluster 2
M88154
>Cluster 3
CP000924
L09161
>Cluster 4
AY742307
>Cluster 5
L09163
L09162
>Cluster 6
AF321086
>Cluster 7
DQ666175
>Cluster 8
DQ288691
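I want to write something in Python that goes through each line, stops at a line that says ">Cluster x" (where x is a number), and then adds that number to the front of every line that follows it. When it reaches a new ">Cluster x" line, it starts over with the new value of x.
So it would look like this: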
>Cluster 0
0 L07510
>Cluster 1
1 AF480591
1 AY457083
>Cluster 2
2 M88154
>Cluster 3
3 CP000924
3 L09161
>Cluster 4
4 AY742307
>Cluster 5
5 L09163
5 L09162
>Cluster 6
6 AF321086
>Cluster 7
7 DQ666175
>Cluster 8
8 DQ288691
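I figured I could use a regex to search for ">Cluster x" (would the regular expression be something like '\>Cluster \d+'?) and then have the program prepend whatever \d+ matched to every line that follows the match. I'm just not sure how to write this. Any help would be much appreciated!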
Answer 0 (score: 2)
Tested:
#!/usr/bin/env python
# On a POSIX-compliant system, if this script is marked as executable,
# the shebang line above (it must be the very first line of the file)
# makes the Python interpreter run it instead of the shell.

# We need the sys module to read arguments from the terminal
import sys

# Prepare a variable for the cluster number to be used within the loop
cluster = ''

# Open the input file; the default mode is 'r' (read-only), a safe default.
# The with block closes the file once the loop is done.
with open(sys.argv[1]) as infile:
    # Loop through all lines in the file using a generator expression
    # that strips the newline character off each line as it is read
    for line in (line.strip() for line in infile):
        if line.startswith('>'):
            # str.split() splits on whitespace by default;
            # we want the cluster number at index 1
            cluster = line.split()[1]
            # output this line to stdout unmodified
            print(line)
        else:
            # output any other line with the cluster number prepended
            print(cluster + ' ' + line)
Usage:
$ python cluster_format.py input.txt > output.txt
Answer 1 (score: 1)
Oh, I love parsing.
Here's the deal:
current_cluster = ""
new_lines = ""
# assuming all this text is in a variable called lines
for line in lines.split("\n"):
    if line.startswith(">Cluster"):
        # ">Cluster " is 9 characters, so the number starts at index 9
        current_cluster = line[9:].strip()
    else:
        # otherwise, just take the line itself and prepend the current cluster
        line = "{} {}".format(current_cluster, line)
    new_lines += "{}\n".format(line)
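The regex idea from the question can be made to work too. Here is a minimal sketch, assuming every header looks exactly like ">Cluster 7" (the word, whitespace, then digits). The names `CLUSTER_RE` and `prefix_with_cluster` are made up for this example, and passing blank lines through unchanged is my own choice rather than something the question asked for. Note that `>` has no special meaning in a regex, so it does not need a backslash.

```python
import re

# Matches a header line such as ">Cluster 12" and captures the number.
# (Hypothetical name; the pattern assumes the header format shown in the question.)
CLUSTER_RE = re.compile(r"^>Cluster\s+(\d+)")


def prefix_with_cluster(text):
    """Return text with every non-header line prefixed by its cluster number."""
    current = ""
    out = []
    for line in text.splitlines():
        match = CLUSTER_RE.match(line)
        if match:
            # New header: remember its number and keep the line unchanged
            current = match.group(1)
            out.append(line)
        elif line.strip():
            # Ordinary data line: prepend the most recent cluster number
            out.append("{} {}".format(current, line))
        else:
            # Blank line: pass through as-is (an assumption, see above)
            out.append(line)
    return "\n".join(out)


sample = ">Cluster 0\nL07510\n>Cluster 1\nAF480591\nAY457083"
print(prefix_with_cluster(sample))
```

Capturing the digits with a group means you don't have to count characters or rely on the position of the number after splitting, at the cost of a slightly stricter assumption about what a header looks like.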