I am currently working with very large files in Python that look like this:
junk
junk
junk
--- intermediate:
1489 pi0 111 [686] (1491,1492)
0.534 -0.050 -0.468 0.724 0.135
1499 pi0 111 [690] (1501,1502)
-1.131 0.503 12.751 12.812 0.135
--- final:
32 e- 11 [7]
9.072 20.492 499.225 499.727 0.001
33 e+ -11 [6]
-11.317 -17.699 2632.568 2632.652 0.001
12 s 3 [10] (91) >43 {+5}
2.946 0.315 94.111 94.159 0.500
14 g 21 [11] (60,61) 34>>16 {+7,-6}
-0.728 3.329 5.932 6.907 0.950
------------------------------------------------------------------------------
junk
junk
--- intermediate:
repeat
I want to merge every two lines after the "--- final:" line, up to the "----------------" line. For example, I would like an output file that reads:
32 e- 11 [7] 9.072 20.492 499.225 499.727 0.001
33 e+ -11 [6] -11.317 -17.699 2632.568 2632.652 0.001
12 s 3 [10] 2.946 0.315 94.111 94.159 0.500
14 g 21 [11] -0.728 3.329 5.932 6.907 0.950
Note how I drop the extra entries from the lines that have no leading spaces. My current approach is:
start = False
for line in myfile:
    line = line.strip()
    fields = line.split()
    if len(fields) == 0:
        continue
    if not start:
        # compare the whole stripped line: split() breaks "--- final:" into two fields
        if line == "--- final:":
            start = True
        continue
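I intended the len(fields) == 0 check to stop processing at the "---------" line, with processing resuming at the next "--- final" line. I currently don't know how to combine each pair of lines while dropping the extra information from the lines without leading spaces. Any suggestions?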
Answer 0 (score: 0)
A quick and dirty way to merge every other line:
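# stop at len(lines) - 1 so an unpaired trailing line cannot raise IndexError
for i in range(0, len(lines) - 1, 2):
    fields1 = lines[i].strip().split()
    fields2 = lines[i + 1].strip().split()
    # keep only the first 4 fields of the first line of each pair
    print("\t".join(fields1[:4] + fields2))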
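Note that this assumes all the lines to be merged have already been extracted into a list named lines, and that I have simply hard-coded the number of elements to keep from the first line of each pair (4).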
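Since the files are very large and the "--- final:" section repeats, the extraction step assumed above can be done while streaming the file instead of building a list first. The following is only a sketch under stated assumptions: the function name merge_final_pairs and the keep parameter are made up for illustration, a block is assumed to end at a line consisting solely of dashes, and the sample text is a shortened copy of the data in the question.

```python
import io


def merge_final_pairs(handle, keep=4):
    """Yield one merged record per line pair inside every '--- final:' block.

    `handle` is any iterable of lines (an open file works).
    `keep` is the number of fields kept from the first line of each pair.
    """
    in_final = False  # True while we are inside a '--- final:' block
    pending = None    # fields of the first line of a pair, awaiting its partner
    for raw in handle:
        line = raw.strip()
        if line == "--- final:":
            in_final, pending = True, None
        elif in_final and line and set(line) == {"-"}:
            # a line made only of dashes closes the block
            in_final, pending = False, None
        elif in_final and line:
            if pending is None:
                pending = line.split()[:keep]
            else:
                yield "\t".join(pending + line.split())
                pending = None


sample = io.StringIO("""junk
--- intermediate:
1489 pi0 111 [686] (1491,1492)
0.534 -0.050 -0.468 0.724 0.135
--- final:
32 e- 11 [7]
9.072 20.492 499.225 499.727 0.001
12 s 3 [10] (91) >43 {+5}
2.946 0.315 94.111 94.159 0.500
------------------------------------------------------------------------------
junk
""")

for record in merge_final_pairs(sample):
    print(record)
```

With a real file, passing the open file object (for example inside a with open(...) block) in place of sample keeps memory use low, because only one pending line is held at a time.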
Answer 1 (score: 0)
As long as you know the exact lines surrounding the section you want:
# split the large text into lines
lines = large_text.split('\n')
# get the indexes of the beginning and end of your target section
idx_start = lines.index("--- final:")
idx_finish = lines.index("------------------------------------------------------------------------------")
# iterate through the section in steps of 2, split on spaces, remove empty strings, print them as tab delimited
for idx in range(idx_start + 1, idx_finish, 2):
    # join with an explicit space so the two lines cannot fuse into one field
    out = list(filter(None, (lines[idx] + " " + lines[idx + 1]).split(" ")))
    print("\t".join(out))
where large_text is the file imported as one giant string.
Edit: To open the file 'large_text.txt' as a string, try this:
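with open('large_text.txt', 'r') as f:
    # split the large text into lines; splitlines() drops the trailing newlines,
    # which readlines() would keep and which would make the index() lookups fail
    lines = f.read().splitlines()
# get the indexes of the beginning and end of your target section
idx_start = lines.index("--- final:")
idx_finish = lines.index("------------------------------------------------------------------------------")
# iterate through the section in steps of 2, split on spaces, remove empty strings, print them as tab delimited
for idx in range(idx_start + 1, idx_finish, 2):
    # join with an explicit space so the two lines cannot fuse into one field
    out = list(filter(None, (lines[idx] + " " + lines[idx + 1]).split(" ")))
    print("\t".join(out))
Assumptions: if the columns turn out to be separated by tabs rather than spaces, changing split(" ") to split("\t") should do the job. The version below adds the formatting fix for the first line of each pair. The same assumptions apply.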
with open('./large_text.txt', 'r') as f:
    # split the large text into lines
    lines = f.read().split("\n")
# get the indexes of the beginning and end of your target section
idx_start = lines.index("--- final:")
idx_finish = lines.index("------------------------------------------------------------------------------")
# iterate through the section in steps of 2, split on spaces, remove empty strings, print them as tab delimited
for idx in range(idx_start + 1, idx_finish, 2):
    line_spaces = list(filter(None, lines[idx].split(" ")))[0:4]
    other_line = list(filter(None, lines[idx + 1].split(" ")))
    out = line_spaces + other_line
    print("\t".join(out))
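One caveat: list.index returns only the first match, so the code above handles just the first "--- final:" section, while the question says the pattern repeats. A possible way around this, sketched below, is to use the optional start argument of list.index to walk through every section. The helper names final_sections and merge_sections are hypothetical, and the sketch assumes each section is closed by a line beginning with at least five dashes.

```python
def final_sections(lines, start_marker="--- final:"):
    """Return (start, finish) index pairs, one per '--- final:' section."""
    sections = []
    pos = 0
    while True:
        try:
            # the second argument makes index() search from `pos` onwards
            start = lines.index(start_marker, pos)
        except ValueError:
            break  # no further sections
        finish = start + 1
        while finish < len(lines) and not lines[finish].startswith("-----"):
            finish += 1
        sections.append((start, finish))
        pos = finish
    return sections


def merge_sections(lines, keep=4):
    """Merge each pair of lines in every section into one tab-delimited record."""
    merged = []
    for start, finish in final_sections(lines):
        # finish - 1 guards against an unpaired last line in a section
        for idx in range(start + 1, finish - 1, 2):
            merged.append("\t".join(lines[idx].split()[:keep] + lines[idx + 1].split()))
    return merged


text = """junk
--- final:
32 e- 11 [7]
9.072 20.492 499.225 499.727 0.001
------------------------------------------------------------------------------
junk
--- final:
14 g 21 [11] (60,61) 34>>16 {+7,-6}
-0.728 3.329 5.932 6.907 0.950
------------------------------------------------------------------------------
"""

for record in merge_sections(text.split("\n")):
    print(record)
```

This still loads the whole file into memory, so for very large inputs a streaming loop may be the better trade-off.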
Answer 2 (score: 0)
You can use the newer regex module and a regular expression to solve your problem:
import regex as re

rx = re.compile(r'''(?V1)
    (?:^---\ final:[\n\r])|(?:\G(?!\A))
    ^(\ *\d+.+?)\ *$[\n\r]
    ^\ +(.+)$[\n\r]
    ''', re.MULTILINE | re.VERBOSE)

junky_string = your_string
matches = [" ".join(match.groups())
           for match in rx.finditer(junky_string)
           if match.group(1) is not None]
print(matches)
# [' 32 e- 11 [7] 9.072 20.492 499.225 499.727 0.001',
# ' 33 e+ -11 [6] -11.317 -17.699 2632.568 2632.652 0.001',
# ' 12 s 3 [10] (91) >43 {+5} 2.946 0.315 94.111 94.159 0.500',
# ' 14 g 21 [11] (60,61) 34>>16 {+7,-6} -0.728 3.329 5.932 6.907 0.950']
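This looks for --- final: at the very start of a line in multiline mode, and then for the lines beginning with digits that directly follow the previous match (see the explanation on regex101.com for details).
Afterwards, the matched groups of each pair are joined into a single string.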
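If installing the third-party regex module is not an option, a rough equivalent can be sketched with the standard library re module. The \G anchor is not supported by re, so this version first isolates each "--- final:" block and then matches line pairs inside it. It also trims the extra entries so the result matches the output asked for in the question. The names BLOCK, PAIR and merge_with_re are made up for illustration, and the sketch assumes the first line of every pair starts with the four fields "number name id [n]".

```python
import re

# everything between a '--- final:' line and the next line of 5+ dashes
BLOCK = re.compile(r"^--- final:\n(.*?)^-{5,}[ \t]*$", re.MULTILINE | re.DOTALL)

# group 1: the first four fields of the first line; group 2: the whole second line
PAIR = re.compile(
    r"^[ \t]*(\d+[ \t]+\S+[ \t]+-?\d+[ \t]+\[\d+\]).*\n[ \t]*(\S.*)$",
    re.MULTILINE,
)


def merge_with_re(text):
    """Return one space-joined record per line pair in every final block."""
    merged = []
    for block in BLOCK.finditer(text):
        for head, tail in PAIR.findall(block.group(1)):
            merged.append(" ".join(head.split() + tail.split()))
    return merged


sample = """junk
--- final:
32 e- 11 [7]
9.072 20.492 499.225 499.727 0.001
12 s 3 [10] (91) >43 {+5}
2.946 0.315 94.111 94.159 0.500
------------------------------------------------------------------------------
junk
"""

for record in merge_with_re(sample):
    print(record)
```

Like the regex version, this needs the whole file in one string, which may be a concern for very large inputs.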