Question

I'm currently working with very large files in Python that look like this:

junk
junk
junk
--- intermediate:
1489       pi0     111 [686] (1491,1492)   
                             0.534    -0.050    -0.468     0.724     0.135
1499       pi0     111 [690] (1501,1502)   
                            -1.131     0.503    12.751    12.812     0.135
--- final:
 32        e-      11 [7]    
                             9.072    20.492   499.225   499.727     0.001
 33        e+     -11 [6]    
                           -11.317   -17.699  2632.568  2632.652     0.001
 12         s       3 [10] (91)  >43 {+5}
                             2.946     0.315    94.111    94.159     0.500
 14         g      21 [11] (60,61)  34>>16 {+7,-6}
                            -0.728     3.329     5.932     6.907     0.950
------------------------------------------------------------------------------
junk
junk
--- intermediate:
repeat

I want to merge every two lines that come after the "--- final:" line, up to the "----------------" line. For example, I'd like the output file to read

 32        e-      11 [7]      9.072    20.492   499.225   499.727     0.001
 33        e+     -11 [6]    -11.317   -17.699  2632.568  2632.652     0.001
 12         s       3 [10]     2.946     0.315    94.111    94.159     0.500
 14         g      21 [11]    -0.728     3.329     5.932     6.907     0.950

Notice how I omit the extra entries in the lines without leading spaces. My current approach is

start = False
for line in myfile:
    line = line.strip()
    fields = line.split()
    if len(fields)==0:
        continue
    if not start:
        if fields[0] == "----final:":
            start = True
        continue

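The len(fields) == 0 check is supposed to stop processing at the "---------" line and carry on until another "--- final" line is seen. What I don't know is how to combine the two lines while dropping the extra information from the lines without leading spaces. Any suggestions?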

Answer 1

A quick and dirty way to merge every other line:

for i in range(0,len(lines),2):

    fields1 = lines[i].strip().split()
    fields2 = lines[i+1].strip().split()
    print("\t".join(fields1[:4]+fields2))

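Note that this assumes all the lines to be merged have already been extracted into a list named lines, and that the number of elements kept from each first line (4) is simply hard-coded.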
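As a slightly more defensive variant of the same idea, here is a minimal sketch that pairs lines with zip instead of indexing, so an odd trailing line does not raise an IndexError. The function name merge_pairs and the keep parameter are made up for illustration; like the snippet above, it assumes the section's lines are already collected in a list.

```python
def merge_pairs(lines, keep=4):
    """Join each header line with the line that follows it.

    Only the first `keep` whitespace-separated fields of the header
    line are kept; the numeric line is kept whole.
    """
    it = iter(lines)
    merged = []
    # zip(it, it) pulls two items at a time from the same iterator;
    # a dangling last line (odd count) is silently dropped.
    for header, values in zip(it, it):
        merged.append("\t".join(header.split()[:keep] + values.split()))
    return merged


sample = [
    " 32        e-      11 [7]    ",
    "                             9.072    20.492   499.225   499.727     0.001",
    " 12         s       3 [10] (91)  >43 {+5}",
    "                             2.946     0.315    94.111    94.159     0.500",
]
for row in merge_pairs(sample):
    print(row)
```

Whether silently dropping an unpaired line is acceptable depends on the data; raising an error instead may be the safer choice if the file format is strict.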

Answer 2

As long as you know the exact lines surrounding the section you want:

#split the large text into lines
lines = large_text.split('\n')
#get the indexes of the beginning and end of your target section
idx_start = lines.index("--- final:")
idx_finish= lines.index("------------------------------------------------------------------------------")
#iterate through the section in steps of 2, split on spaces, remove empty strings, print them as tab delimited
for idx in range( idx_start+1, idx_finish, 2):
    out = list(filter(None,(lines[idx]+lines[idx+1]).split(" ")))
    print("\t".join(out))

where large_text is the file read in as one giant string.

Edit: to open the file 'large_text.txt' as a string, try this:

with open('large_text.txt','r') as f:
    #split the large text into lines (splitlines drops the trailing newlines, so the index() lookups below can match)
    lines = f.read().splitlines()
#get the indexes of the beginning and end of your target section
idx_start = lines.index("--- final:")
idx_finish= lines.index("------------------------------------------------------------------------------")
#iterate through the section in steps of 2, split on spaces, remove empty strings, print them as tab delimited
for idx in range( idx_start+1, idx_finish, 2):
    out = list(filter(None,(lines[idx]+lines[idx+1]).split(" ")))
    print("\t".join(out))

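Assumptions

You know the lines that delimit the section of interest (i.e. "--- final:").

Your values are space-delimited rather than tab-delimited. If not, change split(" ") to split("\t").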
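If you are unsure which delimiter the file uses, one option is to call split() with no argument: it splits on any run of whitespace (spaces or tabs) and never yields empty strings, so the filter(None, ...) step is not needed. A small illustration on a made-up line that mixes both:

```python
line = " 32        e-\t11 [7]    "

# split(" ") produces an empty string for every extra space and leaves tabs alone
print(line.split(" "))

# split() treats any run of spaces/tabs as one separator and trims both ends
print(line.split())
```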

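This should be the winner: added a formatting fix so that only the first four fields of each first line are kept. The same assumptions apply.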

with open('./large_text.txt','r') as f:
    #split the large text into lines
    lines = f.read().split("\n")
#get the indexes of the beginning and end of your target section
idx_start = lines.index("--- final:")
idx_finish= lines.index("------------------------------------------------------------------------------")
#iterate through the section in steps of 2, split on spaces, remove empty strings, print them as tab delimited
for idx in range( idx_start+1, idx_finish, 2):
    line_spaces = list(filter(None,lines[idx].split(" ")))[0:4]
    other_line = list(filter(None,(lines[idx+1]).split(" ")))
    out = line_spaces + other_line
    print("\t".join(out))
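Note that lines.index only finds the first "--- final:" section, and reading the whole file at once can be costly for very large inputs. Below is a minimal sketch of a streaming alternative. It assumes every section starts with a line equal to "--- final:" and ends with a line made up only of dashes; the name iter_final_rows and the keep parameter are hypothetical.

```python
import io


def iter_final_rows(lines, keep=4):
    """Yield merged rows from every '--- final:' section of `lines`.

    `lines` can be any iterable of strings, e.g. an open file object,
    so the input is processed one line at a time.
    """
    in_section = False
    header = None  # first line of the current pair, if one is pending
    for raw in lines:
        line = raw.strip()
        if not in_section:
            in_section = (line == "--- final:")
            continue
        if not line:
            continue  # ignore blank lines inside a section
        if set(line) == {"-"}:
            # a line consisting only of dashes closes the section
            in_section = False
            header = None
            continue
        fields = line.split()
        if header is None:
            header = fields[:keep]  # drop the extra entries
        else:
            yield "\t".join(header + fields)
            header = None


demo = io.StringIO(
    "junk\n"
    "--- intermediate:\n"
    "1489       pi0     111 [686] (1491,1492)\n"
    "      0.534    -0.050    -0.468     0.724     0.135\n"
    "--- final:\n"
    " 32        e-      11 [7]\n"
    "      9.072    20.492   499.225   499.727     0.001\n"
    "--------------------\n"
    "junk\n"
    "--- final:\n"
    " 14         g      21 [11] (60,61)  34>>16 {+7,-6}\n"
    "     -0.728     3.329     5.932     6.907     0.950\n"
    "--------------------\n"
)
for row in iter_final_rows(demo):
    print(row)
```

With a real file, the open file object can be passed in directly (for example inside a with open('large_text.txt') as f: block), since iterating over a file yields one line at a time.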

Answer 3

You can solve this with the newer regex module and a regular expression:

import regex as re

rx = re.compile(r'''(?V1)
    (?:^---\ final:[\n\r])|(?:\G(?!\A))
    ^(\ *\d+.+?)\ *$[\n\r]
    ^\ +(.+)$[\n\r]
    ''', re.MULTILINE | re.VERBOSE)

junky_string = your_string

matches = [" ".join(match.groups())
           for match in rx.finditer(junky_string)
           if match.group(1) is not None]
print(matches)
# [' 32 e- 11 [7] 9.072 20.492 499.225 499.727 0.001',
# ' 33 e+ -11 [6] -11.317 -17.699 2632.568 2632.652 0.001',
# ' 12 s 3 [10] (91) >43 {+5} 2.946 0.315 94.111 94.159 0.500',
# ' 14 g 21 [11] (60,61) 34>>16 {+7,-6} -0.728 3.329 5.932 6.907 0.950']

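This looks for --- final: at the very beginning of a line, and then matches the number lines that follow it (see the explanation on regex101.com).
Afterwards, the matched groups are joined with a single space.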
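If installing the third-party regex module is not an option, a similar result can be sketched with the standard-library re module in two passes: first isolate each "--- final:" section, then pair up the lines inside it. The patterns and the helper name final_rows below are illustrative assumptions based on the sample data, not a drop-in replacement for the expression above. Unlike it, this sketch also drops the extra entries after the bracketed field.

```python
import re

# A section: the '--- final:' line, then everything up to a line of dashes.
SECTION = re.compile(
    r"^--- final:[ \t]*\n(.*?)^-{5,}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

# A pair: index, name, id and [n] on the first line (the rest of that line
# is ignored), followed by an indented line of values.
PAIR = re.compile(
    r"^[ \t]*(\d+)[ \t]+(\S+)[ \t]+(-?\d+)[ \t]+(\[\d+\]).*\n"
    r"[ \t]+(\S.*?)[ \t]*$",
    re.MULTILINE,
)


def final_rows(text):
    """Return one space-joined row per line pair in every final section."""
    rows = []
    for section in SECTION.finditer(text):
        for pair in PAIR.finditer(section.group(1)):
            head = " ".join(pair.group(1, 2, 3, 4))
            values = " ".join(pair.group(5).split())
            rows.append(head + " " + values)
    return rows


text = (
    "junk\n"
    "--- final:\n"
    " 33        e+     -11 [6]    \n"
    "                           -11.317   -17.699  2632.568  2632.652     0.001\n"
    " 12         s       3 [10] (91)  >43 {+5}\n"
    "                             2.946     0.315    94.111    94.159     0.500\n"
    "------------------------------\n"
    "junk\n"
)
print(final_rows(text))
```

This version needs the whole text in memory, so for very large files a line-by-line approach may be preferable.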

Combining every two lines while reading a .txt file in Python

3 answers: