Question

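I have a script that generates multiple output files (for example, out0.txt through out250.txt). I want to compare a specific value across all of the output files and print the 10 highest of those values.

Each output file has many lines containing various data. The lines I am interested in are the ones that carry the match statistics on a line of their own. Below is an example excerpt from one of the files.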

 ....
 Score
 Matches: 592 (52.3%) #the 52.3 part of the 592 portion
 Ref: 1 GT......
 Query: 340
 Matches: 584 (54.5%)  #and this for 54.3

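Specifically, I am interested in the percentage part, because I only want to display the 10 highest percentages across all files.

I have split files before/after specific data in the past, but I usually rely on line numbers. Unfortunately, the positions of these Matches lines are somewhat irregular rather than falling every 3 lines or so.

Should I try to have the program look for the number next to the % sign, given that it is the only part of the file output that provides this information?

In short, how do I extract only the percentage values from among the other strings of output in all the files, then compare them and print the 10 highest?

Thanks,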

Answer 1

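Looking at the file, the data you are after always seems to be on a line that starts with Matches, so use str.startswith() to find those lines. You can then use a regular expression to pull out the percentage values. Example code (Python 2):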

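import re

# Matches a number followed by a percent sign, e.g. "52.3%".
percent_regex = re.compile(r'([\d.]+%)')

# Build a list while the file is still open, so the lines remain
# available after the with block closes it.
with open('my_file') as input_file:
    percent_lines = [line for line in input_file if line.startswith('Matches')]

for line in percent_lines:
    print(percent_regex.findall(line))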
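
To rank the values, they need to be numbers rather than strings such as '52.3%'. Here is a minimal sketch of that step, assuming the percentage is always the number directly before the % sign. The helper name iter_percentages is made up for illustration, and the lstrip() call is an assumption added to tolerate indented lines like those in the excerpt:

```python
import re

# Capture only the digits and the dot, leaving the "%" outside the group.
PERCENT_RE = re.compile(r'(\d+(?:\.\d+)?)%')

def iter_percentages(lines):
    """Yield each percentage on a 'Matches' line as a float (hypothetical helper)."""
    for line in lines:
        # lstrip() is an assumption: it tolerates leading spaces before "Matches".
        if line.lstrip().startswith('Matches'):
            for value in PERCENT_RE.findall(line):
                yield float(value)

sample = [
    " Score\n",
    " Matches: 592 (52.3%) #the 52.3 part of the 592 portion\n",
    " Query: 340\n",
    " Matches: 584 (54.5%)  #and this for 54.3\n",
]
print(list(iter_percentages(sample)))  # [52.3, 54.5]
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.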

Answer 2

import re

# Capture the number immediately before a percent sign, e.g. "52.3" in "(52.3%)".
winpat = re.compile(r"([\d.]+)%")

def get_values_from_file(filename):
    values = []
    # The with block closes the file even if an error occurs.
    with open(filename) as f:
        for line in f:
            if "Matches" in line:
                found = winpat.findall(line)
                if found:  # skip "Matches" lines that carry no percentage
                    values.append(float(found[0]))
    return values

all_values = []
for filename in ["out0.txt", "out1.txt"]:
    all_values += get_values_from_file(filename)

# Sort from highest to lowest and show the top 10.
all_values.sort(reverse=True)
print(all_values[0:10])
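
The file list above is hard-coded to two names. Below is a sketch of one way to cover out0.txt through out250.txt, assuming the files sit in a single directory and match the pattern out*.txt. The function name top_matches and its parameters are illustrative and not part of the code above. It also keeps the file name next to each value so that each of the top 10 can be traced back to its file:

```python
import glob
import os
import re

# A number (with optional decimal part) directly before a percent sign.
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)%")

def top_matches(directory=".", pattern="out*.txt", n=10):
    """Return the n highest (percentage, filename) pairs (hypothetical helper)."""
    results = []
    for path in glob.glob(os.path.join(directory, pattern)):
        with open(path) as f:
            for line in f:
                if "Matches" in line:
                    for value in PERCENT_RE.findall(line):
                        results.append((float(value), os.path.basename(path)))
    # Tuples compare by their first item, so the highest percentage comes first.
    results.sort(reverse=True)
    return results[:n]
```

Calling top_matches() from the directory that holds the output files would return up to 10 pairs, highest percentage first.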

Answer 3

import re

s = """Score
Matches: 592 (52.3%) #the 52.3 part of the 592 portion
Ref: 1 GT......
Query: 340
Matches: 584 (54.5%)  #and this for 54.3
"""
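# Inside a character class "|" is a literal pipe, so it is dropped here;
# the raw string avoids doubling every backslash.
exp = re.compile(r"Matches: [0-9]+ \(([0-9.]+)%\)")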
matches = exp.findall(s)
print(matches) #['52.3', '54.5']
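
Note that findall returns strings, and strings compare character by character, so '9.5' would rank above '54.5'. Here is a short sketch of converting before ranking, using heapq.nlargest from the standard library. The sample values are invented for illustration:

```python
import heapq

# Made-up values in the string form that findall returns.
matches = ['52.3', '54.5', '9.5', '100.0']

# Comparing as strings is lexicographic and picks the wrong winner.
print(max(matches))  # 9.5

# Convert to float first, then take the n largest (10 in the question).
values = [float(m) for m in matches]
top = heapq.nlargest(10, values)
print(top)  # [100.0, 54.5, 52.3, 9.5]
```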

Comparing file contents
