Question

I have a folder containing about 50 .txt files, each of which contains data in the following format.

=== Predictions on test data ===

 inst#     actual  predicted error distribution (OFTd1_OF_Latency)
     1        1:S        2:R   +   0.125,*0.875 (73.84)

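I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (decimals less than 1.0).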

I would like it to look like the following when finished, but preferably as a .csv file.

ID   True   Pred   S      R
1    S      R      0.125  0.875
2    R      R      0.105  0.895
3    S      S      0.945  0.055
.    .      .      .      .
.    .      .      .      .
.    .      .      .      .
n    S      S      0.900  0.100

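I'm a beginner and a bit fuzzy on how to parse all of this and then concatenate and append it. This is what I was thinking, but feel free to suggest another direction if that would be easier.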

for i in range(1, n):
   s = str(i)
   readin = open('mydata/output/output'+s+'out','r')
   #The files are all named the same but with different numbers associated
   output = open("mydata/summary.csv", "a")
   storage = []
   for line in readin:
     #data extraction/concatenation here
     if line.startswith('1'):
        id = i
        true = # split at the ':' and take the letter after it
        pred = # split at the second ':' and take the letter after it
         #some have error '+'s and some don't so I'm not exactly sure what to do to get the distributions
        ds = # split at the ',' and take the string of 5 digits before it
        if pred == 'R':
           dr = #skip the character after the comma but take the five characters after
        else: 
           #take the five characters after the comma
        lineholder = id+' , '+true+' , '+pred+' , '+ds+' , '+dr
     else: continue
   output.write(lineholder)

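I think using indexing is another option, but it might complicate things if the spacing is off in any of the files, and I haven't checked that yet.

Thanks for your help!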

Answer 1

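First of all, if you want to work with CSV, you should use the csv module that ships with Python. More information about this module is available here: https://docs.python.org/2.7/library/csv.html I won't demonstrate how to use it, because it is pretty simple.

As for reading the input data, here is my suggestion for how to break up each line of the data itself. I'm assuming that the values in a data line of the input file are separated by whitespace, and that no value can itself contain whitespace: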
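For reference, a minimal sketch of writing the summary with the csv module might look like the following. The sample rows, the helper name `write_summary`, and the use of `io.StringIO` in place of a real file are assumptions for illustration; the column names are taken from the table in the question.

```python
import csv
import io

def write_summary(rows, fileobj):
    """Write a header line and the given rows to fileobj as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(['ID', 'True', 'Pred', 'S', 'R'])  # header from the question
    writer.writerows(rows)

# Hypothetical rows, using the first values shown in the question
rows = [(1, 'S', 'R', '0.125', '0.875'), (2, 'R', 'R', '0.105', '0.895')]

buffer = io.StringIO()  # stands in for open('mydata/summary.csv', 'w', newline='')
write_summary(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, passing `newline=''` to `open` lets the csv module control the line endings itself.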

def process_line(id_, line):
    pieces = line.split() # Now we have a list of whitespace-separated values
    true = pieces[1].split(':')[1] # split at the ':' and take the letter after it
    pred = pieces[2].split(':')[1] # same for the predicted class
    if len(pieces) == 6: # There was an error, the + is there
        p4 = pieces[4]
    else: # There was no '+' only spaces
        p4 = pieces[3]
    ds = p4.split(',')[0].lstrip('*') # value before the ',' (drop a leading '*' if present)
    if pred == 'R':
        dr = p4.split(',')[1][1:] # value after the ',', skipping the '*' in front of it
    else:
        dr = p4.split(',')[1] # value after the ',' as-is
    # str(id_) so that an integer index can be joined with the other strings
    return ' , '.join([str(id_), true, pred, ds, dr])

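What I mainly use here is the string split function: https://docs.python.org/2/library/stdtypes.html#str.split and, in one place, the simple str[1:] syntax to skip the first character of a string (strings are sequences after all, so we can use this slicing syntax).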

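Keep in mind that my function does not handle any errors or any line format other than the one you posted. If the values in each line are separated by tabs rather than spaces, you could replace the line pieces = line.split() with pieces = line.split('\t'). (Note that split() with no arguments already splits on any run of whitespace, tabs included, so this only matters if a value can itself contain spaces.)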
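To see why the length check on the split result matters, here is a small sketch of what split() returns for a line with and without the '+' error marker. The second sample line and the helper name `distribution_field` are made up for illustration, following the format shown in the question.

```python
with_error = "     1        1:S        2:R   +   0.125,*0.875 (73.84)"
without_error = "     2        2:R        2:R       0.105,*0.895 (80.1)"  # hypothetical line

pieces_err = with_error.split()   # 6 items: the '+' is its own piece
pieces_ok = without_error.split() # 5 items: no '+'
print(len(pieces_err), pieces_err)
print(len(pieces_ok), pieces_ok)

def distribution_field(pieces):
    """Return the 'x,y' distribution field, which is second to last in both cases."""
    return pieces[-2]

print(distribution_field(pieces_err))
print(distribution_field(pieces_ok))
```

Indexing from the end of the list is one way to avoid the length check altogether, assuming the distribution is always followed by exactly one more field.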

Answer 2

I think you can pick out the floats and then combine them with the class strings, with the help of the re module, as follows:

import re

with open('sample.txt', 'r') as f:
    lines = f.readlines()

# Class labels such as '1:S' and '2:R', one list per line
labels = [re.findall(r'\w+:\w+', line) for line in lines]
print(labels)
# Decimal numbers such as '0.125', one list per line
numbers = [re.findall(r'\d+\.\d+', line) for line in lines]
print(numbers)
s = labels + numbers
print(s) # for a one-line file: [['1:S', '2:R'], ['0.125', '0.875', '73.84']]

This program was written for a single line; you can also use it for multiple lines, but then you need to use a loop.

Contents of sample.txt:

1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)

When you run the program, the result will be: [['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]

Then simply concatenate them.
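One way to do that concatenation per line, so that each row keeps its own labels and numbers together, is sketched below. The lines are inlined rather than read from sample.txt, and the function name `line_to_row` is hypothetical.

```python
import re

def line_to_row(line):
    """Return [true, pred, first value, second value] for a data line, else None."""
    labels = re.findall(r'\w+:\w+', line)    # e.g. ['1:S', '2:R']
    numbers = re.findall(r'\d+\.\d+', line)  # e.g. ['0.125', '0.875', '73.84']
    if len(labels) < 2 or len(numbers) < 2:
        return None  # header or blank line, not a data line
    true, pred = (label.split(':')[1] for label in labels[:2])
    return [true, pred, numbers[0], numbers[1]]

sample_lines = [
    "=== Predictions on test data ===",
    "1 1:S 2:R + 0.125,*0.875 (73.84)",
    "2 1:S 2:R + 0.15,*0.85 (69.4)",
]
rows = [row for row in map(line_to_row, sample_lines) if row is not None]
print(rows)
```

Returning None for lines that do not look like data keeps the header and blank lines out of the result.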

Answer 3

This uses a regular expression and the csv module.

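import re
import csv

# Capture the true class, the predicted class and the two distribution values
# from lines whose first non-blank character is '1'; a '*' before a value is skipped
matcher = re.compile(r'\s*1.*:(.).*:(.).* \*?([^ ,]*),\*?([^ ]*) ')
filenametemplate = 'mydata/output/output%iout'
n = 51  # assumed: one more than the number of files (about 50 in the question)

with open('mydata/summary.csv', 'w', newline='') as summary:
    output = csv.writer(summary)
    for i in range(1, n):
        with open(filenametemplate % i) as infile:
            for line in infile:
                m = matcher.match(line)
                if m:
                    output.writerow([i] + list(m.groups()))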
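
As a quick check of how such a pattern behaves, the sketch below runs a similar regular expression against the sample line from the question. The pattern is an illustrative variant with named groups, not necessarily identical to the one above, and it assumes the first distribution value belongs to S and the second to R, as in the question's table.

```python
import re

# Illustrative pattern: inst#, actual, predicted, optional '+', then the two values
pattern = re.compile(
    r'\s*(?P<inst>\d+)\s+\d+:(?P<true>\w)\s+\d+:(?P<pred>\w)\s+'
    r'(?:\+\s+)?\*?(?P<s>[\d.]+),\*?(?P<r>[\d.]+)'
)

line = "     1        1:S        2:R   +   0.125,*0.875 (73.84)"
m = pattern.match(line)
fields = m.groupdict() if m else None  # None for headers and blank lines
print(fields)
```

Named groups make it harder to mix up the S and R columns when the row is written out.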

Extract data from multiple TXT files and create a summary CSV file in Python
