Question

I have a .txt file that looks like this:

Epoch [1]   Iteration [0/51]    Training Loss 1.6095 (1.6095)   Training Accuracy 14.844    Epoch [1]   Iteration [10/51]

What is wrong with the following code? It returns empty lists.

accuracy = list(map(lambda x: x.split('\t')[-1], re.findall(r"\'Training Accuracy\': \d+.\d+", file)))
print(accuracy)
loss = list(map(lambda x: x.split('\t')[-1], re.findall(r"\'Training Loss\': \d.\d+", file)))
print(loss)
epoch = list(map(lambda x: x.split('\t')[-1], re.findall(r"\'Epoch\': \d", file)))
print(epoch)

Thanks!

Answer 1

x.split('\t')[-1] only gives you the last chunk of the split string, whereas the substrings you need sit in different chunks.
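Judging from the sample line, there is a further reason the lists come back empty: each pattern in the question requires literal quotes and a colon (as in 'Training Accuracy': ), and none of these occur in the text, so re.findall has nothing to return. A minimal sketch of the same findall approach with the patterns adjusted, assuming the file content looks like the sample; `text` is a stand-in for the file's contents:

```python
import re

# Stand-in for the file's contents; assumed to look like the sample line in the question.
text = ("Epoch [1]   Iteration [0/51]    Training Loss 1.6095 (1.6095)   "
        "Training Accuracy 14.844    Epoch [1]   Iteration [10/51]")

# The question's pattern expects quotes and a colon that the text does not contain.
original = re.findall(r"\'Training Accuracy\': \d+.\d+", text)

# Match the text as it is actually written; the capture group keeps only the number.
accuracy = re.findall(r"Training Accuracy (\d+\.\d+)", text)
loss = re.findall(r"Training Loss (\d+\.\d+)", text)
epoch = re.findall(r"Epoch \[(\d+)\]", text)

print(original)  # []
print(accuracy)  # ['14.844']
print(loss)      # ['1.6095']
print(epoch)     # ['1', '1']
```

With a capture group in the pattern, re.findall returns the captured number directly, so no split is needed afterwards.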

Use the following re.search() solution:

import re

# Sample line from the log file
s = 'Epoch [1]   Iteration [0/51]    Training Loss 1.6095 (1.6095)   Training Accuracy 14.844    Epoch [1]   Iteration [10/51]'
# Capture the three fields in the order they appear; the greedy .+ skips the text in between
pat = re.compile(r'(Training Loss \d+\.\d+).+(Training Accuracy \d+\.\d+).+(Epoch \[\d+\])')
loss, accuracy, epoch = pat.search(s).groups()

print(loss, accuracy, epoch, sep='\n')

Output (in order):

Training Loss 1.6095
Training Accuracy 14.844
Epoch [1]
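If the log contains many such records, the same idea can be applied to every record with named groups and re.finditer. This is only a sketch under assumptions: it assumes each record follows the layout of the sample, the second record in `log` is invented for illustration, and `parse_log` and the key `loss_in_parens` are hypothetical names (the question does not say what the number in parentheses means).

```python
import re

# One record: epoch, iteration, loss, parenthesised value, accuracy.
RECORD_RE = re.compile(
    r"Epoch \[(?P<epoch>\d+)\]\s+"
    r"Iteration \[(?P<iteration>\d+)/(?P<total>\d+)\]\s+"
    r"Training Loss (?P<loss>\d+\.\d+)\s+\((?P<loss_in_parens>\d+\.\d+)\)\s+"
    r"Training Accuracy (?P<accuracy>\d+\.\d+)"
)

def parse_log(text):
    """Return one dict per complete record found in text."""
    records = []
    for m in RECORD_RE.finditer(text):
        records.append({
            "epoch": int(m.group("epoch")),
            "iteration": int(m.group("iteration")),
            "loss": float(m.group("loss")),
            "loss_in_parens": float(m.group("loss_in_parens")),
            "accuracy": float(m.group("accuracy")),
        })
    return records

# The sample from the question, followed by an invented continuation of the second record.
log = ("Epoch [1]   Iteration [0/51]    Training Loss 1.6095 (1.6095)   "
       "Training Accuracy 14.844    Epoch [1]   Iteration [10/51]"
       "    Training Loss 1.5000 (1.5500)   Training Accuracy 20.312")

records = parse_log(log)
for record in records:
    print(record)
```

Incomplete records, such as the trailing "Epoch [1]   Iteration [10/51]" in the original sample, are simply skipped because the whole pattern has to match.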

Answer 2

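Assuming you need to extract both the key (name) and the value of each entity, I am posting this code, which detects the names automatically and maps each one to its numbers.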

import re

# Data extracted from the file
extracted_data = """Epoch [1]   Iteration [0/51]    Training Loss 1.6095 (1.6095)   Training Accuracy 14.844    Epoch [1]   Iteration [10/51]"""
# Split the text into chunks on runs of two or more spaces, tabs, or newlines
split_data = re.split(r'([ ]{2,}|\t|\n)', extracted_data)
re_word = r'[a-z A-Z]*'  # matches the word part
re_dig = r'[\d.]*'  # matches the digit part
# Build a dict that maps each name to its full text and the numbers found in it
data = {
    re.findall(re_word, text)[0].strip(): {
        'full_text': text,
        'digit': [d for d in re.findall(re_dig, text) if d.strip()],
    }
    for text in split_data if text.strip()
}
print('Training Accuracy :', data['Training Accuracy']['digit'])
print('Training Loss:', data['Training Loss']['digit'])
print('Epoch:', data['Epoch']['digit'])

print(data.keys())  # the names that were extracted

Output:

Training Accuracy : ['14.844']
Training Loss: ['1.6095', '1.6095']
Epoch: ['1']
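One thing to keep in mind with this approach: a dict keyed by name keeps only the last chunk for a repeated name, so in the sample the second Iteration chunk replaces the first. A sketch that keeps every occurrence instead, assuming the same splitting rule; `grouped` is an illustrative name and the number pattern is slightly stricter than the one above:

```python
import re
from collections import defaultdict

extracted_data = ("Epoch [1]   Iteration [0/51]    Training Loss 1.6095 (1.6095)   "
                  "Training Accuracy 14.844    Epoch [1]   Iteration [10/51]")

# Map each name to a list with one entry per chunk in which the name appears.
grouped = defaultdict(list)
for chunk in re.split(r" {2,}|\t|\n", extracted_data):
    if not chunk.strip():
        continue  # skip empty chunks
    name = re.match(r"[a-zA-Z ]*", chunk).group().strip()
    numbers = re.findall(r"\d+(?:\.\d+)?", chunk)  # integers or decimals
    grouped[name].append(numbers)

print(grouped["Iteration"])      # [['0', '51'], ['10', '51']]
print(grouped["Training Loss"])  # [['1.6095', '1.6095']]
print(grouped["Epoch"])          # [['1'], ['1']]
```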

Python regular expressions: reading part of a text file
