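I extract very large strings (7-10k characters) from log files, and I need to automatically pull information out of these strings and tabulate it. Each string contains roughly 40 values entered by different people. Examples: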
Example string 1.) 'Color=Blue, [randomJunkdataExampleHere] Weight=345Kg, Age=34 Years, error#1 randomJunkdataExampleThere error#1'
Example string 2.) '[randomJunkdataExampleHere] Color=Red 42, Weight=256 Lbs., Age=34yers, error#1, error#2'
Example string 3.) 'Color=Yellow 13,Weight=345lbs., Age=56 [randomJunkdataExampleHere]'
The desired result is a new string, or even a dictionary, that organizes the data and is ready for database entry (one string per row of data):
Color,Weight,Age,Error#1Count,Error#2Count
blue,345,34,2,0
red,256,24,1,1
yellow,345,56,0,0
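I considered using re.search for each column/value, but because users enter the data in different ways, I don't know how to capture only the numbers I want to extract. I also don't know how to count how many times 'error#1' occurs in a string (the Error#1Count column).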
import re
line = '[randomJunkdataExampleHere] Color=Blue, Weight=345Kg, Age=34 Years, error#1, randomJunkdataExampleThere error#1'
try:
    Weight = re.search('Weight=(.+?), Age',line).group(1)
except AttributeError:
    Weight = 'ERROR'
Goal/result:
Color,Weight,Age,Error#1Count,Error#2Count
blue,345,34,2,0
red,256,24,1,1
yellow,345,56,0,0
Answer 0 (score: 0)
As mentioned above, 10,000 characters really isn't much.
import time
example_string_1 = 'Color=Blue, Weight=345Kg, Age=34 Years, error#1, error#1'
example_string_2 = 'Color=Red 42, Weight=256 Lbs., Age=34 yers, error#1, error#2'
example_string_3 = 'Color=Yellow 13, Weight=345lbs., Age=56'
def run():
    examples = [example_string_1, example_string_2, example_string_3]
    dict_list = []
    for example in examples:
        # First, tokenize the string to identify the individual data entries.
        tokens = example.split(', ')
        my_dict = {}
        for token in tokens:
            if '=' in token:  # Non-error case
                # Split the token into two parts, e.g. ['Color', 'Blue'].
                subtokens = token.split('=', 1)
                my_dict[subtokens[0]] = subtokens[1]
            elif '#' in token:  # Error case. Not sure if this is actually the format. If not, you'll have to find something else to key off of.
                if 'error' not in my_dict or my_dict['error'] is None:
                    my_dict['error'] = [token]
                else:
                    my_dict['error'].append(token)
        dict_list.append(my_dict)
    return dict_list

# Now let's test how fast it is.
before = time.time()
for i in range(100000):  # run it a hundred thousand times
    run()
after = time.time()
print("Time: {0}".format(after - before))
Which yields:
Time: 0.5782015323638916
See? Not too bad. Now all that's left is to iterate over the dictionaries and record the metrics you need.
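That last step could look something like the sketch below. It assumes each dictionary has the shape built in run() above (string values under 'Color', 'Weight' and 'Age', plus an optional 'error' list). The helper names first_int, first_word and summarize are made up for this illustration. It also assumes you only want the first number in each value and that the units (Kg vs. lbs) can be dropped, which may not be safe for your real data.

```python
import re
from collections import Counter

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else None

def first_word(text):
    """Return the first run of letters in text, lowercased, or None if there is none."""
    match = re.search(r'[A-Za-z]+', text or '')
    return match.group().lower() if match else None

def summarize(record):
    """Turn one tokenized record into a row of cleaned-up metrics."""
    # Counter returns 0 for keys it has never seen, so missing errors count as 0.
    errors = Counter(token.strip() for token in record.get('error', []))
    return {
        'Color': first_word(record.get('Color')),
        'Weight': first_int(record.get('Weight')),   # unit is discarded (assumption)
        'Age': first_int(record.get('Age')),
        'Error#1Count': errors['error#1'],
        'Error#2Count': errors['error#2'],
    }

# Hypothetical record, matching what the tokenizer produces for example_string_2.
record = {'Color': 'Red 42', 'Weight': '256 Lbs.', 'Age': '34 yers',
          'error': ['error#1', 'error#2']}
row = summarize(record)
print(','.join(str(value) for value in row.values()))  # prints: red,256,34,1,1
```

If the real strings contain junk between fields, the comma split may not isolate every token cleanly, so you might prefer to run the same regular expressions directly on the raw line instead.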