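I have a tricky search scenario for scanning a text file, and I am hoping for advice on the best way to handle it, either by breaking it down or with any useful modules. I have a text file in the form of the example below, and I am looking for a text sequence like "test1(OK) test2(OK)". If this search pattern is met, I then need to go back up the file, find the last 4 entries of another string, "String Group A", and capture the "Useful information" from each of those previous string groups. To make things harder, I have similar groups of information for 'B', and I have to run the same process for all of the Group 'B' information as well!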
String Group A
Useful information for A
String Group A
Useful information for A
String Group B
Useful information for B
String Group A
Useful information for A
String Group B
Useful information for B
String Group A
Useful information for A
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for B” from “String Group B”
String Group B
Useful information for B
String Group A
Useful information for A
And so on…
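Like I said, I am looking for ideas on the best way forward, as collecting the information seems to involve jumping around the text file too much. One idea I had was to record each 'String Group A' as line (x), and then, when the "test1(OK) test2(OK)" condition is met, go back to line (x), line (x-1), line (x-2) and line (x-3) and grab the "Useful information for A" for each, but I am not convinced this is the best way forward. I should point out that the text file is large and contains 1000s of entries for String Groups A and B.
Thanks for reading,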
MikG
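Answer 0 (score: 1)
As I read it, the problem is to find the list of occurrences of a particular pattern and to extract a block of text that follows each of them. The find_all() routine below returns the positions of all occurrences of a pattern (sub) in a string (s). The example after it outlines how to use it to get the test results, but it depends on finding a subsequent end_pattern.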
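def find_all(s, sub):
    """Return the start indices of all non-overlapping occurrences of sub in s."""
    indxs = []
    start = 0
    nsub = len(sub)
    while True:
        indx = s.find(sub, start)
        if indx < 0:
            break
        indxs.append(indx)
        # Continue just past this match; max() avoids looping forever on an empty sub.
        start = indx + max(nsub, 1)
    return indxs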
A usage sketch, given the string (test_results), the pattern for String Group A (group_A_pattern) and the pattern that ends the "useful information for A" (end_group_pattern):
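def get_test_results(test_results, group_A_pattern, end_group_pattern):
    starts = find_all(test_results, group_A_pattern)
    useful_A = []
    # Only the last four occurrences of the group pattern are of interest.
    for start0 in starts[-4:]:
        start = start0 + len(group_A_pattern)
        stop = test_results.find(end_group_pattern, start)
        if stop < 0:
            # No end pattern after this group: take the rest of the string.
            stop = len(test_results)
        useful_A.append(test_results[start:stop])
    return useful_A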
Here is some test code:
test_results = 'groupA some-useful end junk groupA more-useful end whatever'
group_A_pattern = 'groupA'
end_group_pattern = 'end'
print(get_test_results(test_results, group_A_pattern, end_group_pattern))
Running the test code above produces:
[' some-useful ', ' more-useful ']
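The routine above does not yet tie the extraction to the test line. One possible way to do that is to search only the text that precedes each match of the condition. This is a minimal sketch, assuming the whole file has been read into a single string and that each group header is immediately followed by one line of useful information, as in the question's sample. The name last_n_before_condition and its parameters are hypothetical, chosen only for illustration.

```python
import re

def last_n_before_condition(text, group_header, condition, n=4):
    """For each occurrence of `condition`, return the last `n` lines that
    directly follow a `group_header` line earlier in `text`."""
    # Match a header line and capture the single line right after it.
    pattern = re.compile(
        r'^' + re.escape(group_header) + r'[ \t]*\n(.*)$', re.MULTILINE)
    results = []
    for hit in re.finditer(re.escape(condition), text):
        # Only look at the part of the text before this condition match.
        earlier = text[:hit.start()]
        results.append(pattern.findall(earlier)[-n:])
    return results

# Made-up sample data in the shape of the question's file.
sample = (
    "String Group A\ninfo A1\n"
    "String Group B\ninfo B1\n"
    "String Group A\ninfo A2\n"
    "test1(OK) test2(OK)\n"
)
print(last_n_before_condition(sample, "String Group A", "test1(OK) test2(OK)"))
```

Note that this rescans the preceding text for every hit, so for a large file with many hits a single top-to-bottom pass is likely to be cheaper.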
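Answer 1 (score: 1)
Here is how to define a circular vector class that keeps track of only the data you might need while processing the file from top to bottom. It is fairly heavily commented so that it can be understood rather than being just a code dump. The details of the parsing depend, of course, very much on exactly what the input looks like. My code makes assumptions based on your example file, and you may need to change them. For example, startswith() may be too strict depending on your input; you may want to use find() instead.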
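To illustrate that last point, here is a tiny sketch of how the two checks differ when a line carries leading whitespace. The sample line is made up for the illustration.

```python
line = "   test1(OK) test2(OK) *** Condition Met ***"

# startswith() fails because the line begins with spaces.
print(line.startswith("test1(OK)"))

# find() succeeds because the text occurs somewhere in the line.
print(line.find("test1(OK)") != -1)

# The `in` operator is an equivalent, more readable membership test.
print("test1(OK)" in line)

# startswith() also works once the surrounding whitespace is stripped.
print(line.strip().startswith("test1(OK)"))
```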
Code
from __future__ import print_function
import sys
from itertools import chain


class circ_vec(object):
    """A circular fixed vector.
    """
    # The use of slots drastically reduces memory footprint of Python classes -
    # it removes the need for a per-instance __dict__
    __slots__ = ['end', 'elems', 'capacity']
    # end will keep track of where the next element is to be added
    # elems holds the last X elements that were added
    # capacity is how many elements we will hold

    def __init__(self, capacity):
        # we only need to specify the capacity up front
        # elems is empty
        self.end = 0
        self.elems = []
        self.capacity = capacity

    def add(self, e):
        new_index = self.end
        if new_index < len(self.elems):
            # The vector is full: overwrite the oldest element
            self.elems[new_index] = e
        else:
            # If we haven't seen capacity # of elements yet just append
            self.elems.append(e)
        self.end = (self.end + 1) % self.capacity

    def __len__(self):
        return len(self.elems)

    # This magic method allows bracket [ ] indexing; index 0 is the oldest element
    def __getitem__(self, index):
        if not 0 <= index < len(self.elems):
            raise IndexError(index)
        # Once the vector is full, the oldest element sits at self.end
        first = self.end if len(self.elems) == self.capacity else 0
        return self.elems[(first + index) % len(self.elems)]

    # This magic method allows iteration
    def __iter__(self):
        if not self.elems:
            return iter([])
        elif len(self.elems) < self.capacity:
            first = 0
        else:
            first = self.end
        # Iterate from the oldest element to the newest
        return chain(iter(self.elems[first:]), iter(self.elems[:first]))


string_group_last_four = {k: circ_vec(4) for k in ['A', 'B']}

with open(sys.argv[1], 'r') as f:
    string_group_context = None
    # We will manually iterate through the file. Get an iterator using iter().
    it = iter(f)
    # As per the example, the file we're reading groups lines in twos.
    buf = circ_vec(2)
    try:
        while True:
            line = next(it)
            buf.add(line.strip())
            # The lines beginning with 'String Group' should be recorded in case we need them later.
            if line.startswith('String Group'):
                # Here is the benefit of manual iteration. We can call next() more than once per loop iteration.
                # Sometimes once we've read a line, we just want to immediately get the next line.
                # strip() removes whitespace and the newline characters
                buf.add(next(it).strip())
                # How exactly you will parse your lines depends on your needs. Here, I assume that the last word in
                # the current line is an identifier that we are interested in.
                string_group = line.strip().split()[-1]
                # Add the lines in the buffer to the circular vector belonging to the identifier.
                string_group_last_four[string_group].add(list(buf))
                buf = circ_vec(2)
            # For lines beginning with 'Other Main String for', we need to
            # remember the identifier but there's no other information to
            # record.
            elif line.startswith('Other Main String for'):
                string_group_context = line.strip().split()[-1]
            # Use find() instead of startswith() because the
            # 'test1(OK) test2(OK)' lines begin with whitespace. startswith()
            # would depend on the specific whitespace characters which could
            # be confusing.
            elif line.find('test1(OK) test2(OK)') != -1:
                print('String group' + string_group_context + ' has a test hit!')
                # Print out the test lines.
                for l in buf:
                    print(l)
                print('Four most recent "String Group ' + string_group_context + '" lines:')
                # Use the identifier dict to get the last 4 relevant groups of lines
                for cv in string_group_last_four[string_group_context]:
                    for l in cv:
                        print(l)
    # Manual iteration is terminated by an exception in Python. Catch and swallow it
    except StopIteration:
        pass
print("Done!")
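For comparison, the standard library's collections.deque with a maxlen argument offers the same bounded "last N" behaviour as a hand-written circular buffer. The following is a rough sketch under the same assumptions about the input format (a header line followed by one line of useful information). The function name collect_hits and its parameters are illustrative only, and it returns the collected data instead of printing it.

```python
from collections import deque

def collect_hits(lines, groups=("A", "B"), keep=4,
                 condition="test1(OK) test2(OK)"):
    """Return a list of (group, [useful lines]) for every condition hit."""
    # One bounded history per group; old entries fall off automatically.
    recent = {g: deque(maxlen=keep) for g in groups}
    context = None
    hits = []
    it = iter(lines)
    for raw in it:
        line = raw.strip()
        if line.startswith("String Group"):
            group = line.split()[-1]
            # Assumption: the line after a header holds the useful information.
            info = next(it, "").strip()
            if group in recent:
                recent[group].append(info)
        elif line.startswith("Other Main String for"):
            context = line.split()[-1]
        elif condition in line and context in recent:
            # Copy the deque so later appends do not change this result.
            hits.append((context, list(recent[context])))
    return hits

# Made-up sample data in the shape of the question's file.
sample = [
    "String Group A", "info A1",
    "String Group B", "info B1",
    "String Group A", "info A2",
    "Other Main String for A",
    "  test1(OK) test2(OK) *** Condition Met ***",
]
print(collect_hits(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so the file is still read only once from top to bottom.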
Test file contents. I tried to make it a little odd in order to exercise the code a bit.
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
String Group 1 A
Useful information for A
String Group 2 A
Useful information for A
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
String Group 1 B
Useful information for B
String Group 3 A
Useful information for A
String Group 2 B
Useful information for B
String Group 4 A
Useful information for A
String Group 5 A
Useful information for A
String Group 6 A
Useful information for A
String Group 3 B
Useful information for B
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
String Group 4 B
Useful information for B
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
String Group 7 A
Useful information for A
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Output
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String Group 1 A
Useful information for A
String Group 2 A
Useful information for A
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String Group 3 A
Useful information for A
String Group 4 A
Useful information for A
String Group 5 A
Useful information for A
String Group 6 A
Useful information for A
String groupB has a test hit!
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group B" lines:
String Group 1 B
Useful information for B
String Group 2 B
Useful information for B
String Group 3 B
Useful information for B
String groupB has a test hit!
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group B" lines:
String Group 1 B
Useful information for B
String Group 2 B
Useful information for B
String Group 3 B
Useful information for B
String groupB has a test hit!
Other Main String for B
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group B" lines:
String Group 1 B
Useful information for B
String Group 2 B
Useful information for B
String Group 3 B
Useful information for B
String Group 4 B
Useful information for B
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String Group 4 A
Useful information for A
String Group 5 A
Useful information for A
String Group 6 A
Useful information for A
String Group 7 A
Useful information for A
Done!