Searching for multiple strings in a txt file with Python

Date: 2014-06-23 10:08:28

Tags: python

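I have a tricky search scenario for scanning a text file, and I'm hoping for advice on the best way to handle it, whether by breaking the problem down or by using any helpful modules. I have a text file in the form of the example below, and I am looking for a text sequence such as "test1(OK) test2(OK)". When this search pattern is matched, I need to go back through the file, find the last 4 entries of another string, "String Group A", and capture the "Useful information" from each of those preceding string groups. To make things harder, I have a similar set of information groups for 'B', so I have to carry out the same process for all of the Group 'B' information as well!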

String Group A
    Useful information for A

String Group A
    Useful information for A

String Group B
    Useful information for B

String Group A
    Useful information for A

String Group B
    Useful information for B

String Group A
    Useful information for A

Other Main String for A
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

Other Main String for B
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for B” from “String Group B”

String Group B
    Useful information for B

String Group A
    Useful information for A

And so on…

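As I said, I'm looking for ideas on the best way forward, because collecting information from this text file seems to involve a lot of jumping around. One idea I had was to keep a count of the 'String Group A' entries as line(x); then, when the "test1(OK) test2(OK)" condition is met, go back to line(x), line(x-1), line(x-2) and line(x-3) and grab the "Useful information for A" from each. I'm not convinced this is the best way forward, though. I should point out that the text file is large and contains 1000 entries for String Groups A and B.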

Thanks for reading,

MikG

2 answers:

Answer 0 (score: 1)

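As I interpret it, the problem is to find a list of occurrences of a particular pattern and then extract a block of text for entries in that list. The find_all() routine below returns the positions of all occurrences of a pattern (sub) in a string. The example after it outlines how to use it to get the test results, but it depends on finding a subsequent end_pattern.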

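def find_all(s, sub):
    """Return the start index of every non-overlapping occurrence of sub in s."""
    indxs = []
    start = 0
    ns = len(s)
    nsub = len(sub)
    while True:
        indx = s.find(sub, start, ns)
        if indx < 0:
            break
        indxs.append(indx)
        # Resume the search just past the match that was found
        start = indx + nsub
    return indxs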

A usage sketch, given the string (test_results), the pattern for String Group A (group_A_pattern), and the pattern that ends the "Useful information for A" (end_group_pattern):

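def get_test_results(test_results, group_A_pattern, end_group_pattern):
    # Positions of every group header; only the last four are of interest
    starts = find_all(test_results, group_A_pattern)
    useful_A = []
    for start0 in starts[-4:]:
        start = start0 + len(group_A_pattern)
        stop = test_results.find(end_group_pattern, start)
        if stop < 0:
            # No end pattern after this header: take the rest of the string
            stop = len(test_results)
        useful_A.append(test_results[start:stop])
    return useful_A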

Here is the test code:

test_results = 'groupA some-useful end junk groupA more-useful end whatever'
group_A_pattern = 'groupA'
end_group_pattern = 'end'
print(get_test_results(test_results, group_A_pattern, end_group_pattern))

Running the test code above produces:

[' some-useful ', ' more-useful ']
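
The routine above looks at the whole string, whereas the question asks for the four groups that precede each "test1(OK) test2(OK)" hit. The following is a minimal sketch of one way to combine the two, not a definitive implementation. The names line_after() and last_groups_before() are hypothetical, find_all() is restated with an optional end argument so the block is self-contained, and it assumes that the header text appears only on header lines, that the useful information is the single line after each header, and that the test result is the line after "Other Main String for X".

```python
def find_all(s, sub, end=None):
    """Return the start index of every non-overlapping occurrence of sub in s[:end]."""
    end = len(s) if end is None else end
    indexes = []
    start = 0
    while True:
        i = s.find(sub, start, end)
        if i < 0:
            return indexes
        indexes.append(i)
        start = i + len(sub)


def line_after(text, pos):
    """Return the stripped line that follows the line containing index pos."""
    nl = text.find("\n", pos)
    if nl < 0:
        return ""
    end = text.find("\n", nl + 1)
    return text[nl + 1:end if end >= 0 else len(text)].strip()


def last_groups_before(text, group, count=4, test="test1(OK) test2(OK)"):
    """For each test hit for `group`, list the useful lines of the last
    `count` 'String Group <group>' entries that appear before the hit."""
    header = "String Group " + group
    marker = "Other Main String for " + group
    results = []
    for marker_pos in find_all(text, marker):
        # Assumption: the test result is on the line right after the marker
        if test not in line_after(text, marker_pos):
            continue
        # Only headers located before the marker are considered
        headers = find_all(text, header, marker_pos)[-count:]
        results.append([line_after(text, h) for h in headers])
    return results
```

Calling last_groups_before(text, "A") and then last_groups_before(text, "B") covers both groups. The sketch rescans the text from the start for every hit, which keeps it simple but is not the fastest option for a large file.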

Answer 1 (score: 1)

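Here is how to define a circular vector class that keeps track of only the data you might need while processing the file from top to bottom. It is fairly heavily commented so that it can be understood rather than being just a code dump. The parsing details of course depend heavily on exactly what your input looks like. My code makes assumptions based on your example file that you may need to change. For example, startswith() may be too strict depending on your input, and you may want to use find() instead.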
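
To illustrate the startswith() versus find() point, here is a small sketch using one indented line modelled on the example file. The variable names are made up for illustration.

```python
line = "    test1(OK) test2(OK)  *** Condition Met ***\n"
pattern = "test1(OK) test2(OK)"

# startswith() fails because of the leading indentation
strict = line.startswith(pattern)

# Stripping whitespace first makes startswith() usable again
stripped = line.strip().startswith(pattern)

# find() and the `in` operator match the pattern anywhere in the line
found = line.find(pattern) != -1
contained = pattern in line

print(strict, stripped, found, contained)  # False True True True
```

Matching anywhere in the line is more forgiving, but it also matches lines that merely mention the pattern, so which test is appropriate depends on the input.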

Code

from __future__ import print_function
import sys
from itertools import chain

class circ_vec(object):
    """A circular fixed vector.
    """
    # The use of slots drastically reduces memory footprint of Python classes -
    # it removes the need for a hash table for every object
    __slots__ = ['end', 'elems', 'capacity']
    # end will keep track of where the next element is to be added
    # elems holds the last X elements that were added
    # capacity is how many elements we will hold

    def __init__(self, capacity):
        # we only need to specify the capacity up front
        # elems is empty
        self.end = 0
        self.elems = []
        self.capacity = capacity

    def add(self, e):
        new_index = self.end
        if new_index < len(self.elems):
            self.elems[new_index] = e
        else:
            # If we haven't seen capacity # of elements yet just append
            self.elems.append(e)
        self.end = (self.end + 1) % self.capacity

    def __len__(self):
        return len(self.elems)

    # This magic method allows bracket [ ] indexing; index 0 is the oldest element
    def __getitem__(self, index):
        if not 0 <= index < len(self.elems):
            raise IndexError(index)
        # Until the vector is full the oldest element is at position 0;
        # once it has wrapped around, the oldest element sits at self.end.
        if len(self.elems) < self.capacity:
            first = 0
        else:
            first = self.end
        return self.elems[(first + index) % len(self.elems)]

    # This magic method allows iteration
    def __iter__(self):
        if not self.elems:
            return iter([])
        elif len(self.elems) < self.capacity:
            first = 0
        else:
            first = self.end
        # Iterate from the oldest element to the newest
        return chain( iter(self.elems[first:]), iter(self.elems[:first]) )

string_group_last_four = { k : circ_vec(4) for k in ['A', 'B'] }
with open(sys.argv[1], 'r') as f:
    string_group_context = None
    # We will manually iterate through the file.  Get an iterator using iter().
    it = iter(f)
    # As per the example, the file we're reading groups lines in twos.
    buf = circ_vec(2)
    try:
        while(True):
            line = next(it)
            buf.add(line.strip())
            # The lines beginning with 'String Group' should be recorded in case we need them later.
            if line.startswith('String Group'):
                # Here is the benefit of manual iteration.  We can call next() more than once per loop iteration.
                # Sometimes once we've read a line, we just want to immediately get the next line.
                # strip() removes whitespace and the newline characters
                buf.add(next(it).strip())
                # How exactly you will parse your lines depends on your needs. Here, I assume that the last word in
                # the current line is an identifier that we are interested in.
                string_group = line.strip().split()[-1]
                # Add the lines in the buffer to the circular vector belonging to the identifier.
                string_group_last_four[string_group].add( list(l for l in buf) )
                buf = circ_vec(2)
            # For lines beginning with 'Other Main String for', we need to
            # remember the identifier but there's no other information to
            # record.
            elif line.startswith('Other Main String for'):
                string_group_context = line.strip().split()[-1]
            # Use find() instead of startswith() because the
            # 'test1(OK) test2(OK)' lines begin with whitespace. startswith()
            # would depend on the specific whitespace characters, which could
            # be confusing.
            elif line.find('test1(OK) test2(OK)') != -1:
                print('String group' + string_group_context + ' has a test hit!')
                # Print out the test lines.
                for l in buf: print(l)
                print('Four most recent "String Group ' + string_group_context + '" lines:')
                # Use the identifier dict to get the last 4 relevant groups of lines
                for cv in string_group_last_four[string_group_context]:
                    for l in cv: print(l)
    # Manual iteration is terminated by an exception in Python.  Catch and swallow it
    except StopIteration: pass
print("Done!")

Test file contents. I tried to make it a little odd in order to exercise the code a bit.

Other Main String for A
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

String Group 1 A
    Useful information for A

String Group 2 A
    Useful information for A

Other Main String for A
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

String Group 1 B
    Useful information for B

String Group 3 A
    Useful information for A

String Group 2 B
    Useful information for B

String Group 4 A
    Useful information for A

String Group 5 A
    Useful information for A

String Group 6 A
    Useful information for A

String Group 3 B
    Useful information for B

Other Main String for A
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

Other Main String for B
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

Other Main String for B
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

String Group 4 B
    Useful information for B

Other Main String for B
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

String Group 7 A
    Useful information for A

Other Main String for A
    test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”

Output

String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String Group 1 A
Useful information for A
String Group 2 A
Useful information for A
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String Group 3 A
Useful information for A
String Group 4 A
Useful information for A
String Group 5 A
Useful information for A
String Group 6 A
Useful information for A
String groupB has a test hit!
Other Main String for B
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group B" lines:
String Group 1 B
Useful information for B
String Group 2 B
Useful information for B
String Group 3 B
Useful information for B
String groupB has a test hit!
Other Main String for B
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group B" lines:
String Group 1 B
Useful information for B
String Group 2 B
Useful information for B
String Group 3 B
Useful information for B
String groupB has a test hit!
Other Main String for B
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group B" lines:
String Group 1 B
Useful information for B
String Group 2 B
Useful information for B
String Group 3 B
Useful information for B
String Group 4 B
Useful information for B
String groupA has a test hit!
Other Main String for A
test1(OK) test2(OK)  *** Condition Met *** #Now go back and collect the last 4 entries of “Useful information for A” from “String Group A”
Four most recent "String Group A" lines:
String Group 4 A
Useful information for A
String Group 5 A
Useful information for A
String Group 6 A
Useful information for A
String Group 7 A
Useful information for A
Done!
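
The same "keep only the last four" behaviour is also available from the standard library: collections.deque with maxlen discards the oldest item automatically. The following is a minimal sketch of that alternative, not the code from this answer. The function name scan() and its parameters are hypothetical, and it makes the same parsing assumptions as above: the group identifier is the last word of the header line, and the useful information is the single line after it.

```python
from collections import defaultdict, deque


def scan(lines, test="test1(OK) test2(OK)", keep=4):
    """Yield (group, useful_lines) for every test hit, in a single pass."""
    # One bounded deque per group; old entries fall off automatically
    recent = defaultdict(lambda: deque(maxlen=keep))
    context = None  # group named by the latest 'Other Main String for' line
    it = iter(lines)
    for raw in it:
        line = raw.strip()
        if line.startswith("String Group"):
            group = line.split()[-1]
            # Assumption: the useful information is the very next line
            recent[group].append(next(it, "").strip())
        elif line.startswith("Other Main String for"):
            context = line.split()[-1]
        elif test in line and context is not None:
            yield context, list(recent[context])
```

Because an open file is an iterable of lines, it can be passed straight in, for example `with open(path) as f:` followed by `for group, useful in scan(f): print(group, useful)`, where path is whatever file is being scanned.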