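I'm trying to write a Python script that searches large text files for Oracle error numbers. The files have no guaranteed record delimiter, so I process them in multi-byte chunks.
Matching a regex inside a chunk seems trivial, but I'm struggling to handle partial matches at the start or end of a chunk.
The full regex to match is an Oracle error number, something like `"ORA\-[0-9]{1,5}"`.
How do I write a regex that matches only part of it? For example, a partial match at the end of a chunk would be one of: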
(O$, OR$, ORA$, ORA\-$, ORA\-n$, or ORA\-nn$)
Conversely, at the start of a chunk I would search for:
(^n, ^nn, ^\-nn, ^A\-nn, or ^RA\-nn)
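The partial match at the end of a chunk would be saved so it can be compared against the start of the next chunk.
Positive lookaheads looked promising, but they don't match the other characters I need. Can this kind of lookup be done efficiently with a regex?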
Answer 0 (score: 1)
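I think the real answer here is that you don't want to run regexes over the raw chunks. Regexes are a bit too high-level for what you want to do. What you need is a tokenizer. Tokenizers are a well-understood technique because they're a vital part of every compiler. They're what breaks text up into lexemes, the pieces of text that actually mean something. The key characteristic that matters to you here is that a tokenizer looks at one character at a time to tokenize the source string. That lets you stream the file instead of loading it in chunks, and it avoids all the nastiness of dealing with chunk boundaries.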
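One practical note on reading a character at a time: calling `read(1)` on a file works, but it makes one call per character. Here is a minimal sketch, assuming a text-mode file object, that still reads in chunks under the hood while handing the tokenizer a plain stream of characters. The name `iter_chars` and the default chunk size are my own choices for illustration, not something from the question.

```python
import io


def iter_chars(stream, chunk_size=65536):
    """Yield the characters of a text stream one at a time.

    The stream is read in chunks of chunk_size characters, but the caller
    never sees the chunk boundaries.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # read() returns an empty string at end of file.
            return
        for c in chunk:
            yield c


# Small demonstration with an in-memory stream and a tiny chunk size.
print("".join(iter_chars(io.StringIO("abcdef"), chunk_size=4)))  # abcdef
```

Whether this is noticeably faster than `read(1)` on a buffered file depends on your setup, so measure before relying on it.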
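A tokenizer is just an implementation of a finite state machine. (You should note that a regex is also just a definition of a finite state machine.) All you have to do is decide what your states are and when to emit a lexeme. Since you only have a small set of states to work with, this isn't actually that hard. The idea is basic: you write a big if/else block that first checks what state you're currently in (which was determined by the previous characters), and then does some more conditional logic based on the current character.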
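To make the state idea concrete, here is a compact sketch in which the state is simply the text matched so far: its length tells you which character is expected next. This is an illustration under my own naming (`iter_ora_errors`, `PREFIX`, `MAX_DIGITS` are hypothetical), and it takes any iterable of single characters, so a plain string works for trying it out.

```python
PREFIX = "ORA-"
MAX_DIGITS = 5


def iter_ora_errors(chars):
    """Yield every ORA-<1 to 5 digits> found in an iterable of characters."""
    matched = ""  # the state: how much of the pattern we have seen so far
    for c in chars:
        if len(matched) >= len(PREFIX):
            # The prefix is complete, so we are collecting digits.
            digits = len(matched) - len(PREFIX)
            if "0" <= c <= "9" and digits < MAX_DIGITS:
                matched += c
                continue
            if digits >= 1:
                yield matched  # at least one digit: a full match
            matched = ""
            # Fall through: c may start a new match.
        if c == PREFIX[len(matched)]:
            matched += c      # the next expected prefix character
        elif c == PREFIX[0]:
            matched = c       # mismatch, but this 'O' starts a new attempt
        else:
            matched = ""      # mismatch, throw the partial match away
    if len(matched) > len(PREFIX):
        yield matched         # input ended right after a match


print(list(iter_ora_errors("xxORA-12345678 ORORA-1 ORA-x ORA-007")))
# ['ORA-12345', 'ORA-1', 'ORA-007']
```

Because the state survives from one character to the next, it makes no difference where the underlying reads happen to split the file.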
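By the way, if you want to understand this stuff better, take a compilers course. The concepts and techniques you learn there are very useful for complex text processing. It's a little surprising how often they turn out to be a good solution when you're building something that processes text.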
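Tokenizer code tends to be a bit verbose and ugly, but it's very standard. The fact that it more or less follows a standard pattern makes it relatively easy to understand and, most importantly, it works. I wrote one below. There's probably a shorter way to write the checks for multiple digits, but I did it the long way so it's easier to see what's going on. I haven't actually tested this code, so test and debug it thoroughly, but the logic should be sound. Good luck.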
import re

# Gonna be using this a lot, so compile it.
digit_pattern = re.compile('[0-9]')


# We're creating a class because there's a little bit of state to maintain.
class OracleErrorFinder(object):
    def __init__(self, input_file):
        # input_file must be opened in text mode, so that read(1) returns ''
        # at the end of the stream.
        self.input_file = input_file
        # This seems weird, but there's a good reason.
        # When we get to the end of a match, we're going to have already consumed
        # the next character from the file. So we need to save it for the next round.
        self.next_char = None

    def find_next_match(self):
        # Possible states are
        # '': We haven't found any portion of the pattern yet.
        # 'O': We found an O
        # 'R': We found an OR
        # 'A': We found an ORA
        # '-': We found an ORA-
        # 'num1': We found ORA-[0-9]
        # 'num2': We found ORA-[0-9][0-9]
        # 'num3': We found ORA-[0-9][0-9][0-9]
        # 'num4': We found ORA-[0-9][0-9][0-9][0-9]
        # 'num5': We found ORA-[0-9][0-9][0-9][0-9][0-9], and we're done
        current_state = ''
        match_so_far = ''
        done = False
        while not done:
            if self.next_char:
                # If we have a leftover char from last time,
                # start with that and clear it.
                c = self.next_char
                self.next_char = None
            else:
                c = self.input_file.read(1)

            if '' == c:
                # End of stream. If we already have at least one digit
                # (e.g. the file ends in ORA-123), that is a full match and
                # we return it. Otherwise there is nothing left to find.
                if not current_state.startswith('num'):
                    match_so_far = None
                done = True
            elif '' == current_state and 'O' == c:
                # We found the start of what we're looking for.
                # We don't know if it's the whole thing,
                # so we just save it and go to the next character.
                current_state = 'O'
                match_so_far = 'O'
            elif 'O' == current_state and 'R' == c:
                # We already have an O and now we found the next character!
                current_state = 'R'
                match_so_far += c
            elif 'R' == current_state and 'A' == c:
                current_state = 'A'
                match_so_far += c
            elif 'A' == current_state and '-' == c:
                current_state = '-'
                match_so_far += c
            elif '-' == current_state and digit_pattern.match(c):
                current_state = 'num1'
                match_so_far += c
            elif 'num1' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num2'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num2' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num3'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num3' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num4'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num4' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num5'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num5' == current_state:
                # We're done for sure!
                # Note that we read the next character from the file.
                # Important for code after the loop.
                done = True
            else:
                # We didn't find the next character we wanted.
                if 'O' == c:
                    # We didn't find a full match, but this starts
                    # a new one.
                    current_state = 'O'
                    match_so_far = 'O'
                else:
                    # This character doesn't match our pattern.
                    # It could be a character that's in the wrong place
                    # (such as the - in OR-) or a character that just
                    # doesn't appear in the pattern at all (like X).
                    # We might be in the middle of a partial
                    # match, so throw everything found so far away
                    # and keep going.
                    current_state = ''
                    match_so_far = ''

        # Save next char already consumed from file stream.
        # Could be empty string if we consumed the whole file,
        # but that's fine.
        self.next_char = c
        return match_so_far
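
# 'alert.log' is only a placeholder path; point this at the file you want to scan.
filename = 'alert.log'
with open(filename) as f:
    finder = OracleErrorFinder(f)
    while True:
        match = finder.find_next_match()
        if match is None:
            break
        # Print, send to file, add to list, what have you
        print(match)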
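
If you would rather stay with regexes and chunks, as the question originally asked, the partial match at the end of a chunk can itself be written as a regex made of nested optional groups anchored at the end of the text. The following is a rough sketch rather than a tested solution: `find_in_chunks`, `FULL`, `TAIL` and the chunk size are names and values I made up for illustration, and it assumes a text-mode stream. The unfinished tail of each chunk is held back and prepended to the next chunk, so only the end of a chunk needs special handling, not the start.

```python
import io
import re

FULL = re.compile(r"ORA-[0-9]{1,5}")
# An unfinished (or still extendable) match that runs up to the end of the text:
# O, OR, ORA, ORA-, ORA-n, ... each followed directly by the end of the text.
TAIL = re.compile(r"O(?:R(?:A(?:-[0-9]{0,5})?)?)?\Z")
MAX_MATCH_LEN = len("ORA-") + 5


def find_in_chunks(stream, chunk_size=65536):
    """Yield ORA-nnnnn matches from a text stream that is read in chunks."""
    carry = ""  # held-back tail of the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = carry + chunk
        # Only the last MAX_MATCH_LEN characters can hold an unfinished match.
        tail = TAIL.search(text, max(0, len(text) - MAX_MATCH_LEN))
        cut = tail.start() if tail else len(text)
        # Report matches that lie entirely before the held-back tail.
        for m in FULL.finditer(text, 0, cut):
            yield m.group()
        carry = text[cut:]
    # End of stream: whatever was held back is now final.
    for m in FULL.finditer(carry):
        yield m.group()


demo = io.StringIO("aaORA-123bbORA-45678 ORA-9")
print(list(find_in_chunks(demo, chunk_size=4)))
# ['ORA-123', 'ORA-45678', 'ORA-9']
```

The held-back text is never longer than nine characters, so the extra string concatenation per chunk stays cheap.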