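Suppose you want to run a regular expression search over data arriving through a pipe and extract the matches, but the pattern may span multiple lines. How would you do it? Is there perhaps a regular expression library that works on streams?
I would prefer to do this with a Python library, but any solution is fine, as long as it is a library and not a command-line tool.
By the way, I know how to solve my immediate problem; I am just looking for a general solution. If no such library exists, why can't regular expression libraries work on streams, given that a regular matching algorithm never needs to scan backward?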
Answer 0 (score: 6)
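If you are after a general solution, your algorithm would need to look something like this:
1. Read a chunk of the stream into a buffer.
2. Search for the regexp in the buffer.
3. If the pattern matches, do whatever you want with the match, discard the start of the buffer up to match.end(), and go to step 2.
4. If the pattern does not match, extend the buffer with more data from the stream and go to step 2.
This can end up using a lot of memory if no match is found, but it is hard to do better in the general case (consider matching .*x as a multi-line regexp against a large file in which the only x is the last character).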
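The buffer-and-retry loop described above can be sketched in Python 3 with the standard re module. This is only an illustration under stated assumptions: the name iter_stream_matches and the chunk_size parameter are made up for this example, the stream is assumed to be a text-mode object with read(n), and the pattern is assumed never to match the empty string. The check that defers a match touching the end of the buffer is a heuristic for greedy patterns, not a guarantee. Discarding the start of the buffer also changes what anchors and lookbehinds see.

```python
import io
import re


def iter_stream_matches(regex, stream, chunk_size=4096):
    """Yield the text of each non-empty match of `regex` found in `stream`.

    `stream` is any object with a text-mode read(n) method (an assumption
    of this sketch); `regex` is a compiled pattern.
    """
    buf = ""
    at_eof = False
    while True:
        match = regex.search(buf)
        usable = (
            match is not None
            and match.end() > match.start()         # ignore empty matches
            and (at_eof or match.end() < len(buf))  # a match touching the end
        )                                           # of the buffer may still grow
        if usable:
            yield match.group(0)
            buf = buf[match.end():]  # discard everything up to match.end()
            continue                 # and search the remaining buffer again
        if at_eof:
            return
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk             # no usable match yet: extend the buffer
        else:
            at_eof = True            # next pass accepts matches at the very end


if __name__ == "__main__":
    text = io.StringIO("foo\nbar baz foo\nbar end")
    pattern = re.compile(r"foo\s+bar")
    for found in iter_stream_matches(pattern, text, chunk_size=3):
        print(repr(found))
```

As the answer warns, the buffer here grows without bound while nothing matches.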
If you know more about the regexp, there may be other situations in which you can discard part of the buffer.
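One such situation, as a hedged example: if you know an upper bound on the length of any match and the pattern uses no lookahead or lookbehind, then whenever the search fails only the last max_match_len characters of the buffer can still take part in a future match, so the rest can be dropped. The function name iter_bounded_matches and its parameters are invented for this sketch (Python 3, standard re module only), and the pattern is again assumed never to match the empty string.

```python
import io
import re


def iter_bounded_matches(regex, stream, max_match_len, chunk_size=4096):
    """Yield non-empty matches while keeping the buffer small.

    Assumes (not checked) that no match of `regex` is longer than
    `max_match_len` characters and that the pattern has no lookaround.
    """
    buf = ""
    at_eof = False
    while True:
        match = regex.search(buf)
        if (match is not None
                and match.end() > match.start()
                and (at_eof or match.end() < len(buf))):
            yield match.group(0)
            buf = buf[match.end():]
            continue
        if at_eof:
            return
        # No usable match: a later match can only start inside the last
        # max_match_len characters, so everything before that is dropped.
        buf = buf[-max_match_len:]
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            at_eof = True


if __name__ == "__main__":
    data = "x" * 50 + "foo\n bar" + "y" * 50 + "foo bar"
    pattern = re.compile(r"foo\s{1,3}bar")  # longest possible match: 9 characters
    for found in iter_bounded_matches(pattern, io.StringIO(data), 9, chunk_size=4):
        print(repr(found))
```

With this variant the buffer never holds more than max_match_len + chunk_size characters between reads, regardless of how long the stream runs without a match.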
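Answer 1 (score: 2)
I solved a similar problem of searching a stream with classic pattern matching. You may want to subclass the Matcher class of my solution, streamsearch-py, and perform the regexp matching in the buffer. See kmp_example.py, included below, for a template. If it turns out that classic Knuth-Morris-Pratt matching is all you need, then this small open-source library solves your problem right away :-)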
#!/usr/bin/env python
# Copyright 2014-2015 @gitagon. For alternative licenses contact the author.
#
# This file is part of streamsearch-py.
# streamsearch-py is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# streamsearch-py is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
# You should have received a copy of the GNU Affero General Public License
# along with streamsearch-py. If not, see <http://www.gnu.org/licenses/>.
from streamsearch.matcher_kmp import MatcherKMP
from streamsearch.buffer_reader import BufferReader

class StringReader():
    """for providing an example read() from string required by BufferReader"""
    def __init__(self, string):
        self.s = string
        self.i = 0

    def read(self, buf, cnt):
        if self.i >= len(self.s): return -1
        r = self.s[self.i]
        buf[0] = r
        result = 1
        print "read @%s" % self.i, chr(r), "->", result
        self.i+=1
        return result

def main():
    w = bytearray("abbab")
    print "pattern of length %i:" % len(w), w
    s = bytearray("aabbaabbabababbbc")
    print "text:", s
    m = MatcherKMP(w)
    r = StringReader(s)
    b = BufferReader(r.read, 200)
    m.find(b)
    print "found:%s, pos=%s " % (m.found(), m.get_index())

if __name__ == '__main__':
    main()
Output
pattern of length 5: abbab
text: aabbaabbabababbbc
read @0 a -> 1
read @1 a -> 1
read @2 b -> 1
read @3 b -> 1
read @4 a -> 1
read @5 a -> 1
read @6 b -> 1
read @7 b -> 1
read @8 a -> 1
read @9 b -> 1
found:True, pos=5
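Answer 2 (score: -2)
I don't believe it is possible to use a regexp on a stream, because without the whole of the data you cannot make a positive match. That means you only ever have a possible match.
However, as @James Henstridge said, you can use a buffer to work around this.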
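To make the "possible match" point concrete, here is a small Python 3 illustration with the standard re module. The chunk contents are made up for the example. A greedy pattern matched against only the first chunk reports a shorter match than the one found once the next chunk has been appended to the buffer.

```python
import re

number = re.compile(r"\d+")

first_chunk = "id=12"   # what has arrived so far (made-up data)
second_chunk = "345;"   # what arrives next

early = number.search(first_chunk)                # sees only part of the number
full = number.search(first_chunk + second_chunk)  # sees the buffered whole

print(early.group(0))
print(full.group(0))
```

This is why a match that touches the end of the buffered data is best treated as provisional until more input, or the end of the stream, has been seen.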