Question

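I have a section of text that I need to match. My conditions are: match everything, including the title. The title pattern already works for me; I also need to match the paragraphs that begin with "fig". I have that working, but I found that it stops matching as soon as it hits a line that does not match. Another condition: a paragraph with fewer than 3 words should not match.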

Here is the sample text:

List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v

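There can be any number of blank lines between paragraphs. What happens here is that, after the end of the line in the paragraph starting with Figure 2, the following words do not match because they do not begin with "Fig", yet a later sentence does begin with "Fig". How can I match that line starting with Fig.y?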

This is my regex:

'((?:^(?:Supp[elmntary]*\s|list\sof\s)?[^\n]*Fig[ures]*[^\n]*(?:Captions?|Legends?|Lists?)[^\n])(?:(?!^)[^\n]+|(?!\n\w+\s*\w+\s*:?\s*$)\n|Fig)*)'

Flags used: re.I, re.M, re.S (DOTALL)

I tried adding a look-ahead:

(?:.*^Fig[^\n]*$){0,}

But that does not work, because I cannot find a way to skip the lines containing "no match" and "here".

Any help is appreciated. I will be using re.findall.

Answer 1

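New answer: it is possible I still have not fully understood your requirements, but I will take another crack at it. I am assuming that the right expression for capturing the title can be lifted from your original regex.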

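# Written for Python 2.7; the parenthesized print calls also run on Python 3
# Typos may exist, didn't test yet
import re

def emitRecord(matches):
  # Print one record: the title followed by its figure paragraphs
  if matches:
    print("----- Start record -----")
    print("\n".join(matches))
    print("----- End record -----")

matches = []
seenTitle = False
# Placeholder: substitute the title-matching part of your original regex
titleRegex = re.compile(r'expression to capture titles here')
# "fig" or "figure" at the start of a line, followed by a non-letter
figureRegex = re.compile(r'^(?:fig|figure)[^a-z]', re.I)
with open('text.txt', 'r') as text:
  for line in text:
    if not line.strip(): continue  # skip blank lines
    if titleRegex.search(line):
      # A new title closes the previous record and starts a new one
      seenTitle = True
      emitRecord(matches)
      matches = [line.strip()]
    elif seenTitle:
      if len(line.split()) < 3: continue  # fewer than 3 words: no match
      if figureRegex.search(line): matches.append(line.strip())
emitRecord(matches)  # flush the last record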
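
For anyone who wants to try the idea without a file on disk, here is a rough sketch of the same line-by-line filter wrapped in a function and run against the sample text from the question. The names `collect_records`, `TITLE_RE`, `FIGURE_RE` and `SAMPLE` are made up for this illustration, and `TITLE_RE` is only my guess at the title part of the original regex, so treat it as an assumption and swap in whatever title pattern actually works for you.

```python
import re

# Hypothetical title pattern, adapted from the first part of the regex in the
# question. This is an assumption for illustration, not a tested title matcher.
TITLE_RE = re.compile(
    r'^(?:Supp[elmntary]*\s|list\sof\s)?.*Fig[ures]*.*(?:Captions?|Legends?|Lists?)',
    re.I,
)
# "fig" or "figure" at the start of a line, followed by a non-letter.
FIGURE_RE = re.compile(r'^(?:fig|figure)[^a-z]', re.I)


def collect_records(text, min_words=3):
    """Return one list per title found: [title, figure paragraph, ...]."""
    records = []
    current = None  # stays None until the first title has been seen
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # any number of blank lines may separate paragraphs
        if TITLE_RE.search(line):
            current = [line]
            records.append(current)
        elif current is not None:
            # Lines that fail either test are skipped; scanning continues.
            if len(line.split()) >= min_words and FIGURE_RE.search(line):
                current.append(line)
    return records


SAMPLE = """List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v
"""

for record in collect_records(SAMPLE):
    print("\n".join(record))
```

Because each line is tested on its own, the "no match" and "here" lines are simply skipped rather than ending the match, which is the behaviour that is hard to get out of a single re.findall pattern.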
