Stuck parsing statutes with a regular expression

Time: 2014-07-20 19:04:08

Tags: python regex

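I'm parsing a huge file of statutes, and I have a dedicated regular expression for the non-standard statutes, because they don't match the usual pattern. This is the regex I'm using: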

\n(\d*[A-Z]?-\d*[A-Z]?-\d*[\.\d]*[A-Z]?[-\d*[\.\d]*[A-Z]?]?)(?= (?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)\.\s*\n)(?:\s|\stt.*|\.)(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed).\s*\n(.*?)\n\d*[A-Z]?-\d*[A-Z]?-\d*[\.\d]*[A-Z]?[-\d*[\.\d]*[A-Z]?]?

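It works well, apart from a few problems.

  1. It doesn't work when two special cases occur in succession; for example:

    34A-1-28 Repealed.
         34A-1-28. Repealed by SL 1986, ch 295, § 7.

    34A-1-28 Repealed.
         34A-1-28. Repealed by SL 1986, ch 295, § 7.

  2. It doesn't work when the statute appears like this: 34A-6-88, Transferred. (comma after the statute)
  3. It doesn't work when a range is listed: 34A-6-88 to 23-34-1A Repealed.

Any help with these three problems would be greatly appreciated. For convenience, I have set up a regex101 containing a large portion of the statutes I want to tag.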

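2 answers:

Answer 0: (score: 1)

If you need a complex regular expression, you have to build it step by step. That is the only way to avoid getting lost.

Two notes before we start:

  • I'm not familiar with legal terminology. My terms may all be wrong.

  • I will use the verbose flag (re.VERBOSE). With this flag, you can freely insert whitespace into the regex to improve readability.

Let's start with the statute number and define a regex that parses a single component (for example, 34A or 83.1).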

import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'

Three to five of these components, separated by dashes, form a complete statute number.

statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {
    'nbr': nbr
}

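With this, we can easily define a regex that matches both single statutes and ranges. We use two groups to capture the statutes. The second group will be empty (None in Python) if no range is given.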

statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
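As a quick check of the two capture groups, here is a minimal sketch that compiles the pattern so far and tries it on a single statute and on a range. The definitions are repeated so the snippet runs on its own; the names `rx`, `single` and `ranged` are only for illustration.

```python
import re

# Building blocks, repeated from the answer so this runs standalone.
nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}

# re.VERBOSE makes the engine ignore the spaces inside the pattern.
rx = re.compile(statute_or_range, re.VERBOSE)

single = rx.match('34A-1-28 Repealed.')
ranged = rx.match('34A-6-88 to 23-34-1A Repealed.')

print(single.groups())  # ('34A-1-28', None)
print(ranged.groups())  # ('34A-6-88', '23-34-1A')
```

Note that the second group comes back as None rather than an empty string when there is no range, so test it with `is None` or plain truthiness.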

Now we can build a pattern that matches the entire first line. At this point, handling the comma that sometimes appears is easy.

action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'

first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' %{
    'statute_or_range': statute_or_range,
    'action': action
}
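To see whether the first-line pattern copes with the comma and range cases from the question, the sketch below runs it against a few heading lines taken from the sample statutes. The list `headings` and the name `rx` are made up for this illustration, and the last heading is shortened.

```python
import re

# Building blocks, repeated from the answer so this runs standalone.
nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'
first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action,
}

rx = re.compile(first_line, re.VERBOSE)

headings = [
    '34A-1-28 Repealed.\n',                  # plain special case
    '34A-6-88, Transferred.\n',              # comma after the statute
    '34A-6-88 to 23-34-1A Repealed.\n',      # range of statutes
    '34A-6-87.1 Disposal of tire waste.\n',  # ordinary statute (shortened)
]

# True where the heading is recognised as a special-case first line.
results = [bool(rx.match(h)) for h in headings]
print(results)  # [True, True, True, False]
```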

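It's not clear to me how much text you want to match. My impression is that you want to capture everything up to the start of the next statute, which is defined as a line beginning with a statute number. So: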

end = r'(?= \n %(statute)s )' % {
    'statute': statute
}

Combine these and you have your regex:

pattern = r'%(first_line)s (.*?) %(end)s' % {
    'first_line': first_line,
    'end': end
}

regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)

See it in action.
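For readers without the link, here is a self-contained sketch of the assembled pattern applied to a short excerpt built from the sample statutes (three special cases in a row, followed by an ordinary statute). The excerpt in `sample` and the variable `matches` are assumptions made for this illustration, not part of the original answer.

```python
import re

# All building blocks from the answer, repeated so this runs standalone.
nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'
first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action,
}
end = r'(?= \n %(statute)s )' % {'statute': statute}
pattern = r'%(first_line)s (.*?) %(end)s' % {'first_line': first_line, 'end': end}
regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)

# Shortened excerpt; the last heading only serves as the end marker.
sample = (
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
    "\n"
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-6-88 to 23-34-1A Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-6-89-34 Scale device required--Records--Report--Contents--Permit for longer capacity disposal.\n"
)

# Each tuple is (first statute, end of range or None, body text).
matches = [m.groups() for m in regex.finditer(sample)]
for first, last, body in matches:
    print(first, last, repr(body.strip()))
```

Because the end of each match is a lookahead, the next statute heading is not consumed, which is what lets consecutive special cases match one after another.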

Answer 1: (score: 1)

Sample text:

34A-6-87.1 Disposal of tire waste--Collection or processing sites--Penalties for violations.
     34A-6-87.1. Disposal of tire waste--Collection or processing sites--Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-6-89-34 Scale device required--Records--Report--Contents--Permit for longer capacity disposal.
     34A-6-89. Scale device required--Records--Report--Contents--Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

I assume you want to split that text into blocks delimited by the cited statutes.

If so, simplify your regex. You can do this:

'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))'

^ asserts position at the start of a line
1st Capturing group (\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))
\d+ matches one or more digits [0-9]
\w+ matches one or more word characters [a-zA-Z0-9_]
- matches the character - literally
\d+ matches one or more digits [0-9]
- matches the character - literally
\d+ matches one or more digits [0-9]
(?:[,.\-0-9A-Z]+)? Optional non-capturing group: one or more of , . - 0-9 A-Z
[ \t]+ matches one or more spaces or tabs
.*? matches any character, as few times as possible
(?=\n\n|\n+\Z|\Z) Positive Lookahead - asserts that one of the alternatives below can be matched
1st Alternative: \n\n (a blank line follows)
2nd Alternative: \n+\Z (only newlines remain before the end of the text)
3rd Alternative: \Z (end of the text)
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
s modifier: single line. Dot matches newline characters

Notes:

  1. Use the ^ anchor
  2. in combination with the flags re.S | re.M.
  3. The positive lookahead (?=\n\n|\n+\Z|\Z) lets the match run to the end of the block.
  4. Example in regex101
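To make the effect of the anchor, the flags and the lookahead concrete, here is a small sketch that splits a short excerpt of the sample text into blocks. The excerpt in `txt` and the helper list `first_lines` are assumptions for this illustration.

```python
import re

# Short excerpt of the sample text: three blocks separated by blank lines.
txt = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
)

# re.M lets ^ match at every line start; re.S lets . cross newlines,
# so one match can span the heading and its indented body.
pat = re.compile(
    r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))',
    re.S | re.M,
)

blocks = pat.findall(txt)  # one string per block (capture group 1)
first_lines = [b.splitlines()[0] for b in blocks]
print(first_lines)
```

The indented body lines never start a new block, because after ^ the pattern requires a digit rather than a space.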

    Once you have the individual blocks, you can parse them further to find the ones you want. As a simple example:

    import re

    # txt is assumed to hold the sample text shown above.
    statutes = {}
    pat = re.compile(r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))', re.S | re.M)
    for block in pat.finditer(txt):
        # Look for an action word on the first line of the block.
        m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block.group(1))
        if m:
            statutes.setdefault(m.group(1), []).append(block.group(1))
        else:
            # No action word on the first line: file the block under 'Enacted'.
            statutes.setdefault('Enacted', []).append(block.group(1))

    for status in sorted(statutes):
        print('{} ============\n{}\n'.format(status, '\n\n'.join(statutes[status])))
    

    This sorts the sample text by the status of each statute (Enacted, Repealed, Transferred, and so on).

    Like this:

    Enacted ============
    34A-6-87.1 Disposal of tire waste--Collection or processing sites--Penalties for violations.
         34A-6-87.1. Disposal of tire waste--Collection or processing sites--Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
    Source:
      SL 1998, ch 202, § 1.          Source:
    
    34A-6-89-34 Scale device required--Records--Report--Contents--Permit for longer capacity disposal.
         34A-6-89. Scale device required--Records--Report--Contents--Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
    Source:
      SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:
    
    Repealed ============
    34A-1-28 Repealed.
         34A-1-28. Repealed by SL 1986, ch 295, § 7.
    
    34A-1-28 Repealed.
         34A-1-28. Repealed by SL 1986, ch 295, § 7.
    
    Transferred ============
    34A-6-8-2A Transferred.
         34A-6-88. Transferred to § 46A-1-83.1.
    
    34A-6-8-2A Transferred.
         34A-6-88. Transferred to § 46A-1-83.1.
    
    34A-6-88, Transferred.
         34A-6-88. Transferred to § 46A-1-83.1.
    
    34A-6-88 to 23-34-1A Transferred.
         34A-6-88. Transferred to § 46A-1-83.1.
    

    As an example of how SIMPLE your regex can be: at least for the sample text, you can get the same result with Python's split method on \n\n:

    import re

    # txt is assumed to hold the sample text shown above.
    statutes = {}
    for block in txt.split('\n\n'):
        # Look for an action word on the first line of the block.
        m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block)
        if m:
            statutes.setdefault(m.group(1), []).append(block)
        else:
            statutes.setdefault('Enacted', []).append(block)
    # etc
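A runnable variant of the split approach might look like the sketch below. It wraps the loop in a hypothetical helper, `group_by_status`, uses `collections.defaultdict` in place of `setdefault`, and runs on a shortened excerpt of the sample text; none of these names come from the original answer.

```python
import re
from collections import defaultdict

ACTIONS = (r'Superseded|Repealed|Transferred|Obsolete|Reserved|'
           r'Rejected|Omitted|Not|Executed')


def group_by_status(txt):
    """Group blank-line-separated blocks by the action word on their first line."""
    statutes = defaultdict(list)
    for block in txt.strip().split('\n\n'):
        # Without re.S, '.' stops at the newline, so only line one is searched.
        m = re.search(r'^.*(%s)' % ACTIONS, block)
        statutes[m.group(1) if m else 'Enacted'].append(block)
    return dict(statutes)


# Shortened excerpt of the sample text (bodies trimmed for brevity).
txt = (
    "34A-6-89-34 Scale device required--Records--Report--Contents--Permit for longer capacity disposal.\n"
    "\n"
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-6-88 to 23-34-1A Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
)

grouped = group_by_status(txt)
for status in sorted(grouped):
    print('{} ============\n{}\n'.format(status, '\n\n'.join(grouped[status])))
```

This only works while every statute is separated by exactly one blank line and no body contains a blank line of its own; for messier input the regex-based split is the safer choice.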