Unable to retrieve all groups using a regular expression in Python

Time: 2016-06-30 21:03:37

Tags: python regex python-3.x python-3.5

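I am new to Python and wrote a regular expression. I can match the complete pattern, but I cannot retrieve all of the captured matches, as shown below. Can someone help me with the following?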

Regular expression:

(?i)(?:sections?|\&\#xA7\;|Treas\.\s*Reg\.)\s*((\d+\.\d+\-\d+(?:\((?:[a-zA-Z]|[0-9]+|[i v x]+)\))*)(?:\s*(?:and|\,|\,\s*and)\s*)*)+

Content to match in the input:

1. sections 1.1441-1(e)(4)(iv)(C) and 1.1471-3(c)(6)(iv), 1.576-4(a)(9) and 1.32-12(h)(l)
2. sections 1.1441-1(e)(12)(i)(23) and 1.11-3(3)(4)(i) , 1.67-9(k)(10) and 1.78-8

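Input:

Q11. Has a Form W-8 that has been completed and signed by a payee, scanned into an image or portable document format (PDF), and uploaded to a third-party repository been scanned and received electronically by a withholding agent for purposes of sections 1.1441-1(e)(4)(iv)(C) and 1.1471-3(c)(6)(iv), 1.576-4(a)(9) and 1.32-12(h)(l) if the payee, upon request from the withholding agent for a Form W-8 to document its sections 1.1441-1(e)(12)(i)(23) and 1.11-3(3)(4)(i) , 1.67-9(k)(10) and 1.78-8 status for purposes of chapters 3 and 4, sends the withholding agent an email with a link to the third-party repository site that allows the withholding agent to download the image or PDF of the form that is stored on the repository for such purpose (or the payee otherwise authorizes the withholding agent to access the specific form from the third-party repository in a similar manner).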

My output:

['1.32-12(h)(l)','1.78-8']

Expected output:

['1.1441-1(e)(4)(iv)(C)', '1.1471-3(c)(6)(iv)', '1.576-4(a)(9)', '1.32-12(h)(l)', '1.1441-1(e)(12)(i)(23)', '1.11-3(3)(4)(i)', '1.67-9(k)(10)', '1.78-8']

Code:

import re
class Regex:

    def __init__(self,inputtext,regex):
        self.regex=regex
        self.inputtext=inputtext
    def blowCiteQuery(self,inputtext,regex):
        mo=regex.findall(inputtext)
        print(mo)



if __name__ == "__main__":

        text="""Q11. Has a Form W--8 that has been completed and signed by a payee, scanned
into an image or portable document format (PDF), and uploaded to a third--party
repository been scanned and received electronically by a withholding agent for
purposes of sections 1.1441-1(e)(4)(iv)(C) and 1.1471-3(c)(6)(iv), 1.576-4(a)(9) and 1.32-12(h)(l) if the payee, upon
request from the withholding agent for a Form W--8 to document its sections 1.1441-1(e)(12)(i)(23) and 1.11-3(3)(4)(i) , 1.67-9(k)(10) and 1.78-8 status for
purposes of chapters 3 and 4, sends the withholding agent an email with a link
to the third--party repository site that allows the withholding agent to
download the image or PDF of the form that is stored on the repository for
such purpose (or the payee otherwise authorizes the withholding agent to
access the specific form from the third--party repository in a similar manner)."""

        regex=re.compile(r"(?i)(?:sections?|\&\#xA7\;|Treas\.\s*Reg\.)\s*((\d+\.\d+\-\d+(?:\((?:[a-zA-Z]|[0-9]+|[i v x]+)\))*)(?:\s*(?:and|\,|\,\s*and)\s*)*)+")
        treglinkval=Regex(text,regex)
        treglinkval.blowCiteQuery(text,regex)

2 answers:

Answer 0: (score: 1)

You could go with:

sections\ 
(?P<section1>[-.\w()]+)
\ and\ 
(?P<section2>[-.\w()]+)
(?P<optional>
    (?:,\ [-.\w()]+)+
)?

Afterwards, use re.finditer(); see a demo on regex101.com. This captures the comma-separated sections in the group "optional"; you need to split that group on commas programmatically.

import re
rx = re.compile(r"""
sections\ 
(?P<section1>[-.\w()]+)
\ and\ 
(?P<section2>[-.\w()]+)
(?P<optional>
    (?:,\ [-.\w()]+)+
)?""", re.VERBOSE)

sections = [(m.group('section1'), m.group('section2')) for m in rx.finditer(your_text_here)]
print(sections)
# [('1.1441-1(e)(4)(iv)(C)', '1.1471-3(c)(6)(iv)'), ('1.1441-1(e)(12)(i)(23)', '1.11-3(3)(4)(i)')]

The full demo can be found on ideone.com
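The splitting step mentioned above could look roughly like the following. This is only a sketch: the helper name `sections_per_match` and the sample string are made up for illustration, and `[ ]` is used in place of the escaped space so the pattern does not depend on trailing whitespace.

```python
import re

# Same idea as the pattern above; "[ ]" stands in for the escaped space
# so the pattern survives editors that strip trailing whitespace.
rx = re.compile(r"""
    sections[ ]
    (?P<section1>[-.\w()]+)
    [ ]and[ ]
    (?P<section2>[-.\w()]+)
    (?P<optional>
        (?:,[ ][-.\w()]+)+
    )?""", re.VERBOSE)


def sections_per_match(text):
    """Return one list of section numbers per match (hypothetical helper)."""
    found = []
    for m in rx.finditer(text):
        items = [m.group("section1"), m.group("section2")]
        # The group is None when there is no comma-separated tail.
        optional = m.group("optional") or ""
        # The tail looks like ", a, b"; split on commas and drop empty pieces.
        items.extend(p.strip() for p in optional.split(",") if p.strip())
        found.append(items)
    return found


# Shortened sample taken from the question's input.
sample = ("for purposes of sections 1.1441-1(e)(4)(iv)(C) and "
          "1.1471-3(c)(6)(iv), 1.576-4(a)(9) and 1.32-12(h)(l) if the payee")
result = sections_per_match(sample)
print(result)
```

Note that this pattern stops at the second "and", so the trailing 1.32-12(h)(l) in the sample is not picked up; the pattern would need extending to cover that case.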

Answer 1: (score: 0)

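Remove the (?i) and the regex works for your example... You are also using non-capturing groups, and the regex engine does not give you a separate group for each repetition of the same capturing group; it keeps only the last one, which is why the output only shows two copies of the same thing.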
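The "keeps only the last one" behaviour is easy to see on a tiny input. A minimal sketch with a made-up string (not the question's data):

```python
import re

# A capturing group inside a repetition is overwritten on every iteration,
# so only the text from the final iteration is left after the match.
m = re.match(r"(?:(\d+),?)+", "10,20,30")
last_only = m.group(1)

# Matching the repeated element on its own with findall returns every item.
all_items = re.findall(r"\d+", "10,20,30")

print(last_only)   # 30
print(all_items)   # ['10', '20', '30']
```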

Personally, I would just grab the whole match and do something like:

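def extract_pairs(matches):  # hypothetical wrapper so that "return" is valid
    # matches: full-match strings of the form "sections A and B"
    result = []
    for match in matches:
        pieces = match.split(' ')
        # pieces[0] is "sections" and pieces[2] is "and"
        result.append((pieces[1], pieces[3]))
    return result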

which gives

('1.1441-1(e)(12)(i)(23)', '1.11-3(3)(4)(i)') for each match.
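If the goal is the flat list from the question, one possible way to extend the "grab the whole match" idea is a two-step search: first match each whole citation run, then run findall for single section numbers inside it. This is a sketch, not the asker's or either answerer's code; the names `SECTION`, `SEPARATOR`, `CITATION_RE`, `SECTION_RE`, and `extract_sections` are invented here, and the patterns are adapted from the question's regex (with the stray spaces in `[i v x]` dropped, which is an assumption about the intent).

```python
import re

# One section number such as 1.1441-1(e)(4)(iv)(C); no capturing groups,
# so findall returns the whole match.
SECTION = r"\d+\.\d+-\d+(?:\((?:[a-z]|[0-9]+|[ivx]+)\))*"
# Separators allowed between section numbers: ", and", "and", or ",".
SEPARATOR = r"\s*(?:,\s*and|and|,)\s*"

# Step 1 pattern: a keyword followed by one or more section numbers.
CITATION_RE = re.compile(
    r"(?:sections?|&#xA7;|Treas\.\s*Reg\.)\s*"
    + "(?:" + SECTION + "(?:" + SEPARATOR + ")*)+",
    re.IGNORECASE,
)
# Step 2 pattern: a single section number inside a citation run.
SECTION_RE = re.compile(SECTION, re.IGNORECASE)


def extract_sections(text):
    """Return every section number found after a citation keyword."""
    sections = []
    for citation in CITATION_RE.finditer(text):
        sections.extend(SECTION_RE.findall(citation.group(0)))
    return sections


# Shortened sample taken from the question's input.
sample = (
    "for purposes of sections 1.1441-1(e)(4)(iv)(C) and 1.1471-3(c)(6)(iv), "
    "1.576-4(a)(9) and 1.32-12(h)(l) if the payee, upon request from the "
    "withholding agent for a Form W-8 to document its sections "
    "1.1441-1(e)(12)(i)(23) and 1.11-3(3)(4)(i) , 1.67-9(k)(10) and 1.78-8 "
    "status for purposes of chapters 3 and 4"
)
result = extract_sections(sample)
print(result)
```

On this sample the function returns the eight section numbers listed under "Expected output" in the question, in order. Keeping the keyword check in the first step means numbers that merely look like sections elsewhere in the text are not picked up.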