Question

I'm trying to write a function that returns all citations (CITS) from a text. Sometimes the text arrives as a list, so I check for that first.

import re

def get_cits_from_note(note):
    if note:
        if isinstance(note, list):
            note = "".join(note)
        matchGroups = re.findall(r'\|CITS\s*:*\s*\[\s*(\d+)', note)
        if matchGroups:
            citsList = [match for match in matchGroups]
            print(citsList)

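The text looks like this (I copied and pasted it from Wikipedia, which is why it makes no sense):

A bracket is a tall punctuation mark typically used in matched pairs within text, |CITS: [123], [456], [789]| to set apart or interject other text. The matched pair is best described as opening and |CITS: [999]|. Less formally, in a left-to-right context, it may be described as left and right, and in a right-to-left context, as right and left.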

This is the first regex I built:

matchGroups = re.findall(r'\|CITS\s*:*\s*\[\s*(\d+)', note)

But it only prints:

[u'123']

So I wrote a second regex:

matchGroups = re.findall(r'\|CITS\s*:*\s*((\[\s*(\d+)]+,*\s*)+)\|', note)

But it doesn't work the way I want either, because it prints:

[(u'[123], [456], [789]', u'[789]', u'789'), (u'[999]', u'[999]', u'999')]

I've been struggling with this regex and can't get it to work. Can someone tell me what I'm missing?

The final output should be:

[u'123',u'456',u'789',u'999']

Answer 1

import re
note = "A bracket is a tall punctuation mark typically used in matched pairs within text, |CITS: [123],[456],[789]| to set apart or interject other text. The matched pair is best described as opening and |CITS: [999]|. Less formally, in a left-to-right context, it may be described as left and right, and in a right-to-left context, as right and left."
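# Note: r'\d+' matches every run of digits in the text, not only those inside |CITS: ...| markers.
matchGroups = re.findall(r'\d+', note)
print(matchGroups)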

Output:

['123', '456', '789', '999']
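
One caveat: `\d+` picks up any number in the text, whether or not it sits inside a |CITS: ...| marker. A minimal sketch of the difference, using a made-up sample string (the year and the bracketed pattern are my own illustration, not part of the question):

```python
import re

# Hypothetical sample text with a number outside any CITS marker.
sample = "In 1998 the term was revised |CITS: [123],[456]| and kept."

# Plain \d+ returns the year as well as the citation ids.
all_digits = re.findall(r'\d+', sample)

# Requiring the surrounding square brackets skips the bare year.
bracketed = re.findall(r'\[\s*(\d+)\s*\]', sample)

print(all_digits)
print(bracketed)
```

The bracketed pattern would still match a stray "[5]" elsewhere in the text, so it is only safe if square-bracketed numbers appear nowhere except in citations.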

Answer 2

This isn't purely a regex solution, but if I understand your goal, it can be done like this:

raw_list = [x.strip().split(',')
            for x in re.findall(r'\|CITS\s*:([\[\]\d\s,]+)', note)]
# Flatten the nested lists and strip the brackets and spaces around each number.
flatten = lambda l: [item.strip(' []') for sublist in l for item in sublist]
cits = flatten(raw_list)

However, this will also match nonsensical occurrences such as "|CITS: [[1,7[,,".
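
One possible way to avoid such false matches is to work in two passes: first find each complete marker, then pull the numbers out of it. The sketch below is my own variation, not a tested drop-in. It assumes every marker is closed by a `|` and contains only well-formed `[digits]` items separated by commas. The function name is reused from the question; `CITS_BLOCK` and `CITS_ITEM` are names I made up.

```python
import re

# Matches one whole marker such as "|CITS: [123], [456]|" and captures the
# comma-separated bracket list between the colon and the closing pipe.
CITS_BLOCK = re.compile(r'\|\s*CITS\s*:*\s*((?:\[\s*\d+\s*\]\s*,?\s*)+)\|')
# Matches a single "[digits]" item inside a captured list.
CITS_ITEM = re.compile(r'\[\s*(\d+)\s*\]')

def get_cits_from_note(note):
    """Return every citation id found in |CITS: ...| markers, in order."""
    if not note:
        return []
    if isinstance(note, list):
        note = "".join(note)
    cits = []
    for block in CITS_BLOCK.findall(note):
        # Second pass: extract each number from the captured list.
        cits.extend(CITS_ITEM.findall(block))
    return cits

note = "used in text, |CITS: [123], [456], [789]| and closing |CITS: [999]|."
print(get_cits_from_note(note))
```

Splitting the work this way sidesteps the problem in the question's second regex: a repeated capturing group only keeps its last repetition, so a single `findall` cannot return every number from one marker.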

Python regex to match a list of numbers inside square brackets

2 answers: