Question

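I don't use regular expressions often, so this may be a silly or obvious question, but I haven't really found an answer.

I'm trying to match a specific pattern in a string that looks like this: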

Probe Set ID,Gene Title,Gene Symbol,Chromosomal Location,Entrez Gene,Fold difference
,206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6
,221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,
203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4,212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5,209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9

From this text (it's some biology data), I want to extract a pattern like this:

221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6

As you can see, the first two records are separated by newlines, but the rest are not. So when I run this:

l = re.findall(r'(\d+[_]..+[,]\d+[\.]\d+[,])',string)

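I only get the records that are separated by newlines, not the ones that share a line, even though as far as I can tell the pattern should work on those too.

What's going wrong here?

I'm using Python 3.x, btw.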

Answer 1

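You can use this regex:

,?(\d+_.*?,\d+\.\d+),?

,? optionally matches a comma.
(\d+_.*?,\d+\.\d+) is the capturing group. It matches one or more digits, an underscore _, any characters (lazily), a comma ,, more digits, a period ., and more digits.
,? optionally matches a comma.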
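If it helps to see the pieces side by side, here is a minimal sketch of the same pattern written with re.VERBOSE so each part carries its own comment. The names RECORD, sample and records are made up for this illustration, and the sample is a shortened piece of the question's data.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each piece can be commented.
RECORD = re.compile(r"""
    ,?              # optional leading comma (not captured)
    (               # start of the capturing group
        \d+_        # one or more digits followed by an underscore
        .*?         # as few characters as possible (lazy)
        ,\d+\.\d+   # a comma, digits, a literal dot, digits
    )               # end of the capturing group
    ,?              # optional trailing comma (not captured)
""", re.VERBOSE)

# Two records that share one line, taken from the question's data.
sample = (
    "203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,"
    "35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7"
)

records = RECORD.findall(sample)
for record in records:
    print(record)
```

With a single capturing group, findall returns only the group's text, so the optional commas on either side are dropped from the results.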

You can test the regex live here.

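The problem with your regex is the greediness of the quantifier used inside the capturing group. With .+, the engine tries to match as much as it can. Since . does not match a newline by default, a greedy match stops at the end of the line, which is why the newline-separated records came out right while the records sharing one line were swallowed into a single match. Use the lazy quantifier .*? so that the regex matches as little as possible.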
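To make the difference concrete, here is a small sketch that runs the question's greedy pattern and the lazy pattern on a shortened piece of the data. The variable names text, greedy and lazy are just for this example.

```python
import re

# Two records on the first line, one record alone on the second line
# (shortened from the question's data).
text = (
    "203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,"
    "35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,\n"
    "201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,"
)

# The pattern from the question: greedy .+ runs to the end of the line and
# backtracks only as far as the last ",digits.digits," on that line.
greedy = re.findall(r'(\d+[_]..+[,]\d+[\.]\d+[,])', text)

# The lazy pattern stops at the first ",digits.digits" it can reach.
lazy = re.findall(r',?(\d+_.*?,\d+\.\d+),?', text)

print(len(greedy))  # 2: the first match contains both records of line one
print(len(lazy))    # 3: one match per record
```

The greedy version never crosses the newline, so a record on its own line still looks fine; the trouble only shows up when several records share a line.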

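Also, note that wrapping a single character such as a comma or an underscore in a character class is redundant; just match the character itself.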
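As a quick check of that point, the sketch below compares a pattern that uses single-character classes with the same pattern written with the bare characters. The sample line is taken from the question; the variable names are illustrative only.

```python
import re

sample = "221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,"

# [_], [,] and [\.] each contain exactly one character, so they behave
# the same as _, , and \. written directly.
with_classes = re.findall(r'\d+[_].+?[,]\d+[\.]\d+[,]', sample)
without_classes = re.findall(r'\d+_.+?,\d+\.\d+,', sample)

print(with_classes == without_classes)  # True
```

The dot still has to be escaped outside a class (\.), because an unescaped . means "any character".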

Python snippet:

>>> import re
>>> text = """Probe Set ID,Gene Title,Gene Symbol,Chromosomal Location,Entrez Gene,Fold difference
,206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6
,221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,
203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4,212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5,209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9"""

>>> re.findall(r',?(\d+_.*?,\d+\.\d+),?', text)

['206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6', '221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6', '203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0', '35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7', '221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4', '212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5', '209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5', '201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1', '221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9']

Answer 2

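As an extension to pkpkpk's answer, I'd add that you can improve performance by compiling the pattern (at least if you call findall or something similar several times), and that you can use several flags at once by joining them with the pipe symbol |.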

import re
dir(re)

returns

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_alphanum_bytes', '_alphanum_str', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pattern_type', '_pickle', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']

rec_something = re.compile(r'…', re.DOTALL|re.IGNORECASE|re.MULTILINE)  
rec_something.findall(input_str)
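For a concrete version of the idea, here is a rough sketch. It assumes the pattern from Answer 1 in place of the elided one, and the names rec_record, rec_header, chunks and text are hypothetical. Note that the re module also caches recently used patterns internally, so the gain from compiling is often modest; the main benefit is a reusable, named pattern object.

```python
import re

# Compile once, reuse many times. The pattern is the one from Answer 1,
# used here as an assumption since the pattern above is elided.
rec_record = re.compile(r',?(\d+_.*?,\d+\.\d+),?')

chunks = [
    ",221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,",
    "201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,"
    "221872_at,retinoic acid receptor responder(tazarotene induced) 1,"
    "RARRES1,3q25.32,5918 Entrez gene,2.9",
]
records = [m for chunk in chunks for m in rec_record.findall(chunk)]

# Flags are combined with |. IGNORECASE makes the lowercase pattern match
# "Probe Set ID"; MULTILINE lets ^ match at the start of every line,
# not only at the start of the whole string.
rec_header = re.compile(r'^probe set id', re.IGNORECASE | re.MULTILINE)
text = "some preamble\nProbe Set ID,Gene Title\n"
header_match = rec_header.search(text)

print(len(records))
print(header_match.group(0))
```

Without re.MULTILINE the header search would return None here, because the header is not on the first line of text.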
