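I have a list of IDs, each with a long description whose terms are separated by semicolons. Here is an example of one ID and its description.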
ID Description
O95831 activation of cysteine-type endopeptidase activity involved in apoptotic process; apoptotic DNA fragmentation; apoptotic process; cell redox homeostasis; chromosome condensation; DNA catabolic process; intrinsic apoptotic signaling pathway in response to endoplasmic reticulum stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity; neuron apoptotic process; neuron differentiation; oxidoreductase activity, acting on NAD(P)H; positive regulation of apoptotic process; regulation of apoptotic DNA fragmentation
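Question: I am looking for a text mining approach that finds the terms in which "mitochondria", "mitochondrial" or "mitochondrion" is mentioned. Are regular expressions useful for this problem, or is there another method that might help?
Expected result: extract the phrases that mention "mitochondria"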
O95831 ;mitochondrial respiratory chain complex I assembly;
Thanks for your help,
Answer 0 (score: 1)
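You can use a regex like
(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)
Capture group 1, (\w+), takes the ID and capture group 2 takes the matching term, so the two groups will contain
O95831 ; mitochondrial respiratory chain complex I assembly;
Example: http://regex101.com/r/mR8xA7/1
The Python code would look like this, with line holding the record shown in the question:
>>> import re
>>> re.findall(r"""(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)""", line)
[('O95831', '; mitochondrial respiratory chain complex I assembly;')]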
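As for other methods: instead of one large regular expression, each line can be split into its ID and its semicolon-separated terms, keeping only the terms that mention one of the three words. The following is a minimal sketch and not part of the answer above. The function name extract_terms and the sample records are made up for illustration (the second ID, X00000, is a placeholder and not a real accession), and it assumes the ID is the first token on each line, followed by a single space. Unlike the greedy .* pattern above, which returns only the last matching term on a line, this returns every matching term.

```python
import re

# Matches mitochondria, mitochondrial or mitochondrion as a whole word, any case.
MITO = re.compile(r"\bmitochondri(?:a|al|on)\b", re.IGNORECASE)

def extract_terms(line, pattern=MITO):
    """Return (id, [matching terms]) for one 'ID term; term; ...' line."""
    # Assumption: the ID is everything before the first space.
    record_id, _, description = line.strip().partition(" ")
    terms = [t.strip() for t in description.split(";")]
    return record_id, [t for t in terms if pattern.search(t)]

# Hypothetical sample data; the first record is shortened from the question.
lines = [
    "O95831 apoptotic process; mitochondrial respiratory chain complex I assembly; neuron differentiation",
    "X00000 mitochondrion organization; DNA repair; regulation of mitochondrial fission",
]

results = [extract_terms(line) for line in lines]
for record_id, terms in results:
    print(record_id, ";".join(terms))
```

Splitting first keeps the pattern short, and other keywords can be searched by passing a different compiled pattern as the second argument.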