Python regex to omit complex citation patterns in text

Date: 2019-01-08 16:19:00

Tags: python regex

I have read the contents of a file into Python and want to get rid of all citations, which follow the same general format:

(Author et al., .............. \nGoogle Scholar) # there could be many '\nGoogle Scholar's within the brackets
  

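Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez et al., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni et al., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui et al., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence in situ hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova et al., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5 min to at least 3 hr.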

Desired output

  

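Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5 min to at least 3 hr.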

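I could not find a single regex that removes all of the citations at once, so I had to do it in two parts:

  1. Find the position of every occurrence of '\nGoogle Scholar)'.
  2. From each position, extend backwards until the corresponding opening bracket is reached, then omit the characters between these indices.

I tried the following: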

import re

def remove(test_str):
    # End of a citation: a newline followed by 'Google Scholar)'
    regex = re.compile(r'\nGoogle Scholar\)')
    starts = []
    ends = []
    ret = ''
    for m in regex.finditer(test_str):   # find all 'Google Scholar)'
        ends.append(m.end())
    for e in ends:                       # find all starting brackets
        i = e
        while True:
            # '(' followed by a non-digit is taken as the start of a citation
            if re.match(r'\(\D+', test_str[i-2:i]):
                starts.append(i-2)
                break
            else:
                i -= 1
    start = test_str[:starts[0]]         # omit all characters in between
    starts = starts[1:]
    end = test_str[ends[-1]:]
    ends = ends[:-1]
    for i, j in zip(starts, ends):
        ret = ret + test_str[j:i]
    return start + ret + end

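But this strategy fails because the regex I use to find each opening bracket (\(\D+) is not precise enough: citations often contain enclosed brackets of their own, e.g.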

  

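(Cui et al., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence in situ hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar)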

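In this case, the backward search stops prematurely at the inner bracket instead of finding the correct opening bracket.

Can anyone recommend a good way to remove all of the citations?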

3 answers:

Answer 0 (score: 1)

Based on the pattern you describe, you can use this regex,

(?s)\(.*?Google Scholar\) ?

and replace each match with an empty string. The (?s) here enables . to match newlines as well.

Here is a Python code demo,

import re

s = 'Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.'

replacedStr = re.sub(r'(?s)\(.*?Google Scholar\) ?','',s)
print(replacedStr)

This prints the following, as you mentioned in your post.

  

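Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.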
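As a small illustration of what the (?s) flag changes in the pattern above, the sketch below runs the same substitution with and without it. The `sample` string is a shortened, made-up stand-in for the real text, and the variable names are only illustrative.

```python
import re

# Shortened, made-up sample; the real text is much longer.
sample = "responses (Smith et al., 2017 Title.Crossref\nPubMed\nGoogle Scholar). Next sentence."

pattern = r"\(.*?Google Scholar\) ?"

# Without DOTALL, '.' cannot cross the newlines inside the citation,
# so nothing matches and the text comes back unchanged.
without_dotall = re.sub(pattern, "", sample)

# With (?s), which is equivalent to flags=re.DOTALL, '.' also matches '\n'.
with_dotall = re.sub("(?s)" + pattern, "", sample)

print(without_dotall == sample)
print(repr(with_dotall))
```

In this sketch the space that preceded the opening parenthesis is left behind, so the second result reads 'responses . Next sentence.'. Because the lazy `.*?` starts at the first opening parenthesis it finds, an ordinary parenthetical that appears before a citation would likely be swallowed together with it.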

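Answer 1 (score: 0)

I would solve it in the following way, which matches your desired output to the letter and can handle parentheses in the text (ones that are not citations):

  1. Look for the opening \(

  2. Look for repetitions of [^()]+(?:\([^()]+\))?, which is one or more non-parenthesis characters followed by an optional ( ) pair containing one or more non-parenthesis characters.

  3. Look for the ending \nGoogle Scholar\)

  4. Split on whitespace and re-join to remove multiple spaces

Code: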

import re
text = 'Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.'
fixed_text = ' '.join(re.sub(r'\((?:[^()]+(?:\([^()]+\))?)+\nGoogle Scholar\)', '', text).split())
print(fixed_text)

Output:

  

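Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.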

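This could be improved by changing the code to the following, which also removes the space before the leading \(, but then the result no longer matches the desired output (which is itself flawed):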

fixed_text = re.sub(r' ?\((?:[^()]+(?:\([^()]+\))?)+\nGoogle Scholar\)', '', text)
  

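Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses. Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes. We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.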
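To make the pieces of this pattern easier to see, here is the same regex sketched with re.VERBOSE and one comment per part. The names `CITATION` and `sample` are illustrative, and the sample is a shortened, made-up string rather than the real text.

```python
import re

# Same pattern as above, spread out with re.VERBOSE.
# In verbose mode a literal space has to be written as [ ].
CITATION = re.compile(
    r"""
    [ ]?                     # optional space before the citation
    \(                       # opening parenthesis of the citation
    (?:
        [^()]+               # a run of non-parenthesis characters
        (?:\([^()]+\))?      # optionally followed by one inner pair of parentheses
    )+
    \nGoogle[ ]Scholar\)     # literal end marker
    """,
    re.VERBOSE,
)

# Made-up sample: one citation with inner parentheses, one ordinary parenthetical.
sample = (
    "dyes (Cui et al., 2018 in situ hybridization (fliFISH) for counting."
    "Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We used smFISH (see text) here."
)

result = CITATION.sub("", sample)
print(result)
```

Here the citation is removed while "(see text)" survives, because that group does not end in the marker. The pattern only allows one level of inner parentheses, and nested quantifiers of this shape can backtrack heavily on a long parenthetical that does not end in the marker, so it is probably worth timing on real input.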

Answer 2 (score: 0)

import re

if __name__ == '__main__':
    source = """Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr."""
    # Raw string avoids invalid-escape warnings; the pattern itself is unchanged
    output = re.sub(r' \(.*? etal\., .*?\nGoogle Scholar\)', '', source, flags=re.DOTALL)

    print(output)

Output

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses. Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes. We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.
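If citations can contain parentheses nested more deeply than a regex comfortably handles, a non-regex variant of the question's two-step idea may be easier to reason about. The sketch below tracks nesting depth and drops every outermost group that ends with the marker. The function name `strip_citations` and its `marker` parameter are hypothetical, and it assumes the parentheses in the text are balanced.

```python
def strip_citations(text, marker="\nGoogle Scholar)"):
    """Drop each outermost (...) group whose text ends with `marker`."""
    kept = []        # chunks of text to keep
    depth = 0        # current parenthesis nesting depth
    group_start = 0  # index of the '(' that opened the current outermost group
    last = 0         # everything before this index has already been handled
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                group_start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            # Only a group that closes at depth 0 and ends with the marker is a citation
            if depth == 0 and text.endswith(marker, group_start, i + 1):
                # Keep the text before the citation, minus the space before '('
                kept.append(text[last:group_start].rstrip(" "))
                last = i + 1
    kept.append(text[last:])
    return "".join(kept)


# Made-up sample with two levels of nesting inside the citation.
sample = (
    "cells (Cui et al., 2018 hybridization (fliFISH (v2)) in cells."
    "Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We used smFISH (see text) by hand."
)
cleaned = strip_citations(sample)
print(cleaned)
```

Because the opening parenthesis is found by counting depth rather than by guessing from the character that follows it, inner groups such as "(fliFISH (v2))" or "(2)" do not end the search early, and ordinary parentheticals such as "(see text)" are left alone.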