Extracting strings with a regular expression

Date: 2019-09-20 00:42:49

Tags: python python-regex

I have the following strings:

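  1. LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]
  2. PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
  3. XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
  4. RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
  5. hypothetical protein VITISV_035070 [Vitis vinifera]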

How can I extract the following strings from the strings above?

  1. cysteine proteinase 5-like
  2. uncharacterized protein LOC107059219
  3. peroxidase 40-like
  4. Retrovirus-related Pol polyprotein from transposon TNT 1-94
  5. hypothetical protein VITISV_035070

Thanks in advance.

2 answers:

Answer 0 (score: 0)


s = '''LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]'''

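import re

# Raw string so that \s, \w and \[ reach the regex engine unchanged.
# Group 1: optional colon; group 2: the description; group 3: [species].
rgx = r'(:?)\s([\w\s-]+)\s(\[.+\])'

# Keep only group 2 of each match; strip() drops the extra space left by
# the double space before the bracket in the first line.
list1 = [m[1].strip() for m in re.findall(rgx, s)]

print(list1)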

See https://regex101.com/r/HATKMa/1 for a detailed explanation.
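
For readers who cannot open the link, the same idea can be sketched with a commented, verbose pattern applied one line at a time. This is only an illustrative variant, not the pattern used in the answer above: the names `PATTERN` and `extract_with_regex` are hypothetical, and the sketch assumes that accessions look like `XP_019244624.1` and that labels such as `PREDICTED:` are written in upper case.

```python
import re

# Verbose pattern (assumed line layout: [accession] [LABEL: ...] description [species])
PATTERN = re.compile(r"""
    ^(?:\S+\.\d+\s+)?      # optional accession such as XP_019244624.1
    (?:[A-Z ]+:\s+)*       # zero or more upper-case labels such as "PREDICTED: "
    (?P<desc>.+?)          # the description we want (lazy match)
    \s*\[[^\]]+\]\s*$      # trailing "[Genus species]"
""", re.VERBOSE)


def extract_with_regex(line):
    """Return the description part of one annotation line, or None."""
    m = PATTERN.match(line.strip())
    return m.group("desc") if m else None


lines = [
    "LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]",
    "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]",
    "XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]",
    "RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]",
    "hypothetical protein VITISV_035070 [Vitis vinifera]",
]

for line in lines:
    print(extract_with_regex(line))
```

Matching each line separately keeps `\s` from running across line breaks, and the named group avoids indexing into tuples.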

Answer 1 (score: 0)

I don't think this problem needs a regular expression. I prefer a solution based on plain string methods, because it is easy to understand.
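
A regex-free approach along these lines might look like the sketch below. It is not the answerer's original code. The function name `extract_without_regex` is hypothetical, and the rule for spotting an accession (a first token that ends in a dot followed by digits, such as `RVW92024.1`) is an assumption based on the five sample strings.

```python
def extract_without_regex(line):
    """Return the description part of one annotation line using str methods only."""
    # Drop the trailing "[Genus species]" part.
    text = line.split("[", 1)[0].strip()
    # Keep only what follows the last ": " label, if there is one.
    text = text.rsplit(": ", 1)[-1]
    # Drop a leading accession such as "RVW92024.1" (assumed format:
    # first token ends with a dot followed by digits).
    first, _, rest = text.partition(" ")
    if rest and "." in first and first.rsplit(".", 1)[-1].isdigit():
        text = rest
    return text.strip()


lines = [
    "LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]",
    "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]",
    "XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]",
    "RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]",
    "hypothetical protein VITISV_035070 [Vitis vinifera]",
]

for line in lines:
    print(extract_without_regex(line))
```

Each step removes one optional part of the line (species, labels, accession), so the function is easy to adjust if the input format differs from these samples.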