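Extract the unique parts of strings enclosed in square brackets, in order

Time: 2018-08-01 06:29:33

Tags: python

I want to extract the data enclosed in square brackets and write it to another text file.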

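My text file is:

RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]
PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]
PYH31415.1 phenol monooxygenase [Aspergillus neoniger CBS 115656]
PUB86175.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86141.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86139.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79626.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79624.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72973.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72971.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PWY90296.1 phenol monooxygenase [Aspergillus sclerotioniger CBS 115572]
PWY63616.1 phenol monooxygenase [Aspergillus eucalypticola CBS 122712]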

I tried this program:

infile = open('out3.txt', 'r')
outfile = open('out5.txt', 'w')
for l in infile:
    outfile.write(l.split()[-1] + '\n')
infile.close()
outfile.close()

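But it does not work.

3 answers:

Answer 0: (score: 0)

You want to use a regular expression in your program. Regular expressions are very useful for extracting text. For example: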

   import re

   s = "alphaCustomer bla bla bla [dataFindMe] bla bla bla"
   # \[ and \] match literal brackets; (.+?) captures the text between them
   m = re.search(r"\[(.+?)\]", s)
   print(m.group(1))

Output

   dataFindMe
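
Applied to the lines of the file from the question, the same idea could look like the sketch below. The helper name extract_bracketed and the sample list are hypothetical, introduced only for illustration; the pattern is non-greedy, so it stops at the first closing bracket, and lines without brackets are skipped.

```python
import re

# Non-greedy pattern: capture the text between '[' and the nearest ']'.
BRACKETED = re.compile(r"\[(.+?)\]")


def extract_bracketed(lines):
    """Return the bracketed text of every line that contains one (hypothetical helper)."""
    results = []
    for line in lines:
        m = BRACKETED.search(line)
        if m:  # skip lines that have no [...] part
            results.append(m.group(1))
    return results


# Sample lines copied from the question's data, plus one line without brackets.
sample = [
    "RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]\n",
    "PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]\n",
    "a line without brackets\n",
]

for name in extract_bracketed(sample):
    print(name)
```

To process a real file, the open file object can be passed in place of the sample list, since iterating over it yields one line at a time.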

Answer 1: (score: 0)

This should do exactly what you want:

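infile = open('out3.txt', 'r')
outfile = open('out5.txt', 'w')

for line in infile:
    # Keep the text between the first '[' and the last ']';
    # this also works when the last line has no trailing newline
    line = line[line.find('[') + 1:line.rfind(']')] + "\n"
    outfile.write(line)


infile.close()
outfile.close()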

out3.txt

RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]
PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]
PYH31415.1 phenol monooxygenase [Aspergillus neoniger CBS 115656]
PUB86175.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86141.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86139.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79626.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79624.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72973.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72971.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PWY90296.1 phenol monooxygenase [Aspergillus sclerotioniger CBS 115572]
PWY63616.1 phenol monooxygenase [Aspergillus eucalypticola CBS 122712]

out5.txt

Aspergillus aculeatinus CBS 121060
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus vadensis CBS 113365
Aspergillus neoniger CBS 115656
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus sclerotioniger CBS 115572
Aspergillus eucalypticola CBS 122712

Edit

If you only want to print the unique lines, you can update the code like this:

infile = open('out3.txt', 'r')
outfile = open('out5.txt', 'w')
unique = []  # names already written, in order of first appearance

for line in infile:
    # Keep the text between the first '[' and the last ']'
    line = line[line.find('[') + 1:line.rfind(']')] + "\n"

    if line not in unique:
        unique.append(line)
        outfile.write(line)


infile.close()
outfile.close()

You will then get the following output (out5.txt):

Aspergillus aculeatinus CBS 121060
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus vadensis CBS 113365
Aspergillus neoniger CBS 115656
Aspergillus sclerotioniger CBS 115572
Aspergillus eucalypticola CBS 122712
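
The "line not in unique" check scans the whole list for every input line, which is fine for a file this small but gets slower as the number of distinct names grows. Below is a minimal sketch of the same order-preserving de-duplication that uses a set for the membership test. The helper name unique_in_order is hypothetical and not part of the answer's code.

```python
def unique_in_order(items):
    """Return items with duplicates dropped, keeping first-seen order (hypothetical helper)."""
    seen = set()   # constant-time membership checks on average
    result = []    # preserves the order of first appearance
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Names as they would be extracted from the brackets, duplicates included.
names = [
    "Aspergillus aculeatinus CBS 121060",
    "gamma proteobacterium symbiont of Ctena orbiculata",
    "gamma proteobacterium symbiont of Ctena orbiculata",
    "Aspergillus vadensis CBS 113365",
]

for name in unique_in_order(names):
    print(name)
```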

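Answer 2: (score: 0)

Here is a regular expression solution that works and retains the [ ]. The regular expression is r'(\[.+\])'.

The leading r marks a raw string, which stops Python from interpreting the \ characters as escape sequences.

The outer parentheses ( ) form a capture group, and whatever they capture ends up in the tuple returned by m.groups().

The [ and ] have to be escaped because they are regular expression metacharacters.

.+ means one or more (+) of any character (.).

Edit: this version uses OrderedDict to remove duplicates while preserving order (a set would not). It reads gash.txt and writes the result to out5.txt: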

import re
from collections import OrderedDict

# Keys keep insertion order, so each name is stored once, at its first occurrence
uniq = OrderedDict()

with open('gash.txt') as inf:
    for line in inf:
        m = re.search(r'(\[.+\])', line)
        if m:
            uniq[m.groups()[0]] = None

with open('out5.txt', 'w') as outf:
    print("\n".join(uniq.keys()), file=outf)
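
On Python 3.7 and later a plain dict also preserves insertion order, so the same de-duplication can be sketched without OrderedDict. The function name unique_bracketed and the sample lines are hypothetical, introduced only for illustration; like the code above, the pattern keeps the surrounding brackets.

```python
import re

# Same pattern as above: the capture group includes the brackets themselves.
PATTERN = re.compile(r"(\[.+\])")


def unique_bracketed(lines):
    """Return each distinct [ ... ] part once, in first-seen order (hypothetical helper)."""
    matches = (PATTERN.search(line) for line in lines)
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(m.group(1) for m in matches if m))


sample = [
    "PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]\n",
    "PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]\n",
    "PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]\n",
]

print("\n".join(unique_bracketed(sample)))
```

Moving the parentheses inside the escaped brackets, as in r'\[(.+)\]', would drop the brackets from the result if they are not wanted.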