Extracting specific words from a txt file with Python

Date: 2019-10-03 07:16:39

Tags: python-3.x

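I need some help with Python to extract a specific part of a txt file. From the file below, I would like to extract only the value that follows "Organism:", without the part between parentheses "()".

Here is an example of the file: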

Select item 1949871
1.

Amel_HAv3.1

Organism:
    Apis mellifera (honey bee)

Infraspecific name:
    Strain: DH4

Sex:
    male

Submitter:
    Uppsala University

Date:
    2018/09/10

Assembly level:
    Chromosome

Genome representation:
    full

RefSeq category:
    representative genome

GenBank assembly accession:
    GCA_003254395.2 (latest) 

RefSeq assembly accession:
    GCF_003254395.2 (latest) 

IDs:
    1949871 [UID] 7372188 [GenBank] 7434688 [RefSeq]

Select item 2027291
2.

Obir_v5.4

Organism:
    Ooceraea biroi (clonal raider ant)

Submitter:
    The Rockefeller University

Date:
    2018/10/23

Assembly level:
    Chromosome

Genome representation:
    full

RefSeq category:
    representative genome

GenBank assembly accession:
    GCA_003672135.1 (latest) 

RefSeq assembly accession:
    GCF_003672135.1 (latest) 

IDs:
    2027291 [UID] 7620928 [GenBank] 7654158 [RefSeq]

Select item 1769491
3.

Nlec1.1

Organism:
    Neodiprion lecontei (redheaded pine sawfly)

Sex:
    male

Submitter:
    University of Kentucky

Date:
    2018/06/21

Assembly level:
    Chromosome

Genome representation:
    full

RefSeq category:
    representative genome

GenBank assembly accession:
    GCA_001263575.2 (latest) 

RefSeq assembly accession:
    n/a

IDs:
    1769491 [UID] 6705508 [GenBank] 

Select item 294348
4.

Bter_1.0

In Python, I would like to keep only the part that follows "Organism:" (excluding the part between parentheses "()"), which for this example gives:

Apis mellifera
Neodiprion lecontei
Ooceraea biroi

Does anyone have an idea?

Thanks for your help.

2 answers:

Answer 0: (score: 1)

You can use a simple regex for this:

re.findall(r'Organism:\n\s*(.*) \(', text)
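
As a quick check, the pattern above can be tried on an inline sample. This is only a minimal sketch: `text` is a hypothetical string holding two records shortened from the question's data, not the asker's actual file.

```python
import re

# Hypothetical sample: two records shortened from the question's file.
text = """Organism:
    Apis mellifera (honey bee)

Submitter:
    Uppsala University

Organism:
    Ooceraea biroi (clonal raider ant)
"""

# Capture the text after "Organism:" up to the last " (" on the following line.
organisms = re.findall(r'Organism:\n\s*(.*) \(', text)
print(organisms)  # ['Apis mellifera', 'Ooceraea biroi']
```

Note that this pattern requires a parenthesised common name after the species name, so an entry without parentheses would not be matched.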

Answer 1: (score: 1)

Here is a more complete snippet using a regex (with the multiline flag passed explicitly):

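import re

# Read the whole file into a single string.
with open("your_file.txt", "r") as f:
    content = f.read()

# The lazy (.+?) stops before the whitespace preceding "(",
# so the captured name has no trailing space.
# re.M is kept from the original answer; it only affects ^ and $,
# which this pattern does not use.
matches = re.findall(r"Organism:\s*(.+?)\s*\(", content, re.M)

for m in matches:
    print(m)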
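
If some entries might lack the parenthesised common name, a line-by-line scan without a regex may be more forgiving. The following is a rough sketch under that assumption: the helper name `extract_organisms` and the `sample` data (including the `Homo sapiens` entry, which is not in the question's file) are hypothetical and used only for illustration.

```python
def extract_organisms(lines):
    """Return the name found on the first non-empty line after each 'Organism:' label."""
    names = []
    expect_name = False
    for line in lines:
        stripped = line.strip()
        if stripped == "Organism:":
            expect_name = True
        elif expect_name and stripped:
            # Drop the parenthesised common name, if there is one.
            names.append(stripped.split("(", 1)[0].strip())
            expect_name = False
    return names


# Hypothetical sample; the second entry has no parentheses on purpose.
sample = """Organism:
    Apis mellifera (honey bee)

Sex:
    male

Organism:
    Homo sapiens
""".splitlines()

print(extract_organisms(sample))  # ['Apis mellifera', 'Homo sapiens']
```

Since an open file object yields its lines when iterated, the same function could be called as `extract_organisms(f)` inside a `with open("your_file.txt") as f:` block.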