How to extract data from a text file based on a regular expression pattern

Date: 2016-04-15 13:32:50

Tags: python regex

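I need some help with a Python program. I have tried many things over several hours, but it does not work.

Can anyone help me?

This is what I need:

  • I have this file: http://www.filedropper.com, which contains information about proteins.
  • I want to keep only the proteins that match ...
  • From those proteins I only need ... (the 6-character text after >sp|) and the species (second line, between []).
  • I want ... and ... in ..., ultimately in a table.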

....

Human                         AAA111
Mouse                         BBB222
Fruit fly                     CCC333

What I have so far:

import re

def main():
    ReadFile()
    file = open ("file.txt", "r")
    FilterOnRegEx(file)

def ReadFile():
    try:
        file = open ("file.txt", "r")
    except IOError:
        print ("File not found!")
    except:
        print ("Something went wrong.")

def FilterOnRegEx(file):
    f = ("[AG].{4}GK[ST]")
    for line in file:
        if f in line:
            print (line)


main()

If you can help me, you are a hero!

2 answers:

Answer 0 (score: 3)

My first recommendation is to use a with statement when opening files:

with open("ploop.fa", "r") as file:
    FilterOnRegEx(file)

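The problem with your FilterOnRegEx method is the test if f in line (the pattern variable is named ploop in the code below). With string operands, the in operator searches line for the exact literal text of the pattern, not for anything the pattern would match.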

Instead you need to compile the text form to an re object, then search for matches:

def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            print (line)

This will help you to move forward.
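To see the difference between the two tests concretely, here is a minimal sketch. The sequence fragment is made up for illustration and does not come from your file:

```python
import re

# Hypothetical sequence fragment, invented for this example.
line = "MSTAGKVIGESGSGKSTLLNA"
motif = "[AG].{4}GK[ST]"

# `in` looks for the literal characters "[AG].{4}GK[ST]" inside the line.
literal_hit = motif in line

# re.search treats the string as a pattern and scans the whole line for it.
match = re.search(motif, line)
regex_hit = match is not None
matched_text = match.group(0) if match else None

print(literal_hit, regex_hit, matched_text)
```

The literal test fails because the line never contains square brackets or braces, while the regex search finds the stretch of residues that fits the motif.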

As a next step, I would suggest learning about generators. Printing the lines that match is great, but that doesn't help you to do further operations with them. I might change print to yield so that I could then process the data further such as extracting the parts you want and reformatting it for output.

As a simple demonstration:

def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            yield line

with open("ploop.fa", "r") as file:
    for line in FilterOnRegEx(file):
        print(line)


Addendum: I ran the code above against the sample data you posted. It prints some lines and skips others, so the regular expression matches some lines and not others. So far so good. However, the data you need for one protein is not all on a single line of the input, which means that filtering individual lines on the pattern is insufficient (unless the line breaks shown in the question are not the real ones). With the data laid out as in the question, you'll need a more robust, stateful parser that knows when a record begins, when it ends, and what role any given line plays within a record.
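As a rough sketch of what such a record-based parser could look like: the layout below is an assumption on my part (a header line starting with >sp|ACCESSION|, then a description line with the species in square brackets, then unindented sequence lines, with indented lines ignored), and the names read_records and matching_proteins are hypothetical. Adjust it to the real file.

```python
import re

MOTIF = re.compile(r"[AG].{4}GK[ST]")
HEADER = re.compile(r"^>sp\|([^|]+)\|")   # assumed header layout
SPECIES = re.compile(r"\[([^\]]+)\]")     # species assumed to sit between [ ]

def read_records(lines):
    """Yield (accession, species, sequence) for each '>sp|' record."""
    accession, species, chunks = None, None, []
    expect_description = False
    for raw in lines:
        line = raw.rstrip()
        header = HEADER.match(line)
        if header:
            # A new header closes the previous record, if there was one.
            if accession is not None:
                yield accession, species, "".join(chunks)
            accession, species, chunks = header.group(1), None, []
            expect_description = True
        elif accession is None:
            continue  # text before the first record
        elif expect_description:
            found = SPECIES.search(line)
            species = found.group(1) if found else None
            expect_description = False
        elif line and not line[0].isspace():
            chunks.append(line)  # unindented lines are treated as sequence
    if accession is not None:
        yield accession, species, "".join(chunks)

def matching_proteins(lines):
    """Yield (species, accession) for records whose full sequence has the motif."""
    for accession, species, sequence in read_records(lines):
        if MOTIF.search(sequence):
            yield species, accession

# Made-up sample in the assumed layout.
sample = """\
>sp|AAA111|EXAMPLE1_HUMAN
Example protein one [Human]
MSTAGKVIGESGSGKSTLLNA
>sp|BBB222|EXAMPLE2_MOUSE
Example protein two [Mouse]
MKLVPPLLQQ
""".splitlines()

rows = list(matching_proteins(sample))
print(rows)
```

Joining the sequence lines before searching matters: a motif that straddles a line break would be missed by a line-by-line filter.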

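Answer 1 (score: 0)

This seems to work for your sample text. I don't know whether a file can contain more than one extract, and I'm short on time here, so you will have to extend it if needed: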

#!python3
import re

Extract = {}

def match_notes(line):
    global _State
    pattern = r"^\s+(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'notes' not in Extract:
            Extract['notes'] = []

        Extract['notes'].append(m.group(1))
        return True
    else:
        _State = match_sp
        return False

def match_pattern(line):
    global _State
    pattern = r"^\s+Pattern: (.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern'] = m.group(1)
        _State = match_notes
        return True
    return False

def match_sp(line):
    global _State
    pattern = r">sp\|([^|]+)\|(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'sp' not in Extract:
            Extract['sp'] = []
        spinfo = {
            'accession code': m.group(1),
            'other code': m.group(2),
        }
        Extract['sp'].append(spinfo)
        _State = match_sp_note
        return True
    return False

def match_sp_note(line):
    """Second line of >sp paragraph"""
    global _State
    # Species sits between "[" and "]"; the closing delimiter must be "\]".
    pattern = r"^([^\[]*)\[([^\]]+)\]"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['note'] = m.group(1).strip()
        spinfo['species'] = m.group(2).strip()
        spinfo['sequence'] = ''
        _State = match_sp_sequence
        return True
    return False

def match_sp_range(line):
    """Last line of >sp paragraph"""
    global _State
    pattern = r"^\s+(\d+) - (\d+):\s+(.*)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['range'] = (m.group(1), m.group(2))
        spinfo['flags'] = m.group(3)
        _State = match_sp
        return True
    return False

def match_sp_sequence(line):
    """Middle block of >sp paragraph"""
    global _State

    spinfo = Extract['sp'][-1]

    if re.match(r"^\s", line):
        # End of sequence. Check for pattern, reset state for sp.
        # re.search scans the whole sequence; re.match would only test its start.
        spinfo['ag_4gkst'] = re.search(r"[AG].{4}GK[ST]", spinfo['sequence']) is not None

        _State = match_sp_range
        return False

    spinfo['sequence'] += line.rstrip()
    return True

def match_start(line):
    """Start of outer item"""
    global _State
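    # "\|" matches a literal "|" (assumed to separate the ID from the title);
    # an unescaped "|" acts as alternation and leaves groups 2 and 3 empty.
    pattern = r"^Hits for ([A-Z]+\d+)\|([^:]+) : (?:\[occurs (\w+)\])?"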
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern_id'] = m.group(1)
        Extract['title'] = m.group(2)
        Extract['occurrence'] = m.group(3)
        _State = match_pattern
        return True
    return False

_State = match_start

def process_line(line):
    while True:
        state = _State
        if state(line):
            return True

        if _State is not state:
            continue

        if len(line) == 0:
            return False

        print("Unexpected line:", line)
        print("State was:", _State)
        return False

def process_file(filename):
    with open(filename, "r") as infile:
        for line in infile:
            process_line(line.rstrip())

process_file("ploop.fa")
import pprint
pprint.pprint(Extract)
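To get from that dictionary to the species/accession table asked for in the question, something along these lines might do. It is only a sketch: format_table is a hypothetical helper, the column width of 30 is an arbitrary choice, and the example data is made up in the shape the parser above builds.

```python
def format_table(extract, width=30):
    """Return one 'species accession' line per entry flagged as matching."""
    rows = []
    for spinfo in extract.get('sp', []):
        if spinfo.get('ag_4gkst'):
            species = spinfo.get('species', '')
            # Left-justify the species so the accession codes line up.
            rows.append(f"{species:<{width}}{spinfo['accession code']}")
    return "\n".join(rows)

# Hypothetical data, not real parser output.
example = {'sp': [
    {'accession code': 'AAA111', 'species': 'Human', 'ag_4gkst': True},
    {'accession code': 'BBB222', 'species': 'Mouse', 'ag_4gkst': False},
]}

table = format_table(example)
print(table)
```

Only entries whose ag_4gkst flag is true are listed, so the filtering and the formatting stay separate steps.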