Question

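I am trying to go through a txt file, extract certain numbers, store them in a list, and then use those numbers to extract strings stored in the same file. My code works for some of my files, but I am suddenly getting a list index out of range error.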

Here is a sample of the part of the text file I am trying to extract from:

                     /note="tRNA-Arg2"
     tRNA            5573494..5573567
                     /locus_tag="Tery_R0035"
                     /product="tRNA-Arg"

or

     tRNA            complement(5630800..5630872)
                     /locus_tag="Tery_R0036"
                     /product="tRNA-His"

I am trying to get the numbers written after tRNA.

This is my code for extracting the numbers into a list:

def extract_numbers(line):
    # numbers found so far, kept as strings
    numbers = []
    # buffer holding the digits of the number currently being read
    digits = ""
    for c in line:
        if c.isdigit():
            # extend the current number
            digits += c
        else:
            # a non-digit ends the current number, if there is one
            if len(digits) > 0:
                numbers.append(digits)
                digits = ""
    # make sure a number at the very end of the line is added too
    if len(digits) > 0:
        numbers.append(digits)
    return numbers

and this last function applies it to the file itself:

def extract_tRNA(path):
    with io.open(path, mode="r", encoding="utf-8") as file:
        genome = file.readlines()
        start_stop = []
        for line in genome:
            if "tRNA" in line[0:21]:
                numbers = extract_numbers(line[21:])
                start_stop.append((int(numbers[0]), int(numbers[1])))
        return start_stop

Then I use this to run it:

work_dir = "/Users/..."
for path in glob.glob(os.path.join(work_dir, "*.gbff")):

    sequences = extract_seq(path)
    tRNA_loc = extract_tRNA(path)
    extract_genes(path, tRNA_loc, sequences)
    print(path)

Is it my files or my code? I am also not sure whether there is an easier way to do the same thing.

Thanks for your help!

Update: attempt using a regular expression:

work_dir = "where my files are"
for path in glob.glob(os.path.join(work_dir, "*.gbff")):
    with io.open(path, mode="r", encoding="utf-8") as file:
        genome = file.readlines()
        for line in genome:
            if "tRNA" in line[0:21]:
                p = re.compile(r'\d+')  # \d means digit and + means one or more
                m = p.findall(line)
        print(m)

Answer 1

Based on your description of what you want to achieve, this should work. Note that file.txt is the sample you included above:

import re

with open("file.txt") as f:
    data = f.readlines()

    numberList = []

    for line in data:
        dataList = line.split()  # split the line into whitespace-separated words
        try:
            # the token right after "tRNA" holds the numbers
            numberIndex = dataList.index("tRNA") + 1
            numberList.append(dataList[numberIndex])
        except (ValueError, IndexError):
            # "tRNA" is not a word on this line, or nothing follows it
            continue

# The above cleans your data of all other numbers, e.g. "Tery_R0035"

# Taken from top answer (@rajah9)
p = re.compile(r'\d+')  # \d means digit and + means one or more
for numData in numberList:
    m = p.findall(numData)
    print(m)
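
If you also need to know whether a feature sits on the complement strand, the split and the digit search can be folded into one pattern. This is only a sketch: the name `trna_span` is made up for illustration, and it assumes every location is a plain `start..stop`, optionally wrapped in `complement(...)`, as in the two samples from the question. Other location forms simply return None instead of raising.

```python
import re

# Assumption: locations look like start..stop or complement(start..stop).
TRNA_LINE = re.compile(r"^\s*tRNA\s+(complement\()?(\d+)\.\.(\d+)")


def trna_span(line):
    """Return (start, stop, on_complement), or None if the line does not match."""
    match = TRNA_LINE.match(line)
    if match is None:
        return None
    on_complement = match.group(1) is not None
    return int(match.group(2)), int(match.group(3)), on_complement


lines = [
    '                     /note="tRNA-Arg2"',
    "     tRNA            5573494..5573567",
    "     tRNA            complement(5630800..5630872)",
]
# Qualifier lines such as /note="tRNA-Arg2" are skipped because they do not start with tRNA.
spans = [span for span in map(trna_span, lines) if span is not None]
print(spans)
```

Returning None for anything unexpected keeps the caller in charge of deciding whether to skip the line or report it.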

Answer 2

I assume you want the list of strings returned by the function extract_numbers.

Python has a powerful feature called regular expressions (documentation).

Here is an example that extracts all strings of one or more digits.

import re

line = "     tRNA            5573494..5573567"
p = re.compile(r'\d+') # \d means digit and + means one or more
m = p.findall(line)
m # returns ['5573494', '5573567']
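
As for the IndexError itself: findall, like extract_numbers, returns a list with fewer than two entries whenever a line has "tRNA" in its first 21 columns but fewer than two numbers after it, and then numbers[1] fails. A minimal sketch of a guard, assuming the same 21-column layout as the code in the question (the name `parse_trna_location` is hypothetical):

```python
import re

NUMBER = re.compile(r"\d+")


def parse_trna_location(line):
    """Return (start, stop) for a tRNA feature line, or None if it cannot be parsed."""
    # Assumption: the feature key sits in the first 21 columns, as in the question.
    if "tRNA" not in line[:21]:
        return None
    numbers = NUMBER.findall(line[21:])
    if len(numbers) < 2:
        # Fewer than two numbers: skip the line instead of indexing past the end.
        return None
    return int(numbers[0]), int(numbers[1])


# Build the sample lines explicitly so the column widths are unambiguous.
prefix = " " * 5 + "tRNA" + " " * 12
print(parse_trna_location(prefix + "complement(5630800..5630872)"))
print(parse_trna_location(prefix + "5573494"))
```

Printing the lines that come back as None in the failing files should show which entries differ from the two samples.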

How do I fix this list index out of range error?
