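I'm trying to go through a txt file and extract certain numbers, store them in a list, and then use those numbers to extract strings stored in the same file. My code works for some of my files, but I'm suddenly getting a "list index out of range" error.
Here is an example of the part of the text file I'm trying to extract from: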
/note="tRNA-Arg2"
tRNA 5573494..5573567
/locus_tag="Tery_R0035"
/product="tRNA-Arg"
or
tRNA complement(5630800..5630872)
/locus_tag="Tery_R0036"
/product="tRNA-His"
I'm trying to get the numbers that are written after tRNA.
This is my code for extracting the numbers into a list:
def extract_numbers(line):
    # empty list
    numbers = []
    # creates a buffer (temporary space)
    digits = ""
    # for each character in the line
    for c in line:
        # if it's a digit
        if c.isdigit():
            # add the character to the buffer
            digits += c
        # if it isn't a digit
        else:
            # if there is something in the buffer (i.e. it is not empty)
            if len(digits) > 0:
                # add the buffer to the numbers list
                numbers.append(digits)
                # empty the buffer again
                digits = ""
    # to make sure the last number is added to the list
    if len(digits) > 0:
        numbers.append(digits)
    return numbers
and I use this last function to apply it to the file itself:
import io

def extract_tRNA(path):
    with io.open(path, mode="r", encoding="utf-8") as file:
        genome = file.readlines()
    start_stop = []
    for line in genome:
        if "tRNA" in line[0:21]:
            numbers = extract_numbers(line[21:])
            start_stop.append((int(numbers[0]), int(numbers[1])))
    return start_stop
Then I use this to run it:
import glob
import os

work_dir = "/Users/..."
for path in glob.glob(os.path.join(work_dir, "*.gbff")):
    sequences = extract_seq(path)
    tRNA_loc = extract_tRNA(path)
    extract_genes(path, tRNA_loc, sequences)
    print(path)
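Is the problem in my files or in my code? I'm also not sure whether there is a simpler way to do the same thing.
Thanks for your help!
Update, trying a regular expression: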
import glob
import io
import os
import re

work_dir = "where my files are"
for path in glob.glob(os.path.join(work_dir, "*.gbff")):
    with io.open(path, mode="r", encoding="utf-8") as file:
        genome = file.readlines()
    for line in genome:
        if "tRNA" in line[0:21]:
            p = re.compile(r'\d+')  # \d means digit and + means one or more
            m = p.findall(line)
            print(m)
Answer 0 (score: 1)
Based on your description of what you are trying to achieve, this should work. Note that file.txt is the sample you included above:
import re

with open("file.txt") as f:
    data = f.readlines()

numberList = []
for line in data:
    dataList = line.split()  # words separated by spaces, split into a list
    try:
        # the token written right after tRNA holds the numbers
        numberIndex = dataList.index("tRNA") + 1
        numberList.append(dataList[numberIndex])
    except (ValueError, IndexError):
        # tRNA is not a token in this line, or nothing follows it
        continue

# The above cleans your data of all other numbers, e.g. "Tery_R0035"
# Taken from top answer (@rajah9)
p = re.compile(r'\d+')  # \d means digit and + means one or more
for numData in numberList:
    m = p.findall(numData)
    print(m)
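The same split-then-search idea can be packaged as a function that works on any list of lines. This is only a sketch: the name `numbers_after_tRNA` is made up for illustration, and it assumes, as in the sample, that the location is the whitespace-separated token immediately after `tRNA`.

```python
import re

DIGITS = re.compile(r"\d+")  # one or more consecutive digits

def numbers_after_tRNA(lines):
    """Return the digit runs found in the token that follows 'tRNA' on each line."""
    results = []
    for line in lines:
        tokens = line.split()
        if "tRNA" not in tokens:
            # e.g. /product="tRNA-Arg" is a single token, not the bare word tRNA
            continue
        position = tokens.index("tRNA") + 1
        if position >= len(tokens):
            # tRNA was the last token, so there is nothing to read
            continue
        results.append(DIGITS.findall(tokens[position]))
    return results

# The sample lines from the question
sample = [
    '/note="tRNA-Arg2"',
    'tRNA 5573494..5573567',
    '/locus_tag="Tery_R0035"',
    '/product="tRNA-Arg"',
    'tRNA complement(5630800..5630872)',
]
print(numbers_after_tRNA(sample))
# prints [['5573494', '5573567'], ['5630800', '5630872']]
```

Checking explicitly for the token avoids relying on a broad `except`, so unrelated errors are not silently swallowed.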
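Answer 1 (score: 1)
I assume you want the list of strings that is returned from the function extract_numbers.
Python has a powerful feature called regular expressions (documentation).
Here is an example that extracts every string of one or more digits.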
import re

line = " tRNA 5573494..5573567"
p = re.compile(r'\d+')  # \d means digit and + means one or more
m = p.findall(line)
m  # returns ['5573494', '5573567']
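Applied to the question's `extract_tRNA`, a regular expression can also guard against the "list index out of range" error. The failing file is not shown, so the cause is an assumption here: the error would occur whenever a line has `tRNA` in its first 21 characters but fewer than two numbers after it, because `numbers[1]` then does not exist. The sketch below skips such lines instead of indexing blindly. `extract_tRNA_from_lines` and the padded test lines are hypothetical, and the pattern only covers simple `start..stop` locations like those in the sample.

```python
import re

# Matches "start..stop"; assumes simple locations as in the question's sample
LOCATION = re.compile(r"(\d+)\.\.(\d+)")

def extract_tRNA_from_lines(lines):
    """Collect (start, stop) pairs from lines with tRNA in the first 21 characters."""
    start_stop = []
    for line in lines:
        if "tRNA" not in line[0:21]:
            continue
        match = LOCATION.search(line[21:])
        if match is None:
            # No start..stop pair on this line, so skip it rather than fail
            continue
        start_stop.append((int(match.group(1)), int(match.group(2))))
    return start_stop

# Hypothetical lines padded so the location starts at character 21
prefix = " " * 5 + "tRNA".ljust(16)
lines = [
    prefix + "5573494..5573567",
    prefix + "complement(5630800..5630872)",
    prefix + "12345",  # made-up line with a single number
]
print(extract_tRNA_from_lines(lines))
# prints [(5573494, 5573567), (5630800, 5630872)]
```

To find which lines trip the original code, printing the skipped lines in the `match is None` branch is one way to inspect them.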