Question

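I'm working on code that calculates various thermodynamic properties for a given set of molecules. To do this, I have to plug 9 coefficients into a set of equations to get the values I need. These coefficients differ from molecule to molecule and can be retrieved from the NASA Thermobuild database, which has the following format: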

The specific numbers needed for the calculation are the coefficients in the record below, shown as a code block so that it is tidier and closer to the actual layout in the database .txt file:

C2Cl4 Tetrachloroethylene  HF298=-5.034 kcal Burcat G3B3                         
3 T05/08 C   2.00CL  4.00    0.00    0.00    0.00 0  165.8322000     -21064.348
 50.000   200.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01-6.791370520D-04
 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01
 200.000  1000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
 4.940446670D+04-1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05
 1.410593520D-08-3.663931630D-12 0.000000000D+00-3.353235260D+02-2.878634227D+01
 1000.000  6000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-3.067008915D+05-1.128336557D+03 1.681089243D+01-3.159107946D-04 6.850908950D-08
-7.749796920D-12 3.556100470D-16 0.000000000D+00-1.944193938D+03-5.966771040D+01

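There are hundreds of molecules in the database, but I only need the coefficients for about 50 of them. I need a function that can go through the database, find the desired molecular species from a pre-written list, and then pick out and return the coefficients for each molecule so that I can use them in my calculations (and convert the "D+0%N" to "E+0%N"; I'm not sure why this database uses D instead of E for scientific notation).

I'm not at all familiar with SQL, so I've just been focusing on basic Python search functions. This is what I have so far: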

import pandas as pd
import csv
import math
import numpy as np
species_list=[]
species=pd.read_table('Species list.txt') #list of molecular species I need coefficients for
species_temp=species['Species']
for i in range(len(species_temp)):
    species_list.append(species_temp[i])
with open('NEWNASA.TXT','rt') as database: #loads massive coefficient database
    for species_name in species_list:
        species_name=species_name+" " #to avoid returning ionic forms
        for line in database:
            if species_name in line:
                print(line) #test to see if it's working

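However, a) this stops working after it finds the first molecular species, and b) I'm still not sure how to tell the code to pick out the specific coefficients I need for my calculations. I imagine it will involve regular expressions (which I also don't have much experience with) and indexing, but that's as far as my understanding goes. Any pointers or suggestions would be greatly appreciated!

Thanks!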

Answer 1

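An open file (database) is a one-shot iterator: you can't iterate over it more than once. The solution is either to swap the for loops or, if the file isn't too large, to load all of its lines into a list.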

for line in database:
    for species_name in species_list:
        species_name = species_name + " "
        if species_name in line:
            print(line)
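
The second option, loading the lines into a list so that they can be scanned once per name, could be sketched roughly as follows. The function name find_header_lines and the placeholder in-memory text are assumptions made for illustration; with the real database you would pass in the open file object instead.

```python
import io


def find_header_lines(database, species_list):
    """Return every line that contains one of the species names."""
    # A file object can only be iterated once, so copy its lines into a list.
    lines = list(database)
    matches = []
    for species_name in species_list:
        needle = species_name + " "  # trailing space skips ionic forms
        for line in lines:  # a list can be rescanned for every name
            if needle in line:
                matches.append(line.rstrip())
    return matches


# Hypothetical stand-in for the real database file (placeholder text only).
sample = io.StringIO(
    "Foo first placeholder header\n"
    "Foo+ ionic placeholder header\n"
    "Bar second placeholder header\n"
)
matches = find_header_lines(sample, ["Bar", "Foo"])
print(matches)
```

This keeps the original loop order at the cost of holding the whole file in memory, which is why it only suits files that are not too large.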

Answer 2

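I'll address extracting the data you want from a record in the text database.

Once you find a record of interest (if species_name in line:), you need to advance to the seventh and eighth lines of that record and extract the coefficients.

The record format indicates that each line is 80 characters long and that each number you are interested in is 16 characters long. So split the seventh and eighth lines into five equal parts (see "Split a string to even sized chunks") and convert each part to a float.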

Setup:

import io

r = '''C2Cl4 Tetrachloroethylene  HF298=-5.034 kcal Burcat G3B3
3 T05/08 C   2.00CL  4.00    0.00    0.00    0.00 0  165.8322000     -21064.348
 50.000   200.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01-6.791370520D-04
 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01
 200.000  1000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
 4.940446670D+04-1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05
 1.410593520D-08-3.663931630D-12 0.000000000D+00-3.353235260D+02-2.878634227D+01
 1000.000  6000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-3.067008915D+05-1.128336557D+03 1.681089243D+01-3.159107946D-04 6.850908950D-08
-7.749796920D-12 3.556100470D-16 0.000000000D+00-1.944193938D+03-5.966771040D+01'''

db = io.StringIO(r)
species_name = 'Tetrachloroethylene'

Process:

def get_coefficients(line):
    '''Split line into 5 floats.

    line has five 16 character numbers.
    '''
    #coefficients = [line[i:i+16] for i in range(0,len(line),16)]
    coefficients = [line[i:i+16] for i in range(0,80,16)]    # 80 cols/line
    coefficients = map(lambda q: q.replace('D','E'), coefficients)
    coefficients = [float(thing) for thing in coefficients]
    return coefficients

for line in db:
    if species_name in line:    # first line of the record
        # skip to the seventh line of the record
        for _ in range(6):
            line = next(db)
        coefficients_1 = get_coefficients(line)
        print(coefficients_1)
        # skip to the eighth line of the record
        line = next(db)
        coefficients_2 = get_coefficients(line)
        print(coefficients_2)

You will need to address the problem that @FMc brought up. Currently, your code iterates over the names in your list and, for each name, iterates over the whole database file looking for it. To keep looking for names, you would need to start again at the beginning of the file by setting the file pointer to the beginning with database.seek(0).

That would be very inefficient. As @Fmc shows, you need to iterate over each line of the database and check whether it contains one of your species names. To make that check faster, species_list should be a set.

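Unfortunately, there is a discrepancy between the database record format for the first line and your example record:

In the example record, the first line contains the species formula and name. The database record format table suggests that the first line contains the name or the formula.
The database record format indicates that the name or formula is in the first 17 characters of the first line, but in your example the name ends at character 26.

If the first line of each record is some variation on your example and the record format definition, maybe you could try something like this: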

species_list = {'Tetrachloroethylene', 'Bar', 'Foo'}
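
One way the rest of that idea might be sketched is a single pass over the file that checks the words of each line against the set and, on a hit, reads ahead to the seventh and eighth lines of the record. The function name collect_coefficients and the whitespace-based matching are assumptions for illustration, not something the record format guarantees; names containing spaces, or first lines laid out differently, would need other handling. The sample text is the first eight lines of the example record.

```python
import io


def get_coefficients(line):
    """Split an 80-column line into five 16-character floats."""
    chunks = [line[i:i + 16] for i in range(0, 80, 16)]
    return [float(chunk.replace("D", "E")) for chunk in chunks]


def collect_coefficients(db, species_set):
    """Map each species found in db to the numbers on record lines 7 and 8."""
    found = {}
    for line in db:
        # Assumption: the species name is one whitespace-separated word.
        hits = species_set.intersection(line.split())
        if not hits:
            continue
        for _ in range(6):  # advance from line 1 to line 7 of the record
            line = next(db)
        seventh = get_coefficients(line)
        eighth = get_coefficients(next(db))
        for name in hits:
            found[name] = seventh + eighth
    return found


record = '''C2Cl4 Tetrachloroethylene  HF298=-5.034 kcal Burcat G3B3
3 T05/08 C   2.00CL  4.00    0.00    0.00    0.00 0  165.8322000     -21064.348
 50.000   200.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01-6.791370520D-04
 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01
 200.000  1000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
 4.940446670D+04-1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05
 1.410593520D-08-3.663931630D-12 0.000000000D+00-3.353235260D+02-2.878634227D+01
'''

species_list = {'Tetrachloroethylene', 'Bar', 'Foo'}
result = collect_coefficients(io.StringIO(record), species_list)
print(result)
```

Because membership tests on a set take roughly constant time, each line of the database is examined once regardless of how many species are in the list.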
