Slicing lines and saving the parameters to different files

Time: 2017-09-03 22:44:43

Tags: python parsing io slice text-extraction

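I have a g.out file (pasted below).

This file contains several FINAL OPTIMIZED GEOMETRY blocks that I want to extract.

For a given FINAL OPTIMIZED GEOMETRY, the highlighted values are the ones I want to extract:

[Screenshot: a FINAL OPTIMIZED GEOMETRY block with the values to extract highlighted]

I have managed to extract the first three in the program below: VOLUME, A and B.

My code: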

import os
import sys
import re

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'


VOLUMES = []
P0 = []
P2 = []
atomic_number = []
coord_x = []
coord_y = []
coord_z = []

with open('g.out') as file:
    for line in file:
        if re.match(initial_pattern, line):
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)

        if re.match(middle_pattern, line):
            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

            first_coord_line = file.next()
            print first_coord_line

        if re.match(end_pattern, line):
            end_pattern = line
            print end_pattern
            all_coordinates =  [first_coord_line:end_pattern]
            for line in all_coordinates:
              del('F ')             # delete those that contain 'F '
              aux2 =  line.split()
              coords = []


sys.exit()
#Template = 
"""
some stuff
other stuff
p0      p2
3
A    B        C         D
E    F        G         H
I    J        K         L
other stuff
some other stuff
"""

I cannot extract the COORDINATES because I cannot find a way to slice from first_coord_line to end_pattern, as in this pseudocode:

if re.match(end_pattern, line):
    end_pattern = line
    print end_pattern
    all_coordinates =  [first_coord_line:end_pattern]
    for line in all_coordinates:
      del('F ')             # delete those that contain 'F '
      aux2 =  line.split()  # split lines
      atomic_number = aux2[2]
      coord_x = aux2[4]
      coord_y = aux2[5]
      coord_z = aux2[6]

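Is there a way to implement this pseudocode?

In my code, VOLUMES, P0, P2, atomic_number, coord_x, coord_y and coord_z are initialized as lists because, before the for loop ends, I want to save this information to separate files, each named "VOLUME.inp":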

#Template = 
"""
some stuff
other stuff
p0      p2
3
A    B        C         D
E    F        G         H
I    J        K         L
other stuff
some other stuff
"""

where p0 and p2 are the values extracted in my code (the 2nd and 3rd highlighted values in the screenshot), and A - L are the atomic_number and the coord_x, coord_y, coord_z values.

Is there a way to achieve this?

The g.out file:

more lines
more lines
more lines
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.28373604     6.28373604     6.28373604    46.646397  46.646397  46.646397
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM              X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.924094276183E-01 -7.590572381674E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
      7 F   8 O    -7.590572381674E-03  2.500000000000E-01 -4.924094276183E-01
      8 F   8 O     4.924094276183E-01  7.590572381674E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.924094276183E-01  7.590572381674E-03
     10 F   8 O     7.590572381674E-03 -2.500000000000E-01  4.924094276183E-01
 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000
 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM              X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
      7 F   8 O     7.574276095166E-02  4.090760942850E-01 -8.333333333333E-02
      8 F   8 O     4.090760942850E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.574276095166E-02  8.333333333333E-02
     10 F   8 O    -7.574276095166E-02 -4.090760942850E-01  8.333333333333E-02
 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
more lines
more lines
more lines
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   121.143469 - DENSITY  2.740 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32229536     6.32229536     6.32229536    46.436583  46.436583  46.436583
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM              X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.927088991116E-01 -7.291100888437E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
      7 F   8 O    -7.291100888437E-03  2.500000000000E-01 -4.927088991116E-01
      8 F   8 O     4.927088991116E-01  7.291100888437E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.927088991116E-01  7.291100888437E-03
     10 F   8 O     7.291100888437E-03 -2.500000000000E-01  4.927088991116E-01
 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000
 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        363.43040599)
         A              B              C           ALPHA      BETA       GAMMA
     4.98494429     4.98494429    16.88768068    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM              X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
      7 F   8 O     7.604223244490E-02  4.093755657782E-01 -8.333333333333E-02
      8 F   8 O     4.093755657782E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.604223244490E-02  8.333333333333E-02
     10 F   8 O    -7.604223244490E-02 -4.093755657782E-01  8.333333333333E-02
 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
more lines
more lines
more lines

Updated code:

Using the flag approach suggested by @nos, the updated code is able to extract the information. VOLUMES is a list of 2 elements. The following lists are the result:

VOLUMES = ['119.823364', '121.143469']
P0 = ['4.97568007', '4.98494429']
P2 = ['16.76591397', '16.88768068']
Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8']

The second part of this post is about writing this information (P0, P2, ATOMIC_NUMBERS, Xs, Ys, Zs) into two VOLUME.inp files. In other words, like this:

V_119.823364.inp file:

some stuff
other stuff
4.97568007 4.98494429
3
20 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
6 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
8 -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
other stuff

V_121.143469.inp file:

some stuff
other stuff
4.97568007 4.98494429
3
20 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
6 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
8 -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
other stuff

Following @nos's atoms_per_frame and atoms_all_frames suggestion, I tried to write these files, but I am having difficulty writing the values to the file element by element.

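1 Answer:

Answer 0: (score: 1)

There are many ways to do this. The most important thing is to track whether you have already passed mid_pattern, because the same coordinate pattern appears both before and after it, and you only need the occurrence after it.

For example, you could

  1. Set a flag so that we know mid_pattern has been matched
  2. Branch on whether end_pattern matches
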
    passed_mid_point = False
    ...
    if re.match(middle_pattern, line):
        passed_mid_point = True
        # do what you need
        ...
    if re.match(end_pattern, line):
        passed_mid_point = False # so you can process a new frame
        # do what you need after end pattern is matched
        ...
    elif passed_mid_point:
        # parse the coordinates
        terms = line.split()
        # guard on length: separator lines such as '*****' split into a single term
        if len(terms) >= 7 and terms[1] == 'T':
            x = float(terms[4])
            y = float(terms[5])
            z = float(terms[6])
    
  3. Alternatively, you can combine the flag with a pattern match, like so:

        passed_mid_point = False
        coord_pattern = r'      \d+ T '
        ...
        if re.match(middle_pattern, line):
            passed_mid_point = True
            # do what you need
            ...
        if re.match(end_pattern, line):
            passed_mid_point = False # so you can process a new frame
            # do what you need after end pattern is matched
            ...
        if passed_mid_point and re.match(coord_pattern, line):
            # parse the coordinates
            terms = line.split()
            if terms and terms[1] == 'T':
                x = float(terms[4])
                y = float(terms[5])
                z = float(terms[6])
    

    The coordinate matching can also be done entirely with a regular expression

    sci_num = r'-?\d+\.\d*E[+\-]\d+'
    coord_pattern = r'\s+\d+\sT\s+\d+\s+[A-Z]+\s+(%s)\s+(%s)\s+(%s)' % (sci_num, sci_num, sci_num)
    coord_re = re.compile(coord_pattern)
    m = coord_re.match(line)  # groups live on the match object, not on the compiled pattern
    if m:
        x = float(m.group(1))
        y = float(m.group(2))
        z = float(m.group(3))
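
As a quick check of the regex idea, here is a minimal runnable sketch (Python 3). The helper name `parse_coord_line` and the two sample rows are illustrative assumptions, and the pattern is loosened to `\s*` and `\s+` because the exact column widths of g.out are not known here. It also captures the atomic number, which the question asks for.

```python
import re

# A number in scientific notation, e.g. -4.090760942850E-01
SCI_NUM = r'-?\d+\.\d*E[+-]\d+'
# Assumed row layout: index, T flag, atomic number, element symbol, x, y, z
COORD_RE = re.compile(
    r'\s*\d+\s+T\s+(\d+)\s+[A-Z]+\s+(%s)\s+(%s)\s+(%s)' % (SCI_NUM, SCI_NUM, SCI_NUM)
)


def parse_coord_line(line):
    """Return (atomic_number, x, y, z) for a 'T' row, or None for any other line."""
    m = COORD_RE.match(line)
    if m is None:
        return None
    return m.group(1), float(m.group(2)), float(m.group(3)), float(m.group(4))


# Illustrative rows; the spacing is an assumption, the values come from the question.
sample_t = '      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02'
sample_f = '      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02'

print(parse_coord_line(sample_t))  # a tuple for the 'T' row
print(parse_coord_line(sample_f))  # None, because 'F' rows do not match
```

Because the pattern requires the `T` flag, the `F` rows are dropped without a separate delete step.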
    

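    For recording the data, it is best to keep track of which frame each set of atom coordinates belongs to. For example, you could create atom_frames at the start and keep appending lists of atom coordinates to it, where each list corresponds to one frame. Overall, it would look something like this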

    atom_frames = []
    for i in range(50): # here I assume 50 frames
        current_frame = []
        for a in atoms_in_this_frame:
            current_frame.append(a)  # a could be (x, y, z) of an atom
        atom_frames.append(current_frame)
    

    Here I am simply looping over a frame count. In your case, you could create current_frame = [] when you hit mid_pattern, and call atom_frames.append(current_frame) when you hit end_pattern. Hope it makes sense.
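
Putting the flag and the per-frame bookkeeping together, one possible sketch (Python 3) is shown below. The names `parse_frames`, `render_inp` and `write_frames` are hypothetical, the patterns are loosened versions of the ones in the question, and the shortened `SAMPLE` only imitates the g.out layout. It also assumes that the `3` line in the template is the number of atoms written, which the question does not state.

```python
import os
import re

INITIAL = re.compile(r'^\s*FINAL OPTIMIZED GEOMETRY')
VOLUME = re.compile(r'PRIMITIVE CELL.*VOLUME=\s*(\S+)')
MIDDLE = re.compile(r'^\s*CRYSTALLOGRAPHIC CELL')
END = re.compile(r'^\s*T = ATOM BELONGING TO THE ASYMMETRIC UNIT')


def parse_frames(lines):
    """Return one dict per optimized geometry: volume, p0, p2 and 'T' atoms."""
    frames, frame, passed_mid_point = [], None, False
    lines = iter(lines)
    for line in lines:
        if INITIAL.match(line):
            frame = {'atoms': []}          # a new frame starts here
            continue
        if frame is None:
            continue                       # outside any frame
        volume = VOLUME.search(line)
        if volume and not passed_mid_point:
            frame['volume'] = volume.group(1)
        elif MIDDLE.match(line):
            passed_mid_point = True
            next(lines)                    # skip the 'A B C ALPHA BETA GAMMA' header
            params = next(lines).split()
            frame['p0'], frame['p2'] = params[0], params[2]
        elif END.match(line):
            frames.append(frame)           # frame complete, reset for the next one
            frame, passed_mid_point = None, False
        elif passed_mid_point:
            terms = line.split()
            if len(terms) == 7 and terms[1] == 'T':
                frame['atoms'].append((terms[2], terms[4], terms[5], terms[6]))
    return frames


def render_inp(frame):
    """Fill the question's template for one frame (atom count line is an assumption)."""
    out = ['some stuff', 'other stuff',
           '%s      %s' % (frame['p0'], frame['p2']),
           str(len(frame['atoms']))]
    out += ['    '.join(atom) for atom in frame['atoms']]
    out += ['other stuff', 'some other stuff']
    return '\n'.join(out) + '\n'


def write_frames(frames, directory='.'):
    """Write one V_<volume>.inp file per frame."""
    for frame in frames:
        path = os.path.join(directory, 'V_%s.inp' % frame['volume'])
        with open(path, 'w') as handle:
            handle.write(render_inp(frame))


SAMPLE = """\
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
"""

frames = parse_frames(SAMPLE.splitlines())
print(render_inp(frames[0]))
```

For the real file, something like `with open('g.out') as f: write_frames(parse_frames(f))` would produce one file per frame. Keeping each frame in its own dict avoids having to split flat lists such as Xs or ATOMIC_NUMBERS back into groups when writing.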