Extracting and processing information between two strings that are repeated many times in a file

Time: 2017-09-06 19:32:58

Tags: python list parsing file-io text-extraction

I have a file with this structure:

 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   122.771603 - DENSITY  2.704 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32540491     6.32540491     6.32540491    46.774144  46.774144  46.774144
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.912600492192E-01 -8.739950780750E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.912600492193E-01 -8.739950780750E-03
      7 F   8 O    -8.739950780750E-03  2.500000000000E-01 -4.912600492193E-01
      8 F   8 O     4.912600492193E-01  8.739950780750E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.912600492193E-01  8.739950780750E-03
     10 F   8 O     8.739950780750E-03 -2.500000000000E-01  4.912600492193E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)
         A              B              C           ALPHA      BETA       GAMMA
     5.02162261     5.02162261    16.86554607    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    0.000000000000E+00  0.000000000000E+00 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.079267158859E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.459338255258E-02 -8.333333333333E-02
      7 F   8 O     7.459338255258E-02  4.079267158859E-01 -8.333333333333E-02
      8 F   8 O     4.079267158859E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.459338255258E-02  8.333333333333E-02
     10 F   8 O    -7.459338255258E-02 -4.079267158859E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT


more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.28373604     6.28373604     6.28373604    46.646397  46.646397  46.646397
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.924094276183E-01 -7.590572381674E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
      7 F   8 O    -7.590572381674E-03  2.500000000000E-01 -4.924094276183E-01
      8 F   8 O     4.924094276183E-01  7.590572381674E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.924094276183E-01  7.590572381674E-03
     10 F   8 O     7.590572381674E-03 -2.500000000000E-01  4.924094276183E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
      7 F   8 O     7.574276095166E-02  4.090760942850E-01 -8.333333333333E-02
      8 F   8 O     4.090760942850E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.574276095166E-02  8.333333333333E-02
     10 F   8 O    -7.574276095166E-02 -4.090760942850E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   121.143469 - DENSITY  2.740 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32229536     6.32229536     6.32229536    46.436583  46.436583  46.436583
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.927088991116E-01 -7.291100888437E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
      7 F   8 O    -7.291100888437E-03  2.500000000000E-01 -4.927088991116E-01
      8 F   8 O     4.927088991116E-01  7.291100888437E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.927088991116E-01  7.291100888437E-03
     10 F   8 O     7.291100888437E-03 -2.500000000000E-01  4.927088991116E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        363.43040599)
         A              B              C           ALPHA      BETA       GAMMA
     4.98494429     4.98494429    16.88768068    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
      7 F   8 O     7.604223244490E-02  4.093755657782E-01 -8.333333333333E-02
      8 F   8 O     4.093755657782E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.604223244490E-02  8.333333333333E-02
     10 F   8 O    -7.604223244490E-02 -4.093755657782E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

I want to extract the CRYSTALLOGRAPHIC CELL information, but only the one that belongs to a FINAL OPTIMIZED GEOMETRY block.

The following three patterns:

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

make it possible to locate the information.
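As a small illustration of how these three patterns behave, the sketch below runs them against a few lines copied from the file. The helper name classify and the sample_lines list are assumptions made for this example, and it is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# The three patterns from the question, written as raw strings.
initial_pattern = r'^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = r'^ CRYSTALLOGRAPHIC CELL '
end_pattern = r'^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

# Sample lines copied from the file, with the trailing newline that
# iterating over a file object leaves in place.
sample_lines = [
    ' FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3\n',
    ' CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)\n',
    ' COORDINATES IN THE CRYSTALLOGRAPHIC CELL\n',
    ' T = ATOM BELONGING TO THE ASYMMETRIC UNIT\n',
]


def classify(line):
    """Return the name of the first pattern that matches, or None (hypothetical helper)."""
    if re.match(initial_pattern, line):
        return 'initial'
    if re.match(middle_pattern, line):
        return 'middle'
    if re.match(end_pattern, line):
        return 'end'
    return None


labels = [classify(line) for line in sample_lines]
print(labels)
```

Note that middle_pattern also matches the CRYSTALLOGRAPHIC CELL line of the initial geometry, so on its own it cannot tell the initial block from the optimized ones.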

First, I define a flag passed_mid_point = False

Then the following part of the program extracts the VOLUME of the PRIMITIVE CELL of each FINAL OPTIMIZED GEOMETRY:

VOLUMES = []
with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()
            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)
print 'VOLUMES = ', VOLUMES

The result is:

VOLUMES = ['119.823364', '121.143469']

This is correct. Note that the initial 122.771603 (see the file above) is, as expected, not collected in the list.

The problem appears when extracting the A and C parameters (P0 and P2 in my program) of the CRYSTALLOGRAPHIC CELL of the FINAL OPTIMIZED GEOMETRY, together with the coordinates:

        if re.match(middle_pattern, line):
            passed_mid_point = True
            print line
            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5]
            # print p0
            print p2
            P0.append(p0)
            P2.append(p2)
            print file.next()
            print file.next()
            print file.next()
            print file.next()
        if re.match(end_pattern, line):
            passed_mid_point = False
        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
            # print 'terms[1] =', terms[1]
            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)
                x = terms[4]
                print 'x =', x
                Xs.append(x)
                y = terms[5]
                print 'y = ', y
                Ys.append(y)
                z = terms[6]
                print 'z = ', z
                Zs.append(z)
print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

The result is the following:

P0 =  ['5.02162261', '4.97568007', '4.98494429']

This is wrong, because 5.02162261 does not come from a FINAL OPTIMIZED GEOMETRY block (see the file).

The coordinates are wrong as well. The result I want would contain only the values that come from the FINAL OPTIMIZED GEOMETRY blocks.

I would be very grateful if you could help me.

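2 Answers:

Answer 0 (score: 1)

I wrote a simplified version of your script that seems to work. I hope it can serve as a starting point for your final script: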

#!/usr/bin/env python
# -*- coding: utf-8 -*-

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL' in line:
            if not final_optimized_geometry:
                continue
            volume = line.split()[7]
            VOLUMES.append(volume)
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            line = gout.readline()
            p0, p2 = line.split()[0:3:2]

            P0.append(p0)
            P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            gout.readline()
            while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line:
                line = gout.readline()
                atomdata = line.split()
                if not atomdata or atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)
            final_optimized_geometry = False


print(VOLUMES)
print(P0)
print(P2)
print(ATOMIC_NUMBERS)
print(Xs)
print(Ys)
print(Zs)

This produces the following output:

['119.823364', '121.143469']
['4.97568007', '4.98494429']
['16.76591397', '16.88768068']
['20', '6', '8', '20', '6', '8']
['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']

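In fact, it is a very simple finite state machine with only two states. Caveat: it will not work if there are multiple crystallographic cells within one final optimized geometry. In that case, it only captures the information of the first cell.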
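To make the two states explicit, here is a minimal sketch of the same idea. It is a simplified variant, not the script above: it only collects the A and C parameters, it resets the state right after reading them, and the names parse_final_cells, OUTSIDE and INSIDE are assumptions introduced for the illustration.

```python
def parse_final_cells(lines):
    """Collect (A, C) of the crystallographic cell in each FINAL OPTIMIZED GEOMETRY block."""
    OUTSIDE, INSIDE = 'outside', 'inside'   # the two states of the machine
    state = OUTSIDE
    cells = []
    it = iter(lines)
    for line in it:
        if state == OUTSIDE:
            # Ignore everything until a final-geometry header shows up.
            if 'FINAL OPTIMIZED GEOMETRY' in line:
                state = INSIDE
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            next(it, '')                     # skip the "A B C ALPHA BETA GAMMA" header
            fields = next(it, '').split()    # the line holding the six cell parameters
            if len(fields) >= 3:
                cells.append((fields[0], fields[2]))
            state = OUTSIDE                  # done with this block
    return cells


sample = [
    ' CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)',
    '         A              B              C           ALPHA      BETA       GAMMA',
    '     5.02162261     5.02162261    16.86554607    90.000000  90.000000 120.000000',
    ' FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3',
    ' PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3',
    ' CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)',
    '         A              B              C           ALPHA      BETA       GAMMA',
    '     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000',
]
cells = parse_final_cells(sample)
print(cells)
```

The cell of the initial geometry is skipped because the machine is still in the OUTSIDE state when that line is read.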

The code also makes other assumptions about the file, which of course may need to be verified.

I avoided using regular expressions.
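For comparison only, the same "between two markers" extraction could be sketched with regular expressions. This is an alternative technique, not what the script above does; it assumes the whole file fits in memory as one string, and the names BLOCK, CELL and final_cell_volumes are made up for the example.

```python
import re

# Non-greedy match from a final-geometry header to the first end marker after it.
BLOCK = re.compile(
    r'FINAL OPTIMIZED GEOMETRY.*?T = ATOM BELONGING TO THE ASYMMETRIC UNIT',
    re.DOTALL,
)
# Volume printed on the crystallographic cell line.
CELL = re.compile(r'CRYSTALLOGRAPHIC CELL \(VOLUME=\s*([0-9.]+)\)')


def final_cell_volumes(text):
    """Return the crystallographic cell volume found in each final-geometry block."""
    volumes = []
    for block in BLOCK.findall(text):
        match = CELL.search(block)
        if match:
            volumes.append(match.group(1))
    return volumes


text = '\n'.join([
    ' CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)',
    ' T = ATOM BELONGING TO THE ASYMMETRIC UNIT',
    ' FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3',
    ' CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)',
    ' T = ATOM BELONGING TO THE ASYMMETRIC UNIT',
])
volumes = final_cell_volumes(text)
print(volumes)
```

The line-by-line approach keeps memory use flat for large output files, which is one reason to prefer it here.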

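This code only runs in Python 3 (tested against Python 3.6.2). Python 2.7 complains about the use of readline() inside the file iteration block (which makes some sense, but it is nice to see that Python 3 can handle it). We use readline() as a small hack to skip lines in the input file that we know must be skipped, without going through the whole loop again (which would require more flag variables).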
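One possible way to express the same line-skipping trick without readline() is the built-in next(), which advances the same iterator that the for loop is using. The sketch below assumes an in-memory io.StringIO object standing in for open('g.out'); skip_lines is a hypothetical helper, not part of the script above.

```python
import io


def skip_lines(iterator, count):
    """Advance the iterator by up to `count` items, stopping quietly at the end."""
    for _ in range(count):
        if next(iterator, None) is None:
            break


text = (
    ' COORDINATES IN THE CRYSTALLOGRAPHIC CELL\n'
    '     ATOM                 X/A                 Y/B                 Z/C\n'
    ' *******************************************************************************\n'
    '      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00\n'
)

first_atom = None
with io.StringIO(text) as handle:          # stands in for open('g.out')
    for line in handle:
        if 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            skip_lines(handle, 2)          # column header and the row of asterisks
            first_atom = next(handle, '').split()
            break
print(first_atom)
```

Passing a default to next() avoids a StopIteration if the file ends earlier than expected.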

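By the way, if your only task is parsing text files, it might be interesting to look at dedicated languages such as Lex. Perl was also designed for this kind of work, more so than Python.

Hope this helps!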

Answer 1 (score: 0)

Thanks to all the help from @Bart Van Loon, a simpler version of the code would be:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

filename = 'g.out'

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open(filename) as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL - CENTRING CODE' in line:
            if final_optimized_geometry:
                volume = line.split()
                print volume
                print volume[7]
                volume = line.split()[7]
                VOLUMES.append(volume)

        elif ' CRYSTALLOGRAPHIC CELL (V' in line:
            if final_optimized_geometry:
                print 'gout.next() =', gout.next()
                done = gout.next()
                print 'done =', done
                p0 = done.split()[0]
                p2 = done.split()[2]

#               p0, p2 = done.split()[0:3:2]

                P0.append(p0)
                P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if final_optimized_geometry:
                gout.next()
                gout.next()
                while True:
                    line = gout.next()
                    atomdata = line.split()
                    if not atomdata:
                        break
                    if atomdata[1] != 'T':
                        continue
                    atomicnumber = atomdata[2]
                    x, y, z = atomdata[4:7]
                    ATOMIC_NUMBERS.append(atomicnumber)
                    Xs.append(x)
                    Ys.append(y)
                    Zs.append(z)
                final_optimized_geometry = False



print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

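where:

1) Because the line after the last atom (the 10th atom in this example) is a blank line, the check

                    if not atomdata:
                        break

always stops the loop once atomdata is empty. In other words, the loop stops at the blank line, which is where the list of atoms ends. This makes the while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line: statement unnecessary.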

A similar statement would be:

                    if  atomdata:   
                        continue

However, for some reason I do not understand, this does not make the loop treat the non-blank lines as the only ones to be analysed. Why?
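A likely explanation, assuming the check was dropped into the loop in place of if not atomdata: break, is that continue jumps to the next iteration for every non-empty line, so no atom line is ever parsed, and the first blank line then reaches atomdata[1] with an empty list. The sketch below reproduces both variants on a few lines from the file; the function names are made up for the illustration.

```python
rows = [
    '      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00',
    '      2 F  20 CA    0.000000000000E+00  0.000000000000E+00 -5.000000000000E-01',
    '',                                    # blank line that ends the atom list
    ' T = ATOM BELONGING TO THE ASYMMETRIC UNIT',
]


def collect_with_break(lines):
    """Stop at the first blank line; keep the atomic number of atoms flagged 'T'."""
    kept = []
    for line in lines:
        atomdata = line.split()
        if not atomdata:
            break
        if atomdata[1] == 'T':
            kept.append(atomdata[2])
    return kept


def collect_with_continue(lines):
    """Variant with `if atomdata: continue`: every non-empty line is skipped."""
    kept = []
    for line in lines:
        atomdata = line.split()
        if atomdata:
            continue                       # all atom lines are skipped here
        if atomdata[1] == 'T':             # only reached when atomdata == []
            kept.append(atomdata[2])
    return kept


print(collect_with_break(rows))
try:
    collect_with_continue(rows)
except IndexError as error:
    print('IndexError:', error)
```

In short, break and continue are not interchangeable here: one ends the loop at the blank line, the other silently discards the very lines that should be analysed.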

2) This part of the code:

                if atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)

can also be written as:

              if atomdata[1] == 'T':
                  atomicnumber = atomdata[2]
                  x, y, z = atomdata[4:7]
                  ATOMIC_NUMBERS.append(atomicnumber)
                  Xs.append(x)
                  Ys.append(y)
                  Zs.append(z)