Question

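I have an output file containing thousands of lines of information. Every so often in the output file I find information of the following form: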

Input Orientation:
...
content
...
Distance matrix (angstroms):

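I now want to save content to a variable for later formatting. The other thing is that I am only interested in the last occurrence of this pattern in my file. I have a solution that does this with sed and awk, but it leaves me with multiple files just to do a single job. This should be doable in Python, but I don't know where to start reading and learning about it.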

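EDIT: I have been reading about regular expressions and, believe it or not, I have made some progress! I first read the file line by line, then reverse the list, and then join all the strings that make up that list. I now end up with one big multi-line string. Next I use the re module to build my regular expression r'Distance matrix(.*?)Input orientation', which I think means the following: my first pattern is "Distance matrix", then a subpattern that matches zero or more of any character, but lazily (it stops at the first match), and then my last pattern, "Input orientation".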

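import re

with open(inputfile, "r") as input_file:
    input_file_lines = input_file.readlines()

# Reverse the line order so the last block in the file comes first
reverse_lines = input_file_lines[::-1]
string = ''.join(reverse_lines)

# Lazy match: stops at the first 'Input orientation' after 'Distance matrix'
match = re.search('Distance matrix(.*?)Input orientation', string, re.DOTALL).group(1)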

Sample data file for testing:

Item               Value     Threshold  Converged?
             Maximum Force            0.005032     0.000450     NO
             RMS     Force            0.001066     0.000300     NO
             Maximum Displacement     0.027438     0.001800     NO
             RMS     Displacement     0.007282     0.001200     NO
             Predicted change in Energy=-8.909077D-05
             GradGradGradGradGradGradGradGradGradGradGradGradGradGradGradGradGradGrad

                                      Input orientation:
             ---------------------------------------------------------------------
             Center     Atomic      Atomic             Coordinates (Angstroms)
             Number     Number       Type             X           Y           Z
             ---------------------------------------------------------------------
                  1          6           0        Incorrect    Incorrect    Incorrect
                  2          1           0        Incorrect    Incorrect    Incorrect
                  3          1           0        Incorrect    Incorrect    Incorrect
                  4          1           0        Incorrect    Incorrect    Incorrect
                  5         17           0        Incorrect    Incorrect    Incorrect
                  6          9           0        Incorrect    Incorrect    Incorrect
             ---------------------------------------------------------------------
                                Distance matrix (angstroms):
                                1          2          3          4          5
                 1  C    0.000000
                 2  H    1.080163   0.000000
                 3  H    1.080326   1.809416   0.000000
                 4  H    1.080621   1.810236   1.810685   0.000000
                 5  Cl   1.962171   2.470702   2.468769   2.465270   0.000000
                 6  F    2.390537   2.343910   2.357275   2.380515   4.352568
                                6
                 6  F    0.000000

                                          Input orientation:
                 ---------------------------------------------------------------------
                 Center     Atomic      Atomic             Coordinates (Angstroms)
                 Number     Number       Type             X           Y           Z
                 ---------------------------------------------------------------------
                      1          6           0        Correct    Correct     Correct
                      2          1           0        Correct    Correct     Correct
                      3          1           0        Correct    Correct     Correct
                      4          1           0        Correct    Correct     Correct
                      5         17           0        Correct    Correct     Correct
                      6          9           0        Correct    Correct     Correct
                 ---------------------------------------------------------------------
                                    Distance matrix (angstroms):
                                    1          2          3          4          5
                     1  C    0.000000
                     2  H    1.080516   0.000000
                     3  H    1.080587   1.801890   0.000000
                     4  H    1.080473   1.801427   1.801478   0.000000
                     5  Cl   1.936014   2.458132   2.459437   2.460630   0.000000
                     6  F    2.414588   2.368281   2.365651   2.355690   4.350586

Answer 1

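No regular expressions are needed here. All you need is good indexing. Python strings have index and rindex methods, which take a substring, find it in the text, and return the index of the substring's first character. Reading this doc should get you familiar with slicing strings. The program would look like this: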

with open(input_file) as f:
    s = f.read()  # reads the file as one big string
last_block = s[s.rindex('Input'):s.rindex('Distance')]

The last line of that code finds the first occurrence of 'Input' starting from the end of the file, because we used rindex, and records that position as an integer. It then does the same for 'Distance'. Finally, it uses those two integers to return only the part of the string that lies between them. For your example file, it would return:

Input orientation:
                 ---------------------------------------------------------------------
                 Center     Atomic      Atomic             Coordinates (Angstroms)
                 Number     Number       Type             X           Y           Z
                 ---------------------------------------------------------------------
                      1          6           0        Correct    Correct     Correct
                      2          1           0        Correct    Correct     Correct
                      3          1           0        Correct    Correct     Correct
                      4          1           0        Correct    Correct     Correct
                      5         17           0        Correct    Correct     Correct
                      6          9           0        Correct    Correct     Correct
                 ---------------------------------------------------------------------

If you don't want the 'Input orientation' header, simply add to the result of rindex('Input') until you get the desired result. For example, that might look like s[s.rindex('Input') + 19:s.rindex('Distance')].
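Instead of hard-coding an offset such as 19, the offset can be derived from the header text itself. The following is only a minimal sketch of that idea: the helper name last_orientation_block and the small sample string are made up for illustration, and it assumes the markers are spelled exactly 'Input orientation:' and 'Distance matrix'.

```python
def last_orientation_block(text,
                           start_marker="Input orientation:",
                           end_marker="Distance matrix"):
    """Return the text between the last start_marker and the last end_marker.

    Hypothetical helper for illustration; raises ValueError if a marker
    is missing, because it relies on str.rindex.
    """
    # Skip past the header itself by adding its length
    start = text.rindex(start_marker) + len(start_marker)
    end = text.rindex(end_marker)
    return text[start:end]


# Tiny made-up stand-in for the real output file
sample = (
    "Input orientation:\n"
    " 1 6 0 Incorrect\n"
    "Distance matrix (angstroms):\n"
    "Input orientation:\n"
    " 1 6 0 Correct\n"
    " 2 1 0 Correct\n"
    "Distance matrix (angstroms):\n"
)

block = last_orientation_block(sample)

# Keep only non-empty lines and split each into whitespace-separated fields
rows = [line.split() for line in block.splitlines() if line.strip()]
print(rows)
```

On a real output file the block would still contain the dashed separator lines and the column headers, so those would need to be filtered out before the rows are converted to numbers.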

It is also important to note that index and rindex raise an error (a ValueError) if the substring is not found. If that is not what you want, you can use find and rfind instead.
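As a rough illustration of the non-raising variant, here is a minimal sketch built on rfind, which returns -1 when the substring is absent. The helper name last_block_or_none is hypothetical, and returning None for a missing marker is just one possible convention.

```python
def last_block_or_none(text, start_marker="Input", end_marker="Distance"):
    """Return the text from the last start_marker up to the last end_marker.

    Hypothetical helper for illustration. Returns None if either marker
    is missing or if the end marker comes before the start marker.
    """
    start = text.rfind(start_marker)  # -1 when not found
    end = text.rfind(end_marker)      # -1 when not found
    if start == -1 or end == -1 or end < start:
        return None
    return text[start:end]


print(last_block_or_none("Input orientation: ... Distance matrix"))
print(last_block_or_none("no markers here"))
```

The explicit check against -1 matters because -1 is itself a valid slice index in Python, so using the result of rfind directly in a slice would silently return the wrong text instead of failing.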

Matching a pattern and working with it in Python
