How to extract specific information from many text files

Date: 2018-03-23 14:03:56

Tags: ruby

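I have more than 200 files, all of them txt files; one of them is shown below. I want to read them one by one, extract specific information from each, and export it to an xls file.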

For example, how can I get the following information into the xls file?

     TOTAL ENERGY            =       -444.38126 EV
      ELECTRONIC ENERGY       =       -840.31531 EV
      CORE-CORE REPULSION     =        395.93406 EV
      GRADIENT NORM           =          0.91931 = 0.45965 PER ATOM
      DIPOLE                  =          2.66600 DEBYE    POINT GROUP:       C2v 
      NO. OF FILLED LEVELS    =          6
      IONIZATION POTENTIAL    =         10.352991 EV
      HOMO LUMO ENERGIES (EV) =        -10.353  0.402
      MOLECULAR WEIGHT        =         30.0262
      COSMO AREA              =         60.70 SQUARE ANGSTROMS
      COSMO VOLUME            =         42.52 CUBIC ANGSTROMS

I have read several posts that suggest using

sed -n ".." file.txt

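The problem is that even with this approach it would take me a very long time, because I would have to read one file at a time in bash and then search for each keyword, such as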

          HEAT OF FORMATION 
          TOTAL ENERGY   
          ELECTRONIC ENERGY     
          CORE-CORE REPULSION  
          GRADIENT NORM        
          DIPOLE               
          NO. OF FILLED LEVELS   
          IONIZATION POTENTIAL   
          HOMO LUMO ENERGIES (EV) 
          MOLECULAR WEIGHT        
          COSMO AREA              
          COSMO VOLUME            

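and then paste each of these lines, together with its corresponding values, into the xls file one by one. One of the files looks like this: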

                     SUMMARY OF PM7 CALCULATION, Site No: 29451

                                                       MOPAC2016 (Version: 18.063M)
                                                       Tue Mar 20 15:08:13 2018
                                                       No. of days remaining = 349

           Empirical Formula: C H2 O  =     4 atoms

 SYMMETRY
 Formaldehyde



     GEOMETRY OPTIMISED USING EIGENVECTOR FOLLOWING (EF).     
     SCF FIELD WAS ACHIEVED                                   

          HEAT OF FORMATION       =        -25.54241 KCAL/MOL =    -106.86944 KJ/MOL
          TOTAL ENERGY            =       -444.38126 EV
          ELECTRONIC ENERGY       =       -840.31531 EV
          CORE-CORE REPULSION     =        395.93406 EV
          GRADIENT NORM           =          0.91931 = 0.45965 PER ATOM
          DIPOLE                  =          2.66600 DEBYE    POINT GROUP:       C2v 
          NO. OF FILLED LEVELS    =          6
          IONIZATION POTENTIAL    =         10.352991 EV
          HOMO LUMO ENERGIES (EV) =        -10.353  0.402
          MOLECULAR WEIGHT        =         30.0262
          COSMO AREA              =         60.70 SQUARE ANGSTROMS
          COSMO VOLUME            =         42.52 CUBIC ANGSTROMS

          MOLECULAR DIMENSIONS (Angstroms)

            Atom       Atom       Distance
            H     3    O     1     2.00299
            H     4    O     1     1.65067
            H     4    C     2     0.00000
          SCF CALCULATIONS        =          4
          WALL-CLOCK TIME         =          0.309 SECONDS
          COMPUTATION TIME        =          0.033 SECONDS


          FINAL GEOMETRY OBTAINED
 SYMMETRY
 Formaldehyde

  O     0.00000000 +0    0.0000000 +0    0.0000000 +0     0     0     0
  C     1.20614565 +1    0.0000000 +0    0.0000000 +0     1     0     0
  H     1.09115836 +1  121.2760970 +1    0.0000000 +0     2     1     0
  H     1.09115836 +0  121.2760970 +0  180.0000000 +0     2     1     3

   3  1    4
   3  2    4

I want to export the data to a csv file, with the values from each file listed one below the other, like this:

data1
-444.38126 EV
-840.31531 EV
395.93406 EV
0.91931 = 0.45965 PER ATOM
    2.66600 
    C2v 
    6
      10.352991
   -10.353  0.402
   30.0262
   60.70  
  42.52 
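Getting from the raw lines to the shorter values listed above means dropping some units and splitting the DIPOLE line into the dipole and the point group. A rough sketch of that step follows; the helper names split_dipole and strip_unit are made up for illustration, the unit list is taken only from the sample file, and the input is assumed to be the text after the first "=" with surrounding whitespace already stripped.

```ruby
# Hypothetical helpers, not part of any library.

# Units seen in the sample file; extend this list if other files use more.
UNITS = ['EV', 'DEBYE', 'SQUARE ANGSTROMS', 'CUBIC ANGSTROMS'].freeze

# Splits a DIPOLE value such as "2.66600 DEBYE    POINT GROUP:       C2v"
# into [dipole, point_group]. Returns [value, nil] if the pattern is absent.
def split_dipole(value)
  match = value.match(/\A(\S+)\s+DEBYE\s+POINT GROUP:\s+(\S+)/)
  match ? [match[1], match[2]] : [value, nil]
end

# Removes one trailing unit from a value, leaving other text untouched.
def strip_unit(value)
  value.sub(/\s+(?:#{Regexp.union(UNITS).source})\z/, '')
end
```

Whether to keep a unit such as EV is a matter of choice; the list above keeps it for some rows, so strip_unit would only be applied to the rows where the unit is not wanted.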

I know how to read each file line by line. Let's assume the output file is output.txt

line_num=0
text=File.open('output.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  print "#{line_num += 1} #{line}"
end

So it can read the file line by line. Now I am trying to extract this information:

# Labels whose values should be printed.
keys = [
  'TOTAL ENERGY', 'ELECTRONIC ENERGY', 'CORE-CORE REPULSION', 'GRADIENT NORM',
  'DIPOLE', 'NO. OF FILLED LEVELS', 'IONIZATION POTENTIAL',
  'HOMO LUMO ENERGIES (EV)', 'MOLECULAR WEIGHT', 'COSMO AREA', 'COSMO VOLUME'
]
text = File.read('output.txt')
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  # Split "LABEL = value" at the first "=" only, so values that contain "=" stay intact.
  label, value = line.split('=', 2)
  # Skip lines without "=" and lines whose label is not in the list.
  next unless value && keys.include?(label.strip)
  puts value.strip
end

1 answer:

Answer 0 (score: 0)

Does it have to be Ruby? How about reading the files with bash and then formatting the result in Excel?

For example:

for filename in *.txt; do
    awk '{print FILENAME ":" $0}' "$filename" | grep '[A-Z]\{3,\}.*=' >> r.csv
done

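This creates the file r.csv, which you can open in Excel and split into columns with the menu Data -> Text to Columns.

You can use the "=" character as the column delimiter.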
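
If Ruby is preferred after all, the same idea can be sketched with Dir.glob and the standard csv library. This is only a sketch under assumptions: the labels in KEYS are copied from the sample file in the question, the names extract_values and write_summary and the output name summary.csv are made up for illustration, and when a label appears more than once in a file only the first occurrence is kept.

```ruby
require 'csv'

# Labels to extract, copied from the sample file in the question.
KEYS = [
  'TOTAL ENERGY', 'ELECTRONIC ENERGY', 'CORE-CORE REPULSION', 'GRADIENT NORM',
  'DIPOLE', 'NO. OF FILLED LEVELS', 'IONIZATION POTENTIAL',
  'HOMO LUMO ENERGIES (EV)', 'MOLECULAR WEIGHT', 'COSMO AREA', 'COSMO VOLUME'
].freeze

# Returns a hash of label => raw value text for the contents of one file.
def extract_values(text)
  values = {}
  text.gsub(/\r\n?/, "\n").each_line do |line|
    # Split at the first "=" only, so values that contain "=" stay intact.
    label, value = line.split('=', 2)
    next if value.nil?
    label = label.strip
    # Keep the first occurrence of each wanted label.
    values[label] = value.strip if KEYS.include?(label) && !values.key?(label)
  end
  values
end

# Writes one CSV column per input file: the first row holds the file names,
# the following rows hold the values in KEYS order (empty if a label is missing).
def write_summary(paths, out_path)
  columns = paths.map do |path|
    [File.basename(path)] + extract_values(File.read(path)).values_at(*KEYS)
  end
  CSV.open(out_path, 'w') do |csv|
    columns.transpose.each { |row| csv << row }
  end
end
```

Calling write_summary(Dir.glob('*.txt').sort, 'summary.csv') in the directory that holds the files would then produce one column per file, which Excel can open directly.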