Regular expressions and text files

Time: 2017-08-03 20:05:19

Tags: regex python-3.x

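I have the following code to extract fields from the data string listed in the code. I am trying to get Total Amt. : $179720, but the match does not stop at the number: it keeps going from Total Amt. through the rest of the file, abstract included. Is it possible to extract just 179720 from this?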

Currently I can only extract the line below, and I cannot get the number out of it:

  

['Total Amt.  : $179720             (Estimated)\nInvestigator: Stephen R. Palumbi   (Principal Investigator']

############################################################
pat_Amt=re.compile('Total Amt.*Investigator',re.M|re.DOTALL)
#print(pat_Amt)
Amt_term=pat_Amt.findall(data)
print(Amt_term)


Amt=int(filter(str.isdigit, Amt_term))
### Converting list to string
Amt=''.join(Amt)
############################################################

Full code

import re
import pprint
import os

from collections import defaultdict
from pprint      import pprint


data = """Title       : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales:
               Mitochondrial DNA and Historical Demography
Type        : Award
NSF Org     : DEB 
Latest
Amendment
Date        : August 1,  1991     
File        : a9000006
Award Number: 9000006
Award Instr.: Continuing grant                             
Prgm Manager: Scott Collins                           
          DEB  DIVISION OF ENVIRONMENTAL BIOLOGY       
          BIO  DIRECT FOR BIOLOGICAL SCIENCES          
Start Date  : June 1,  1990       
Expires     : November 30,  1992   (Estimated)
Expected
Total Amt.  : $179720             (Estimated)
Investigator: Stephen R. Palumbi   (Principal Investigator current)
Sponsor     : U of Hawaii Manoa
          2530 Dole Street
          Honolulu, HI  968222225    808/956-7800

NSF Program : 1127      SYSTEMATIC & POPULATION BIOLO
Fld Applictn: 0000099   Other Applications NEC                  
              61        Life Science Biological                 
Program Ref : 9285,
Abstract    :

              Commercial exploitation over the past two hundred years drove                  
              the great Mysticete whales to near extinction.  Variation in                   
              the sizes of populations prior to exploitation, minimal                        
              population size during exploitation and current population                     
              sizes permit analyses of the effects of differing levels of                    
              exploitation on species with different biogeographical                         
              distributions and life-history characteristics.  Dr. Stephen                   
              Palumbi at the University of Hawaii will study the genetic                     
              population structure of three whale species in this context,                   
              the Humpback Whale, the Gray Whale and the Bowhead Whale.  The                 
              effect of demographic history will be determined by comparing                  
              the genetic structure of the three species.  Additional studies                
              will be carried out on the Humpback Whale.  The humpback has a                 
              world-wide distribution, but the Atlantic and Pacific                          
              populations of the northern hemisphere appear to be discrete                   
              populations, as is the population of the southern hemispheric                  
              oceans.  Each of these oceanic populations may be further                      
              subdivided into smaller isolates, each with its own migratory                  
              pattern and somewhat distinct gene pool.  This study will                      
              provide information on the level of genetic isolation among                    
              populations and the levels of gene flow and genealogical                       
              relationships among populations.  This detailed genetic                        
              information will facilitate international policy decisions                     
              regarding the conservation and management of these magnificent                 
              mammals."""

year_list=[]
Total_Amt_list=[]
abstract_list=[]


for line in data:
    #print(line)

    #pat_file=re.compile('File.*',re.M|re.DOTALL)
    #file=pat_file.findall(data)
    #file=''.join(file)
    #print(file)
    #print(type(file))




    pat_abstract=re.compile('Abstract.*',re.M|re.DOTALL)
    abstract=pat_abstract.findall(data)
    abstract=''.join(abstract)
    #print(abstract)
    #print(type(abstract))


    pat_year=re.compile('Start Date.*Expires',re.M|re.DOTALL)
    year_term=pat_year.findall(data)

    ### Converting list to string
    year_term=''.join(year_term)

    ### Finding the start year. The result of the findall is a list
    year=re.findall('[1-2][0-9][0-9][0-9]',year_term)

    ###converting list to integer
    for item in year:
        year=int(item)

    #print(type(year))
    #print(year)

    # Creating lists for filename, year and abstract    
    # filename is saved for reference
    #file_list.append(file)
    #print(file_list)
    year_list.append(year)
    #print(year)
    abstract_list.append(abstract)
    #print(abstract_list)

2 answers:

Answer 0 (score: 2)

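If you write the right regex pattern for each item you want, you do not need to loop over every line; each pattern can be run once against the whole string.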

For example, to get the value 179720 from the line:

Total Amt.  : $179720             (Estimated)

you can do this:

# Assumes `re` is imported and `data` is the string from the question.
# First look for the matching pattern (Total Amt.  : $xxx), where xxx is a run of digits
total_amt_line = re.findall(r'Total Amt\.\s+: \$[0-9]*', data)

print(total_amt_line)
# prints: ['Total Amt.  : $179720']

# If any match was found, extract the digits from it
if total_amt_line:
    total_amt = re.findall(r'[0-9]+', total_amt_line[0])
    print(total_amt)
    # prints: ['179720']
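
The two findall calls above can also be collapsed into a single search with a capture group, so the digits come back directly. This is only a minimal sketch, assuming the record layout shown in the question; the `sample` string and the helper name `extract_total_amt` are made up for illustration.

```python
import re

# Hypothetical excerpt in the same layout as the question's data
sample = """Expires     : November 30,  1992   (Estimated)
Expected
Total Amt.  : $179720             (Estimated)
Investigator: Stephen R. Palumbi   (Principal Investigator current)
"""

# The capture group (\d+) keeps only the digits after the dollar sign
TOTAL_AMT_RE = re.compile(r'Total Amt\.\s*:\s*\$(\d+)')


def extract_total_amt(text):
    """Return the Total Amt. value as an int, or None if the field is absent."""
    match = TOTAL_AMT_RE.search(text)
    return int(match.group(1)) if match else None


print(extract_total_amt(sample))
```

Because `\d+` stops at the first non-digit, the match cannot run on into the following lines the way a greedy `.*` combined with `re.DOTALL` does.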

Answer 1 (score: 1)

The pattern you are looking for is: r"\$(\d+(?:\.\d+)?)"
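
As a rough sketch of how this pattern might be applied: `re.search` finds the first dollar amount in the text, and group 1 holds the number without the `$`. The pattern is not anchored to the `Total Amt.` label, so this assumes the first `$` in the record is the one you want. The `sample` line and the variable names are illustrative only.

```python
import re

# Illustrative line in the same layout as the question's data
sample = "Total Amt.  : $179720             (Estimated)"

# A literal $, then digits with an optional decimal part; group 1 is the number
amount_re = re.compile(r"\$(\d+(?:\.\d+)?)")

match = amount_re.search(sample)
# Convert to float because the pattern also allows values such as 12.50
amount = float(match.group(1)) if match else None
print(amount)
```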