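I have the following code to extract fields from the data string shown in the code below. I am trying to get Total Amt. : $179720, but the match does not stop at the number: it pulls in the rest of the file from Total Amt. to the end, abstract included. Is it possible to extract just 179720?
At the moment it only extracts the line below, and I cannot get the number out of it:
['Total Amt. : $179720 (Estimated)\nInvestigator: Stephen R. Palumbi (Principal Investigator']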
############################################################
pat_Amt=re.compile('Total Amt.*Investigator',re.M|re.DOTALL)
#print(pat_Amt)
Amt_term=pat_Amt.findall(data)
print(Amt_term)
Amt=int(filter(str.isdigit, Amt_term))
### Converting list to string
Amt=''.join(Amt)
############################################################
Full code
import re
import pprint
import os
from collections import defaultdict
from pprint import pprint
data = """Title : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales:
Mitochondrial DNA and Historical Demography
Type : Award
NSF Org : DEB
Latest
Amendment
Date : August 1, 1991
File : a9000006
Award Number: 9000006
Award Instr.: Continuing grant
Prgm Manager: Scott Collins
DEB DIVISION OF ENVIRONMENTAL BIOLOGY
BIO DIRECT FOR BIOLOGICAL SCIENCES
Start Date : June 1, 1990
Expires : November 30, 1992 (Estimated)
Expected
Total Amt. : $179720 (Estimated)
Investigator: Stephen R. Palumbi (Principal Investigator current)
Sponsor : U of Hawaii Manoa
2530 Dole Street
Honolulu, HI 968222225 808/956-7800
NSF Program : 1127 SYSTEMATIC & POPULATION BIOLO
Fld Applictn: 0000099 Other Applications NEC
61 Life Science Biological
Program Ref : 9285,
Abstract :
Commercial exploitation over the past two hundred years drove
the great Mysticete whales to near extinction. Variation in
the sizes of populations prior to exploitation, minimal
population size during exploitation and current population
sizes permit analyses of the effects of differing levels of
exploitation on species with different biogeographical
distributions and life-history characteristics. Dr. Stephen
Palumbi at the University of Hawaii will study the genetic
population structure of three whale species in this context,
the Humpback Whale, the Gray Whale and the Bowhead Whale. The
effect of demographic history will be determined by comparing
the genetic structure of the three species. Additional studies
will be carried out on the Humpback Whale. The humpback has a
world-wide distribution, but the Atlantic and Pacific
populations of the northern hemisphere appear to be discrete
populations, as is the population of the southern hemispheric
oceans. Each of these oceanic populations may be further
subdivided into smaller isolates, each with its own migratory
pattern and somewhat distinct gene pool. This study will
provide information on the level of genetic isolation among
populations and the levels of gene flow and genealogical
relationships among populations. This detailed genetic
information will facilitate international policy decisions
regarding the conservation and management of these magnificent
mammals."""
year_list=[]
Total_Amt_list=[]
abstract_list=[]
for line in data:
    #print(line)
    #pat_file=re.compile('File.*',re.M|re.DOTALL)
    #file=pat_file.findall(data)
    #file=''.join(file)
    #print(file)
    #print(type(file))
    pat_abstract=re.compile('Abstract.*',re.M|re.DOTALL)
    abstract=pat_abstract.findall(data)
    abstract=''.join(abstract)
    #print(abstract)
    #print(type(abstract))
    pat_year=re.compile('Start Date.*Expires',re.M|re.DOTALL)
    year_term=pat_year.findall(data)
    ### Converting list to string
    year_term=''.join(year_term)
    ### Finding the start year. The result of the findall is a list
    year=re.findall('[1-2][0-9][0-9][0-9]',year_term)
    ### Converting list to integer
    for item in year:
        year=int(item)
    #print(type(year))
    #print(year)
    # Creating lists for filename, year and abstract
    # filename is saved for reference
    #file_list.append(file)
    #print(file_list)
    year_list.append(year)
    #print(year)
    abstract_list.append(abstract)
    #print(abstract_list)
Answer 0 (score: 2)
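If you supply the right regular expression pattern for each item you want, you don't have to loop over every line.
E.g. to get the number out of a line like:
Total Amt. : $179720123 (Estimated)
you can do this: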
# first look for the matching pattern (Total Amt. : $xxx), where xxx is a run of digits
total_amt_line = re.findall(r'Total Amt\.\s+: \$[0-9]*', data)
print(total_amt_line)
# prints: ['Total Amt. : $179720']
# if any match was found, extract the digits from it
if len(total_amt_line) > 0:
    total_amt = re.findall(r'[0-9]+', total_amt_line[0])
    print(total_amt)
    # prints: ['179720']
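The two steps above can also be folded into one search with a capture group. This is only a minimal sketch: `sample` is a shortened stand-in for the question's `data` string (assumed to have the same field layout), and the helper name `extract_total_amt` is made up for illustration.

```python
import re

# Shortened stand-in for the question's `data` string (assumption: same layout).
sample = """Expires : November 30, 1992 (Estimated)
Expected
Total Amt. : $179720 (Estimated)
Investigator: Stephen R. Palumbi (Principal Investigator current)"""

# Capture only the digits that follow "Total Amt. : $".
# \s* tolerates a variable amount of spacing around the colon.
TOTAL_AMT_RE = re.compile(r"Total Amt\.\s*:\s*\$(\d+)")


def extract_total_amt(text):
    """Return the total amount as an int, or None if the field is absent."""
    match = TOTAL_AMT_RE.search(text)
    return int(match.group(1)) if match else None


print(extract_total_amt(sample))  # prints 179720
```

The question's pattern 'Total Amt.*Investigator' over-matches because, with re.DOTALL, the greedy `.*` also crosses newlines and runs to the last "Investigator" in the text. Anchoring on the `$` and capturing `\d+` stops at the end of the number instead.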
Answer 1 (score: 1)
The pattern you are looking for is: r"\$(\d+(?:\.\d+)?)".
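As a rough illustration of how that pattern might be applied, here is a small sketch. The variable names `line`, `DOLLAR_RE`, and `amount` are assumptions for the example, and converting to `float` is one possible choice since the pattern also allows a decimal part.

```python
import re

# One line from the record in the question.
line = "Total Amt. : $179720 (Estimated)"

# A dollar sign followed by digits, with an optional decimal part.
DOLLAR_RE = re.compile(r"\$(\d+(?:\.\d+)?)")

match = DOLLAR_RE.search(line)
# group(1) holds the number without the leading "$".
amount = float(match.group(1)) if match else None

print(amount)  # prints 179720.0
```

Note that this pattern matches the first dollar amount anywhere in the text it is given, so it works on the whole record only while "Total Amt." is the sole field containing a `$`.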