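I am trying to read data from a text corpus of scientific paper abstracts (available here). I have posted a sample file below, which I read in with:

    with open(filePath, "r") as f:
        data = f.readlines()
    for i, x in enumerate(data):
        print(i, x)

I want to extract only the category name on line 25 and the text of the abstract, so for the sample below the result would be ("Commercial exploitation over the...", "Life Science Biological"). I cannot assume that the category name and the abstract always appear at these specific line numbers. The abstract always starts 2 lines after the "Abstract" line and runs to the end of the file.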
0 Title : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales:
1 Mitochondrial DNA and Historical Demography
2 Type : Award
3 NSF Org : DEB
4 Latest
5 Amendment
6 Date : August 1, 1991
7 File : a9000006
8
9 Award Number: 9000006
10 Award Instr.: Continuing grant
11 Prgm Manager: Scott Collins
12 DEB DIVISION OF ENVIRONMENTAL BIOLOGY
13 BIO DIRECT FOR BIOLOGICAL SCIENCES
14 Start Date : June 1, 1990
15 Expires : November 30, 1992 (Estimated)
16 Expected
17 Total Amt. : $179720 (Estimated)
18 Investigator: Stephen R. Palumbi (Principal Investigator current)
19 Sponsor : U of Hawaii Manoa
20 2530 Dole Street
21 Honolulu, HI 968222225 808/956-7800
22
23 NSF Program : 1127 SYSTEMATIC & POPULATION BIOLO
24 Fld Applictn: 0000099 Other Applications NEC
25 61 Life Science Biological
26 Program Ref : 9285,
27 Abstract :
28
29 Commercial exploitation over the past two hundred years drove
30 the great Mysticete whales to near extinction. Variation in
31 the sizes of populations prior to exploitation, minimal
32 population size during exploitation and current population
33 sizes permit analyses of the effects of differing levels of
34 exploitation on species with different biogeographical
35 distributions and life-history characteristics. Dr. Stephen
36 Palumbi at the University of Hawaii will study the genetic
37 population structure of three whale species in this context,
38 the Humpback Whale, the Gray Whale and the Bowhead Whale. The
39 effect of demographic history will be determined by comparing
40 the genetic structure of the three species. Additional studies
41 will be carried out on the Humpback Whale. The humpback has a
42 world-wide distribution, but the Atlantic and Pacific
43 populations of the northern hemisphere appear to be discrete
44 populations, as is the population of the southern hemispheric
45 oceans. Each of these oceanic populations may be further
46 subdivided into smaller isolates, each with its own migratory
47 pattern and somewhat distinct gene pool. This study will
48 provide information on the level of genetic isolation among
49 populations and the levels of gene flow and genealogical
50 relationships among populations. This detailed genetic
51 information will facilitate international policy decisions
52 regarding the conservation and management of these magnificent
53 mammals
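Update: the following code works for me, but is there a more efficient way?

    import re

    # The function name is illustrative; the original snippet showed only the body.
    def extract(filePath):
        with open(filePath, "r") as f:
            data = f.readlines()
        # Find the abstract and category
        abstract = re.compile("Abstract")
        for i, line in enumerate(data):
            if abstract.search(line):
                break
        # i is the line number of the "Abstract" identifier
        temp = "".join(data[i+1:])
        # Keep only alphabetic words (this drops digits and punctuation)
        abstractText = " ".join(re.findall('[A-Za-z]+', temp))
        category = " ".join(re.findall('[A-Za-z]+', data[i-2]))
        return abstractText, category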
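Answer 0 (score: 1)
What have you tried?
If the format is consistent, you can do this with a regular expression.
An example that could capture the abstract:

    import re
    abstract = re.compile(r"Abstract:([\s\w\d]*)", re.MULTILINE)

The code above assumes that nothing follows the abstract text, and that the body of the abstract is always introduced by "Abstract:".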
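As a rough sketch of how the regex idea might be adapted to the sample file in the question: the sample shows spaces between the label and the colon ("Abstract    :") and punctuation inside the abstract, so this version allows whitespace before the colon and captures everything up to the end of the file. The function name `extract_abstract_and_category` is made up for illustration, and the rule that the category sits two lines above the "Abstract" line is taken from the question's own code, not from any format specification.

```python
import re

# Assumption: the label may have spaces before the colon ("Abstract    :"),
# as in the sample file, and the abstract runs to the end of the file.
ABSTRACT_RE = re.compile(r"^Abstract\s*:\s*(.*)", re.MULTILINE | re.DOTALL)


def extract_abstract_and_category(text):
    """Return (abstract, category), or None if no "Abstract" label is found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    # Collapse the wrapped lines of the abstract into single-spaced text.
    abstract = " ".join(match.group(1).split())
    # Assumption from the question: the category is two lines above "Abstract".
    header_lines = text[:match.start()].splitlines()
    category_line = header_lines[-2] if len(header_lines) >= 2 else ""
    # Keep only alphabetic words, dropping the numeric code (e.g. "61").
    category = " ".join(re.findall(r"[A-Za-z]+", category_line))
    return abstract, category
```

To use it, read the whole file with `f.read()` instead of `f.readlines()` and pass the resulting string in. Unlike the code in the question, this keeps punctuation and digits in the abstract; if only alphabetic words are wanted, the same `re.findall` step used for the category could be applied to the abstract as well.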