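BeautifulSoup: extracting figures and captions from JATS XML

Time: 2017-06-04 14:14:54

Tags: python xml beautifulsoup

I want to get each image and its description from a JATS XML file. In my example I use http://journal.frontiersin.org/article/10.3389/fpls.2011.00008/xml/nlm

Figures are formatted as follows: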

<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Pathways of DSB misrepair...</p></caption>
<graphic xlink:href="fpls-02-00008-g001.tif"/>
</fig>

For each figure, I want the contents of <caption>...</caption> and of <graphic xlink:href="..."/>.

So my idea was to use BeautifulSoup's CSS selectors and strip the tags when printing:

#!/usr/bin/python

from bs4 import BeautifulSoup
import urllib.request

content = urllib.request.urlopen('file:///tmp/fpls-02-00008.xml').read()
soup = BeautifulSoup(content, 'xml')

##<fig><caption>XXX</caption></fig>
caption = soup.select("fig caption")

##<fig><graphic xlink:href="YYY"/></fig>
graphic = soup.select("fig graphic")

for a in caption:
    print(a.get_text().strip())

#print(b.get_text()) doesn't work
for b in graphic:
    print(b)

#separator = "|"
#print(separator.join([caption, graphic]))

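Getting only the captions or only the graphics works, but because of inconsistencies in the source XML I need to fetch both together. The result should not be

  • Caption A
  • Caption B
  • Graphic A
  • Graphic B

but rather

  • Caption A, Graphic A
  • Caption B, Graphic B

How can I achieve this? Thanks in advance!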

2 answers:

Answer 0: (score: 0)

You can use zip to loop over two lists at once:

>>> A = [1,2,3,4,5]
>>> B = ['A','B','C','D','E']
>>> for number,letter in zip(A,B):
...     print(number, letter)
... 
1 A
2 B
3 C
4 D
5 E
>>> 
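
One caveat, given the inconsistencies mentioned in the question: zip pairs items purely by position and stops at the shorter list, so a single figure without a caption shifts every later pair. The sketch below uses made-up lists (the captions and file names are hypothetical, not taken from the article) to show the effect:

```python
# Hypothetical data: figure B has a graphic but no caption,
# so the captions list is one item shorter than the graphics list.
captions = ["Caption A", "Caption C"]
graphics = ["a.tif", "b.tif", "c.tif"]

# zip pairs by position only and stops at the shorter list.
pairs = list(zip(captions, graphics))

for caption, graphic in pairs:
    print(caption, graphic, sep="|")
# Caption A|a.tif
# Caption C|b.tif   <- wrong: Caption C belongs to c.tif
```

So zip over two separately selected lists is only safe when every fig is known to contain exactly one caption and one graphic.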

Answer 1: (score: 0)

You can select the fig elements first, then select caption and graphic within the same loop:

fig = soup.select("fig")
for e in fig:
    print(e.select('caption')[0].get_text().strip())
    print(e.select('graphic')[0]['xlink:href'])

Output:

Pathways of DSB misrepair via single-strand annealing(SSA) or via synthesis-dependent strand annealing (SDSA). (A) Deletion via exonucleolytic 5′-end resection, SSA at complementary overhang sequences, resection of the non-aligned ends, and ligation of break-ends. (B) Insertion into a DSB by break-end invasion and elongationalong an ectopic and partially homologous (vertical bars) template.(C) Re-synthesis of break-ends after invasion into a homologous template double-strand without (gene conversion) or with exchange of flanking regions due to appropriate resolution of Holiday junctions (greenarrow heads).
fpls-02-00008-g001.tif
Schematic models of replication and chromosome labeling patterns after BIR at proximal DSB ends in S and G2. (A) BIR through conservative replication of a one ended DSB during S phase. The DSB appears when the replication fork arrives at a single-strand break (arrow head). Conservative replication occurs via recurrent strand invasion (or via unidirectional fork migration) without resolution of the Holiday junction(s) using the parental double strand as a template. The result after EdU incorporation is an asymmetrically unlabeled terminal chromatid region. (B) BIR during G2 phase, through conservative replication at the proximal end of a DSB (arrow head) via recurrent strand invasion and/or via unidirectional fork migration without resolution of the Holiday junction(s) using the undamaged sister double helix as a template. The result after EdU incorporation is an asymmetrically labeled terminal chromatid region. (C) BIR during G2 phase through semiconservative replication achieved by resolution of the Holiday junction (green arrow head) after invasion of the elongating break-end into the template double strand. The result after EdU incorporation is a symmetrically labeled distal chromatid region. Full lines unlabeled; broken lines labeled by EdU. The distal fragment of the broken double helix in (B,C) gets lost.
fpls-02-00008-g002.tif
Metaphase chromosomes of the field bean. (A) Chromatid-type aberrations after bleomycin treatment. Left cell: isochromatid break (arrow head), the centric, and the acentric chromatid fragments are surrounded by black dots, the homologous undamaged chromosome is surrounded by white dots. Middle cell: symmetric reciprocal chromatid translocation (arrow) and two terminal chromatid breaks (arrow heads). The latter with the broken fragment either switched to the opposite site of the undamaged sister chromatid (left) or being at least 90° apart from the other break-end as in case of the broken secondary constriction (right). Right cell: interstitial deletion (arrow), the deleted fragment remains attached to the undamaged sister chromatid, the chromosome involved is surrounded by black dots. (B) Interstitial asymmetric chromatid labeling (arrows) after bleomycin treatment in the presence of EdU during S phase. (C) Interstitial asymmetric chromatid labeling (arrows) after bleomycin treatment in the presence of EdU during G2. The asymmetric signals appear on chromosomes II, IV, V, and VI, respectively, at interstitial heterochromatic regions composed of homologous tandem repeats (Fuchs et al., 1994).
fpls-02-00008-g003.tif
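
The loop above indexes [0], which raises IndexError if a fig has no caption or no graphic. As a minimal sketch of the same per-figure pairing with that case handled, here is a variant that swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without lxml installed. The sample XML, the fig_pairs helper, and the "|" separator (borrowed from the commented-out lines in the question) are assumptions for illustration, and find() here only looks at direct children of each fig:

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced attributes as "{namespace-uri}name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up sample in the shape of the question's <fig> elements;
# the second figure deliberately has no caption.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <fig id="F1"><label>Figure 1</label>
    <caption><p><bold>Caption A</bold> with details.</p></caption>
    <graphic xlink:href="a.tif"/>
  </fig>
  <fig id="F2"><label>Figure 2</label>
    <graphic xlink:href="b.tif"/>
  </fig>
</article>"""


def fig_pairs(xml_text):
    """Return one (caption_text, href) tuple per <fig>, '' where missing."""
    root = ET.fromstring(xml_text)
    pairs = []
    for fig in root.iter("fig"):
        caption = fig.find("caption")  # direct child of <fig>, or None
        graphic = fig.find("graphic")  # direct child of <fig>, or None
        # itertext() walks all nested text, which drops tags such as <bold>.
        text = "".join(caption.itertext()).strip() if caption is not None else ""
        href = graphic.get(XLINK_HREF, "") if graphic is not None else ""
        pairs.append((text, href))
    return pairs


for text, href in fig_pairs(SAMPLE):
    print("|".join([text, href]))
```

Because caption and graphic are looked up inside each fig, a missing element leaves an empty field for that figure instead of shifting the pairs that follow.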