在XML标记中获取完整文本,因为文本以空格分隔

时间:2018-04-15 12:17:31

标签: python xml

我有一个XML文件,它是pdf2txt程序的输出[将pdf文件转换为XML文件]。

以下是它的示例输出。 注意这是一个来自它生成的大型XML文件的小样本,而且我只放了一小部分与此问题相关。

我想要的输出
我需要将<figure></figure>标记中包含的所有文字作为文字,以空格分隔。

例如:在这个例子中,我需要

$1,000,000  CAP022488-0214 10/29/2016 AUSTIN, TX 78705.

这些单词中的每一个都需要用空格分隔,我想将它们作为单独的字符捕获。

请告诉我如何在Python中执行此操作?

<figure name="Xi2" bbox="251.570,374.910,334.710,387.700">
<text font="Arial" bbox="253.570,374.390,259.130,387.700" size="13.310">$</text>
<text font="Arial" bbox="259.130,374.390,264.690,387.700" size="13.310">1</text>
<text font="Arial" bbox="264.690,374.390,267.470,387.700" size="13.310">,</text>
<text font="Arial" bbox="267.470,374.390,273.030,387.700" size="13.310">0</text>
<text font="Arial" bbox="273.030,374.390,278.590,387.700" size="13.310">0</text>
<text font="Arial" bbox="278.590,374.390,284.150,387.700" size="13.310">0</text>
<text font="Arial" bbox="284.150,374.390,286.930,387.700" size="13.310">,</text>
<text font="Arial" bbox="286.930,374.390,292.490,387.700" size="13.310">0</text>
<text font="Arial" bbox="292.490,374.390,298.050,387.700" size="13.310">0</text>
<text font="Arial" bbox="298.050,374.390,303.610,387.700" size="13.310">0</text>
</figure>
<figure name="Xi3" bbox="124.790,543.930,319.200,556.720">
<text font="Arial" bbox="126.790,543.410,134.010,556.720" size="13.310">C</text>
<text font="Arial" bbox="134.010,543.410,140.680,556.720" size="13.310">A</text>
<text font="Arial" bbox="140.680,543.410,147.350,556.720" size="13.310">P</text>
<text font="Arial" bbox="147.350,543.410,152.910,556.720" size="13.310">0</text>
<text font="Arial" bbox="152.910,543.410,158.470,556.720" size="13.310">2</text>
<text font="Arial" bbox="158.470,543.410,164.030,556.720" size="13.310">2</text>
<text font="Arial" bbox="164.030,543.410,169.590,556.720" size="13.310">4</text>
<text font="Arial" bbox="169.590,543.410,175.150,556.720" size="13.310">8</text>
<text font="Arial" bbox="175.150,543.410,180.710,556.720" size="13.310">8</text>
<text font="Arial" bbox="180.710,543.410,184.040,556.720" size="13.310">-</text>
<text font="Arial" bbox="184.040,543.410,189.600,556.720" size="13.310">0</text>
<text font="Arial" bbox="189.600,543.410,195.160,556.720" size="13.310">2</text>
<text font="Arial" bbox="195.160,543.410,200.720,556.720" size="13.310">1</text>
<text font="Arial" bbox="200.720,543.410,206.280,556.720" size="13.310">4</text>
</figure>
<figure name="Xi4" bbox="472.390,409.720,542.730,422.510">
<text font="Arial" bbox="474.390,409.200,479.950,422.510" size="13.310">1</text>
<text font="Arial" bbox="479.950,409.200,485.510,422.510" size="13.310">0</text>
<text font="Arial" bbox="485.510,409.200,488.290,422.510" size="13.310">/</text>
<text font="Arial" bbox="488.290,409.200,493.850,422.510" size="13.310">2</text>
<text font="Arial" bbox="493.850,409.200,499.410,422.510" size="13.310">9</text>
<text font="Arial" bbox="499.410,409.200,502.190,422.510" size="13.310">/</text>
<text font="Arial" bbox="502.190,409.200,507.750,422.510" size="13.310">2</text>
<text font="Arial" bbox="507.750,409.200,513.310,422.510" size="13.310">0</text>
<text font="Arial" bbox="513.310,409.200,518.870,422.510" size="13.310">1</text>
<text font="Arial" bbox="518.870,409.200,524.430,422.510" size="13.310">6</text>
</figure>
<figure name="Xi5" bbox="36.440,439.990,293.520,452.780">
<text font="Arial" bbox="38.440,439.470,45.110,452.780" size="13.310">A</text>
<text font="Arial" bbox="45.110,439.470,50.670,452.780" size="13.310">u</text>
<text font="Arial" bbox="50.670,439.470,55.670,452.780" size="13.310">s</text>
<text font="Arial" bbox="55.670,439.470,58.450,452.780" size="13.310">t</text>
<text font="Arial" bbox="58.450,439.470,60.670,452.780" size="13.310">i</text>
<text font="Arial" bbox="60.670,439.470,66.230,452.780" size="13.310">n</text>
<text font="Arial" bbox="66.230,439.470,69.010,452.780" size="13.310">,</text>
<text font="Arial" bbox="69.010,439.470,71.790,452.780" size="13.310"> </text>
<text font="Arial" bbox="71.790,439.470,77.900,452.780" size="13.310">T</text>
<text font="Arial" bbox="77.900,439.470,84.570,452.780" size="13.310">X</text>
<text font="Arial" bbox="84.570,439.470,87.350,452.780" size="13.310"> </text>
<text font="Arial" bbox="87.350,439.470,92.910,452.780" size="13.310">7</text>
<text font="Arial" bbox="92.910,439.470,98.470,452.780" size="13.310">8</text>
<text font="Arial" bbox="98.470,439.470,104.030,452.780" size="13.310">7</text>
<text font="Arial" bbox="104.030,439.470,109.590,452.780" size="13.310">0</text>
<text font="Arial" bbox="109.590,439.470,115.150,452.780" size="13.310">5</text>
</figure>

2 个答案:

答案 0 :(得分:1)

使用 xml模块。我还将xml内容封装在数据中,仅用于演示。

<强>演示:

s = """
<data>
<figure name="Xi2" bbox="251.570,374.910,334.710,387.700">
<text font="Arial" bbox="253.570,374.390,259.130,387.700" size="13.310">$</text>
<text font="Arial" bbox="259.130,374.390,264.690,387.700" size="13.310">1</text>
<text font="Arial" bbox="264.690,374.390,267.470,387.700" size="13.310">,</text>
<text font="Arial" bbox="267.470,374.390,273.030,387.700" size="13.310">0</text>
<text font="Arial" bbox="273.030,374.390,278.590,387.700" size="13.310">0</text>
<text font="Arial" bbox="278.590,374.390,284.150,387.700" size="13.310">0</text>
<text font="Arial" bbox="284.150,374.390,286.930,387.700" size="13.310">,</text>
<text font="Arial" bbox="286.930,374.390,292.490,387.700" size="13.310">0</text>
<text font="Arial" bbox="292.490,374.390,298.050,387.700" size="13.310">0</text>
<text font="Arial" bbox="298.050,374.390,303.610,387.700" size="13.310">0</text>
</figure>
<figure name="Xi3" bbox="124.790,543.930,319.200,556.720">
<text font="Arial" bbox="126.790,543.410,134.010,556.720" size="13.310">C</text>
<text font="Arial" bbox="134.010,543.410,140.680,556.720" size="13.310">A</text>
<text font="Arial" bbox="140.680,543.410,147.350,556.720" size="13.310">P</text>
<text font="Arial" bbox="147.350,543.410,152.910,556.720" size="13.310">0</text>
<text font="Arial" bbox="152.910,543.410,158.470,556.720" size="13.310">2</text>
<text font="Arial" bbox="158.470,543.410,164.030,556.720" size="13.310">2</text>
<text font="Arial" bbox="164.030,543.410,169.590,556.720" size="13.310">4</text>
<text font="Arial" bbox="169.590,543.410,175.150,556.720" size="13.310">8</text>
<text font="Arial" bbox="175.150,543.410,180.710,556.720" size="13.310">8</text>
<text font="Arial" bbox="180.710,543.410,184.040,556.720" size="13.310">-</text>
<text font="Arial" bbox="184.040,543.410,189.600,556.720" size="13.310">0</text>
<text font="Arial" bbox="189.600,543.410,195.160,556.720" size="13.310">2</text>
<text font="Arial" bbox="195.160,543.410,200.720,556.720" size="13.310">1</text>
<text font="Arial" bbox="200.720,543.410,206.280,556.720" size="13.310">4</text>
</figure>
<figure name="Xi4" bbox="472.390,409.720,542.730,422.510">
<text font="Arial" bbox="474.390,409.200,479.950,422.510" size="13.310">1</text>
<text font="Arial" bbox="479.950,409.200,485.510,422.510" size="13.310">0</text>
<text font="Arial" bbox="485.510,409.200,488.290,422.510" size="13.310">/</text>
<text font="Arial" bbox="488.290,409.200,493.850,422.510" size="13.310">2</text>
<text font="Arial" bbox="493.850,409.200,499.410,422.510" size="13.310">9</text>
<text font="Arial" bbox="499.410,409.200,502.190,422.510" size="13.310">/</text>
<text font="Arial" bbox="502.190,409.200,507.750,422.510" size="13.310">2</text>
<text font="Arial" bbox="507.750,409.200,513.310,422.510" size="13.310">0</text>
<text font="Arial" bbox="513.310,409.200,518.870,422.510" size="13.310">1</text>
<text font="Arial" bbox="518.870,409.200,524.430,422.510" size="13.310">6</text>
</figure>
<figure name="Xi5" bbox="36.440,439.990,293.520,452.780">
<text font="Arial" bbox="38.440,439.470,45.110,452.780" size="13.310">A</text>
<text font="Arial" bbox="45.110,439.470,50.670,452.780" size="13.310">u</text>
<text font="Arial" bbox="50.670,439.470,55.670,452.780" size="13.310">s</text>
<text font="Arial" bbox="55.670,439.470,58.450,452.780" size="13.310">t</text>
<text font="Arial" bbox="58.450,439.470,60.670,452.780" size="13.310">i</text>
<text font="Arial" bbox="60.670,439.470,66.230,452.780" size="13.310">n</text>
<text font="Arial" bbox="66.230,439.470,69.010,452.780" size="13.310">,</text>
<text font="Arial" bbox="69.010,439.470,71.790,452.780" size="13.310"> </text>
<text font="Arial" bbox="71.790,439.470,77.900,452.780" size="13.310">T</text>
<text font="Arial" bbox="77.900,439.470,84.570,452.780" size="13.310">X</text>
<text font="Arial" bbox="84.570,439.470,87.350,452.780" size="13.310"> </text>
<text font="Arial" bbox="87.350,439.470,92.910,452.780" size="13.310">7</text>
<text font="Arial" bbox="92.910,439.470,98.470,452.780" size="13.310">8</text>
<text font="Arial" bbox="98.470,439.470,104.030,452.780" size="13.310">7</text>
<text font="Arial" bbox="104.030,439.470,109.590,452.780" size="13.310">0</text>
<text font="Arial" bbox="109.590,439.470,115.150,452.780" size="13.310">5</text>
</figure>
</data>"""

import xml.etree.ElementTree as ET
tree = ET.fromstring(s)
stringVal = ""
for i in tree.findall('figure'):
    for j in i.findall('text'):
        stringVal += j.text
print(stringVal)

<强>输出:

$1,000,000CAP022488-021410/29/2016Austin, TX 78705

<强> MoreInfo

答案 1 :(得分:1)

您可以使用BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('your_file.xml'), 'lxml')
print soup.get_text().replace('\n\n', ' ').replace('\n', '')

结果:

$1,000,000 CAP022488-0214 10/29/2016 Austin, TX 78705