PDF forms to CSV with Python or similar

Time: 2014-10-08 20:09:56

Tags: python xml csv pdf converter

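I have a bunch of PDF forms created with Adobe FormsCentral. They all share the same format, and I would like to extract the data in the fields into a CSV file. I am (somewhat) familiar with Python and have tried a few libraries to extract the text via the XML tags. However, I have reached the limit of what I can do :(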

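I have managed to read the PDF with 'pdfquery' and/or 'beautifulsoup', but I cannot find a simple tutorial anywhere that would help me parse the PDF to CSV/Excel. I searched SO and could not find anything quite relevant. The XML tree I managed to extract gives me the tags for the field names (see below), but I do not know where to go from here. Does anyone have experience with this kind of operation, or can anyone point me toward a tutorial?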

Any help is greatly appreciated!

Thanks,

Matty

    <pdfxml ModDate="D:20140414114502+03'00'" CreationDate="D:20140407143830-04'00'" Producer="Adobe FormsCentral 889953 S" Creator="Adobe FormsCentral 738134">
  <LTPage bbox="[0, 0, 595.27, 841.89]" height="841.89" pageid="1" rotate="0" width="595.27" x0="0" x1="595.27" y0="0" y1="841.89" page_index="0" page_label="">
    <LTRect bbox="[0.0, 0.0, 595.27, 841.89]" height="841.89" linewidth="0" pts="[[0.0, 0.0], [595.27, 0.0], [595.27, 841.89], [0.0, 841.89]]" width="595.27" x0="0.0" x1="595.27" y0="0.0" y1="841.89">
      <LTTextLineHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" width="99.816" word_margin="0.1" x0="34.015" x1="133.831" y0="732.217" y1="745.798"><LTTextBoxHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" index="1" width="99.816" x0="34.015" x1="133.831" y0="732.217" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextLineHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" width="94.724" word_margin="0.1" x0="34.015" x1="128.739" y0="707.554" y1="721.135"><LTTextBoxHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" index="2" width="94.724" x0="34.015" x1="128.739" y0="707.554" y1="721.135">Type of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 631.024, 136.667, 657.37]" height="26.347" index="3" width="102.642" x0="34.025" x1="136.667" y0="631.024" y1="657.37"><LTTextLineHorizontal bbox="[34.025, 643.789, 136.667, 657.37]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="643.789" y1="657.37">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 631.024, 112.269, 645.166]" height="14.143" width="78.244" word_margin="0.1" x0="34.025" x1="112.269" y0="631.024" y1="645.166">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 581.871, 136.667, 620.462]" height="38.592" index="4" width="102.642" x0="34.025" x1="136.667" y0="581.871" y1="620.462"><LTTextLineHorizontal bbox="[34.025, 606.881, 136.667, 620.462]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="606.881" y1="620.462">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 594.116, 134.963, 608.259]" height="14.143" width="100.938" word_margin="0.1" x0="34.025" x1="134.963" y0="594.116" y1="608.259">members aged 18-35 </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 581.871, 64.076, 596.014]" height="14.143" width="30.051" word_margin="0.1" x0="34.025" x1="64.076" y0="581.871" y1="596.014">(male) </LTTextLineHorizontal></LTTextBoxHorizontal>
      <LTTextLineHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" width="78.836" word_margin="0.1" x0="34.025" x1="112.861" y0="557.728" y1="571.31"><LTTextBoxHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" index="5" width="78.836" x0="34.025" x1="112.861" y0="557.728" y1="571.31">Location/Address </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 494.974, 138.371, 533.045]" height="38.071" index="6" width="104.346" x0="34.025" x1="138.371" y0="494.974" y1="533.045"><LTTextLineHorizontal bbox="[34.025, 519.463, 99.821, 533.045]" height="13.582" width="65.795" word_margin="0.1" x0="34.025" x1="99.821" y0="519.463" y1="533.045">Type of waste </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 507.218, 138.371, 520.8]" height="13.582" width="104.346" word_margin="0.1" x0="34.025" x1="138.371" y0="507.218" y1="520.8">management activities </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 494.974, 85.066, 508.555]" height="13.582" width="51.04" word_margin="0.1" x0="34.025" x1="85.066" y0="494.974" y1="508.555">carried out: </LTTextLineHorizontal></LTTextBoxHorizontal>

1 answer:

Answer 0 (score: 0)

I prefer to use the lxml package because it has a very handy objectify module that makes parsing XML very simple.

Here is an example showing several ways to extract data from the XML:

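from lxml import objectify


def parser(xml):
    """Parse the pdfquery XML and print each child of the first LTRect."""
    root = objectify.fromstring(xml)
    # Attributes of the LTRect element itself (bbox, height, width, ...)
    print(root.LTPage.LTRect.attrib)
    for item in root.LTPage.LTRect.iterchildren():
        print(item.tag)             # element name, e.g. LTTextBoxHorizontal
        print(item.text)            # direct text; None when the text sits in a nested element
        print(item.attrib)          # all attributes of the element
        print(item.attrib["bbox"])  # a single attribute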

if __name__ == "__main__":
    xml = """<pdfxml ModDate="D:20140414114502+03'00'" CreationDate="D:20140407143830-04'00'" Producer="Adobe FormsCentral 889953 S" Creator="Adobe FormsCentral 738134">
  <LTPage bbox="[0, 0, 595.27, 841.89]" height="841.89" pageid="1" rotate="0" width="595.27" x0="0" x1="595.27" y0="0" y1="841.89" page_index="0" page_label="">
    <LTRect bbox="[0.0, 0.0, 595.27, 841.89]" height="841.89" linewidth="0" pts="[[0.0, 0.0], [595.27, 0.0], [595.27, 841.89], [0.0, 841.89]]" width="595.27" x0="0.0" x1="595.27" y0="0.0" y1="841.89">
      <LTTextLineHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" width="99.816" word_margin="0.1" x0="34.015" x1="133.831" y0="732.217" y1="745.798"><LTTextBoxHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" index="1" width="99.816" x0="34.015" x1="133.831" y0="732.217" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextLineHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" width="94.724" word_margin="0.1" x0="34.015" x1="128.739" y0="707.554" y1="721.135"><LTTextBoxHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" index="2" width="94.724" x0="34.015" x1="128.739" y0="707.554" y1="721.135">Type of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 631.024, 136.667, 657.37]" height="26.347" index="3" width="102.642" x0="34.025" x1="136.667" y0="631.024" y1="657.37"><LTTextLineHorizontal bbox="[34.025, 643.789, 136.667, 657.37]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="643.789" y1="657.37">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 631.024, 112.269, 645.166]" height="14.143" width="78.244" word_margin="0.1" x0="34.025" x1="112.269" y0="631.024" y1="645.166">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 581.871, 136.667, 620.462]" height="38.592" index="4" width="102.642" x0="34.025" x1="136.667" y0="581.871" y1="620.462"><LTTextLineHorizontal bbox="[34.025, 606.881, 136.667, 620.462]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="606.881" y1="620.462">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 594.116, 134.963, 608.259]" height="14.143" width="100.938" word_margin="0.1" x0="34.025" x1="134.963" y0="594.116" y1="608.259">members aged 18-35 </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 581.871, 64.076, 596.014]" height="14.143" width="30.051" word_margin="0.1" x0="34.025" x1="64.076" y0="581.871" y1="596.014">(male) </LTTextLineHorizontal></LTTextBoxHorizontal>
      <LTTextLineHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" width="78.836" word_margin="0.1" x0="34.025" x1="112.861" y0="557.728" y1="571.31"><LTTextBoxHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" index="5" width="78.836" x0="34.025" x1="112.861" y0="557.728" y1="571.31">Location/Address </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 494.974, 138.371, 533.045]" height="38.071" index="6" width="104.346" x0="34.025" x1="138.371" y0="494.974" y1="533.045"><LTTextLineHorizontal bbox="[34.025, 519.463, 99.821, 533.045]" height="13.582" width="65.795" word_margin="0.1" x0="34.025" x1="99.821" y0="519.463" y1="533.045">Type of waste </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 507.218, 138.371, 520.8]" height="13.582" width="104.346" word_margin="0.1" x0="34.025" x1="138.371" y0="507.218" y1="520.8">management activities </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 494.974, 85.066, 508.555]" height="13.582" width="51.04" word_margin="0.1" x0="34.025" x1="85.066" y0="494.974" y1="508.555">carried out: </LTTextLineHorizontal></LTTextBoxHorizontal>
    </LTRect>
    </LTPage>
    </pdfxml>
      """
    parser(xml)
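
To get from that tree to a CSV file, one possible next step is to collect the text of every LTTextBoxHorizontal and write one row per box. The sketch below is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, SAMPLE_XML is a shortened, hypothetical stand-in for the real pdfquery output, and the function names and chosen columns (index, x0, y1, text) are made up for this example. It only recovers the text that appears in the layout tree; in the sample above that is the field labels, so the filled-in values may need a different extraction route.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the same shape as the pdfquery output above.
SAMPLE_XML = """<pdfxml>
  <LTPage pageid="1">
    <LTRect>
      <LTTextLineHorizontal y1="745.798"><LTTextBoxHorizontal index="1" x0="34.015" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal index="3" x0="34.025" y1="657.37"><LTTextLineHorizontal>Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal>members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
    </LTRect>
  </LTPage>
</pdfxml>"""


def extract_text_boxes(xml_text):
    """Return one dict per LTTextBoxHorizontal: index, position and joined text."""
    root = ET.fromstring(xml_text)
    rows = []
    for box in root.iter("LTTextBoxHorizontal"):
        # itertext() yields the box's own text plus the text of nested lines;
        # split/join collapses the runs of whitespace left by line wrapping.
        text = " ".join("".join(box.itertext()).split())
        rows.append({
            "index": box.get("index"),
            "x0": box.get("x0"),
            "y1": box.get("y1"),
            "text": text,
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["index", "x0", "y1", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_text_boxes(SAMPLE_XML)))
```

The same loop should carry over to lxml with little change, and writing to a real file instead of a string buffer only requires passing an open file object to csv.DictWriter.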

Note that I modified the XML so that it has proper closing tags. You may also find this tutorial useful: