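Best way to extract text from a Word document without using COM/automation?

Time: 2008-09-03 20:18:47

Tags: python ms-word

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform; that's non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

10 answers: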

Answer 0: (score: 20)

(Same answer as "extracting text from MS word files in python")

Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

document = opendocx('Hello world.docx')

# This location is where most document content lives 
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print(getdocumenttext(document))

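See the Python DocX site

100% Python, no COM, no .net, no Java, no parsing serialized XML with regexes, no crap.

Answer 1: (score: 12)

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).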

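import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 no longer exists in Python 3; subprocess does the same job.
    # Passing the arguments as a list means the filename needs no shell quoting.
    process = subprocess.Popen(['catdoc', '-w', filename],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    # communicate() reads both pipes to the end and waits for catdoc to exit
    retval, erroroutput = process.communicate()
    if not erroroutput:
        return retval
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)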

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.
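The antiword variant mentioned in the comment above could follow the same pattern. This is only a sketch: the helper name `run_converter` is made up for illustration, antiword is assumed to be on the PATH, and the `-w 0` option (described in antiword's manual as putting a whole paragraph on one line) is assumed to be supported by the installed version.

```python
import subprocess


def run_converter(argv):
    """Run an external converter and return its standard output as bytes."""
    # A list of arguments avoids shell quoting problems with odd file names.
    process = subprocess.Popen(argv,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out, err = process.communicate()
    if process.returncode != 0:
        raise OSError("Command %r failed: %s" % (argv, err))
    return out


def doc_to_text_antiword(filename):
    # Assumption: "-w 0" asks antiword for unwrapped paragraphs.
    return run_converter(['antiword', '-w', '0', filename])
```

Unlike the catdoc function, this sketch checks the exit status instead of treating any output on stderr as a failure, so harmless warnings do not abort the extraction.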

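Answer 2: (score: 4)

If all you want to do is extract text from Word files (.docx), it's possible to do it with Python alone. Like Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this: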

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile


"""
Module that extracts text from MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
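The function above only collects `w:t` text runs, so tabs and manual line breaks inside a paragraph disappear. A possible variation, offered as a sketch rather than a drop-in replacement, also maps `w:tab` to a tab character and `w:br`/`w:cr` to a newline. The function names here are hypothetical.

```python
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def paragraph_text(paragraph):
    """Join the text runs of one paragraph, keeping tabs and line breaks."""
    parts = []
    for node in paragraph.iter():
        if node.tag == W + 't' and node.text:
            parts.append(node.text)
        elif node.tag == W + 'tab':
            parts.append('\t')
        elif node.tag in (W + 'br', W + 'cr'):
            parts.append('\n')
    return ''.join(parts)


def get_docx_text_with_breaks(source):
    """Accept a path or a file-like object holding a .docx archive."""
    with zipfile.ZipFile(source) as document:
        tree = XML(document.read('word/document.xml'))
    paragraphs = [paragraph_text(p) for p in tree.iter(W + 'p')]
    # Skip empty paragraphs, as the original function does.
    return '\n\n'.join(p for p in paragraphs if p)
```

Because `zipfile.ZipFile` accepts file-like objects, the same function works on an upload held in memory (for example an `io.BytesIO`) without writing it to disk first.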

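Answer 3: (score: 3)

Using the OpenOffice API, Python, and Andrew Pitonyak's excellent online macro book, I managed to do this. Section 7.16.4 is the place to start.

One other tip to make it work without needing a screen at all is to use the Hidden property: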

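# PropertyValue is a UNO struct; this import only works when running under PyUNO
from com.sun.star.beans import PropertyValue

RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden))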

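Otherwise the document flicks up on the screen (probably on the web server console) when you open it.

Answer 4: (score: 1)

OpenOffice has an API.

Answer 5: (score: 1)

For docx files, check out the Python script docx2txt, available at

http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

for extracting the plain text from a docx document.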

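Answer 6: (score: 0)

tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: it also works well with pyinstaller.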

Install with pip:

pip install tika
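A minimal sketch of how the library might be used once installed. It relies on tika-python's `parser.from_file`, which returns a dict with `"metadata"` and `"content"` keys; the wrapper name `extract_text_with_tika` and its `parse` parameter are hypothetical, added only so the function can be exercised without a Java runtime.

```python
def extract_text_with_tika(path, parse=None):
    """Return the plain text Tika extracts from the file at `path`."""
    if parse is None:
        # Imported lazily: needs `pip install tika` and a Java runtime.
        from tika import parser
        parse = parser.from_file
    parsed = parse(path)
    # "content" can be None when Tika finds no text in the document.
    return (parsed.get("content") or "").strip()
```

Calling `extract_text_with_tika("report.docx")` uses the real parser; passing a stand-in callable as `parse` makes the function easy to unit test.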

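Link to the official GitHub

Answer 7: (score: 0)

Honestly, don't use "pip install tika"; it was developed for a single user (one developer working on a laptop), not for multiple users (several developers).

The small class TikaWrapper.py below, which uses Tika from the command line, is more than enough for our needs.

You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, that's all! It works well for lots of formats (e.g. PDF, DOCX, ODT, XLSX, PPT, etc.).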

#!/bin/python
# -*- coding: utf-8 -*-

# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
#####################
# TikaWrapper class #
#####################
class TikaWrapper:

    java_home = None
    tika_lib_path = None

    # Constructor
    def __init__(self, java_home, tikalib_path):
        self.java_home = java_home
        self.tika_lib_path = tikalib_path

    def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract metadata from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          metadata = extractMetadata(filePath="MyDocument.docx")
          metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract text from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          text = extractText(filePath="MyDocument.docx")
          text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    # ===========
    # = PRIVATE =
    # ===========

    _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
    _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"

    def _getCmd(self, cmdModel, filePath, encoding):
        cmd = cmdModel.replace("${JAVA_HOME}", self.java_home)
        cmd = cmd.replace("${TIKALIB_PATH}", self.tika_lib_path)
        cmd = cmd.replace("${ENCODING}", encoding)
        cmd = cmd.replace("${FILE_PATH}", filePath)
        return cmd

    def _execute(self, cmd, encoding):
        import subprocess
        process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        out = out.decode(encoding=encoding)
        err = err.decode(encoding=encoding)
        return out, err
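One caveat with the class above: it substitutes paths into a shell string, so a file name containing spaces or shell metacharacters can break the command. A possible alternative, sketched here with hypothetical function names, builds an argument list instead. The `--encoding=` and `--text` flags are the ones used in the class above.

```python
import os
import subprocess


def build_tika_command(java_home, tika_jar, file_path, encoding="UTF-8"):
    """Build the argument list for a Tika text extraction run."""
    java = os.path.join(java_home, "bin", "java")
    return [java, "-jar", tika_jar,
            "--encoding=" + encoding, "--text", file_path]


def extract_text(java_home, tika_jar, file_path, encoding="UTF-8"):
    # No shell is involved, so file_path is passed through untouched.
    cmd = build_tika_command(java_home, tika_jar, file_path, encoding)
    out = subprocess.check_output(cmd)
    return out.decode(encoding)
```

`subprocess.check_output` raises `CalledProcessError` when Java exits with a non-zero status, which replaces the manual inspection of stderr.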

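Answer 8: (score: 0)

In case someone wants to do this in Java, there is the Apache POI API. extractor.getText() will extract the plain text from a docx. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm

Answer 9: (score: -1)

This worked well for .doc and .odt.

It calls openoffice on the command line to convert your file to text, which you can then simply load into Python.

(It seems to have other format options, though they are not well documented.)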
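The command-line approach can be sketched roughly as follows. This is not the linked script: it assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless`, `--convert-to` and `--outdir`, and the function names are hypothetical. The encoding of the generated text file may depend on the installation, so the `utf-8` default is also an assumption.

```python
import os
import subprocess


def build_convert_command(path, outdir, soffice="soffice"):
    """Build the argument list for a headless conversion to plain text."""
    return [soffice, "--headless", "--convert-to", "txt:Text",
            "--outdir", outdir, path]


def office_doc_to_text(path, outdir, encoding="utf-8"):
    subprocess.check_call(build_convert_command(path, outdir))
    # The converter is expected to write <basename>.txt into outdir.
    base = os.path.splitext(os.path.basename(path))[0]
    with open(os.path.join(outdir, base + ".txt"), "rb") as handle:
        return handle.read().decode(encoding)
```

Writing into a dedicated temporary directory for `outdir` keeps converted files from colliding when several requests are handled at once.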