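Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform; that is non-negotiable in this case.)
Antiword seems like it might be a reasonable option, but it also looks like it might be abandoned.
A Python solution would be ideal, but doesn't appear to be available.
Answer 0 (score: 20)
(Same answer as for "extracting text from MS word files in python")
Use the native Python docx module which I made this week. Here's how to extract all the text from a doc: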
document = opendocx('Hello world.docx')
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]
# Extract all text
print(getdocumenttext(document))
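100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes, no crap.
Answer 1 (score: 12)
I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).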
import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 no longer exists in Python 3; subprocess.run replaces it.
    # Passing an argument list also avoids quoting the filename for a shell.
    result = subprocess.run(['catdoc', '-w', filename],
                            capture_output=True, text=True)
    if result.stderr:
        raise OSError("Executing the command caused an error: %s" % result.stderr)
    return result.stdout

# similar doc_to_text_antiword()
The -w switch to catdoc turns off line wrapping, BTW.
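The wrapper above can be generalized into one function that falls back from catdoc to antiword. This is only a sketch: the names find_converter and doc_to_text are made up for illustration, and it picks whichever tool is installed rather than whichever output is easiest to parse, which is a simpler criterion than the one the answer describes.

```python
import shutil
import subprocess

def find_converter(candidates=("catdoc", "antiword")):
    """Return the first candidate found on PATH, or None (hypothetical helper)."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None

def doc_to_text(filename, candidates=("catdoc", "antiword")):
    """Convert a .doc file to text with the first available tool."""
    tool = find_converter(candidates)
    if tool is None:
        raise OSError("None of these tools were found on PATH: %s"
                      % ", ".join(candidates))
    # catdoc gets -w (no line wrapping, as in the answer); other tools get only the file
    args = [tool, "-w", filename] if tool == "catdoc" else [tool, filename]
    result = subprocess.run(args, capture_output=True, text=True)
    if result.stderr:
        raise OSError("Executing the command caused an error: %s" % result.stderr)
    return result.stdout
```

Checking PATH first gives a clear error on a server where neither tool is installed, instead of a failure inside subprocess.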
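Answer 2 (score: 4)
If all you want to do is extract text from Word files (.docx), it's possible to do it with Python alone. Like Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this: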
"""
Module that extract text from MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""
try:
    from xml.etree.cElementTree import XML
except ImportError:
    # cElementTree was removed in Python 3.9
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    # Element.getiterator() was removed in Python 3.9; iter() is the replacement
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
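Since the question is about a web application, the document may arrive as uploaded bytes rather than as a path on disk. The sketch below applies the same unzip-then-parse approach to an in-memory buffer. The names docx_bytes_to_text and make_minimal_docx are hypothetical, and the generated archive contains only word/document.xml, which is enough for this parser but is not a file Word itself would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML
from xml.sax.saxutils import escape

NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W = '{%s}' % NS

def docx_bytes_to_text(data):
    """Return the paragraph text from the raw bytes of a .docx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = []
    for para in tree.iter(W + 'p'):
        text = ''.join(node.text for node in para.iter(W + 't') if node.text)
        if text:
            paragraphs.append(text)
    return '\n\n'.join(paragraphs)

def make_minimal_docx(paragraphs):
    """Build a stripped-down, test-only archive holding just word/document.xml."""
    body = ''.join('<w:p><w:r><w:t>%s</w:t></w:r></w:p>' % escape(p)
                   for p in paragraphs)
    xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>'
           % (NS, body))
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    return buf.getvalue()

if __name__ == '__main__':
    sample = make_minimal_docx(['Hello world.', 'Second paragraph.'])
    print(docx_bytes_to_text(sample))
```

Reading from io.BytesIO means the upload never has to be written to a temporary file before the text is extracted.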
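Answer 3 (score: 3)
Using the OpenOffice API, Python, and Andrew Pitonyak's excellent online macro book, I managed to do this. Section 7.16.4 is the place to start.
One other tip to make it work without needing a screen at all is to use the Hidden property: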
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL( docpath,"_blank", 0, (RO, Hidden,) )
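Otherwise the document flicks up on the screen (probably on the web server console) when you open it.
Answer 4 (score: 1)
OpenOffice has an API.
Answer 5 (score: 1)
For docx files, check out the Python script docx2txt, available at http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt, for extracting the plain text from docx documents.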
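Answer 6 (score: 0)
tika-python
A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.
Note: It also works with PyInstaller.
Install with pip: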
pip install tika
Link to the official GitHub repository
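Answer 7 (score: 0)
Honestly, don't use "pip install tika"; it was developed for a single user (one developer working on a laptop), not for multiple users (multiple developers).
The small class TikaWrapper.py below, which runs Tika from the command line, is more than enough to meet our needs.
You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, that's all! It works well for lots of formats (e.g. PDF, DOCX, ODT, XLSX, PPT, etc.).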
#!/bin/python
# -*- coding: utf-8 -*-

# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL

import subprocess

#####################
# TikaWrapper class #
#####################
class TikaWrapper:

    java_home = None
    tika_lib_path = None

    # Constructor
    def __init__(self, java_home, tikalib_path):
        self.java_home = java_home
        self.tika_lib_path = tikalib_path

    def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract metadata from a document

        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)

        - Examples:
          metadata = extractMetadata(filePath="MyDocument.docx")
          metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if returnTuple:
            return out, err
        return out

    def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract text from a document

        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)

        - Examples:
          text = extractText(filePath="MyDocument.docx")
          text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        # Honor returnTuple as documented (the original always returned a tuple)
        if returnTuple:
            return out, err
        return out

    # ===========
    # = PRIVATE =
    # ===========

    _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
    _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"

    def _getCmd(self, cmdModel, filePath, encoding):
        cmd = cmdModel.replace("${JAVA_HOME}", self.java_home)
        cmd = cmd.replace("${TIKALIB_PATH}", self.tika_lib_path)
        cmd = cmd.replace("${ENCODING}", encoding)
        cmd = cmd.replace("${FILE_PATH}", filePath)
        return cmd

    def _execute(self, cmd, encoding):
        process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        out = out.decode(encoding=encoding)
        err = err.decode(encoding=encoding)
        return out, err
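One caveat with the class above: the command is assembled as a single shell string, so a file path that contains spaces or shell metacharacters would likely be split or misread. A possible variation, sketched here with the hypothetical helper names build_tika_cmd and run_tika, builds an argument list and skips the shell. It reuses only the flags that appear in the class (--metadata, --encoding=, --text).

```python
import os
import subprocess

def build_tika_cmd(java_home, tika_jar, file_path, encoding="UTF-8", metadata=False):
    """Return the Tika command as an argument list (hypothetical helper)."""
    java = os.path.join(java_home, "bin", "java")
    if metadata:
        return [java, "-jar", tika_jar, "--metadata", file_path]
    return [java, "-jar", tika_jar, "--encoding=" + encoding, "--text", file_path]

def run_tika(cmd, encoding="UTF-8"):
    """Run the command without a shell and return (stdout, stderr) as text."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout.decode(encoding), proc.stderr.decode(encoding)

if __name__ == "__main__":
    # Example paths only; adjust to the local Java and Tika jar locations.
    print(build_tika_cmd("/usr/lib/jvm/default", "/opt/tika/tika-app.jar", "My Document.docx"))
```

Because each argument is passed to the process as-is, a name such as "My Document.docx" needs no quoting.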
Answer 8 (score: 0)
If anyone wants to do this in Java, the Apache POI API can be used. extractor.getText() will extract the plain text from the docx. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm
Answer 9 (score: -1)