Question

I am using the following code to extract text from an .odt file:

public class OpenOfficeParser {

StringBuffer TextBuffer;

public OpenOfficeParser() {}

//Process text elements recursively
public void processElement(Object o) {

    if (o instanceof Element) {

        Element e = (Element) o;
        String elementName = e.getQualifiedName();

        if (elementName.startsWith("text")) {

            if (elementName.equals("text:tab")) // add tab for text:tab
                TextBuffer.append("\\t");
            else if (elementName.equals("text:s"))  // add space for text:s
                TextBuffer.append(" ");
            else {
                List children = e.getContent();
                Iterator iterator = children.iterator();

                while (iterator.hasNext()) {

                    Object child = iterator.next();
                    //If Child is a Text Node, then append the text
                    if (child instanceof Text) { 
                        Text t = (Text) child;
                        TextBuffer.append(t.getValue());
                    }
                    else
                    processElement(child); // Recursively process the child element                   
                }                   
            }
            if (elementName.equals("text:p"))
                TextBuffer.append("\\n");                   
        }
        else {
            List non_text_list = e.getContent();
            Iterator it = non_text_list.iterator();
            while (it.hasNext()) {
                Object non_text_child = it.next();
                processElement(non_text_child);                   
            }
        }               
    }
}

public String getText(String fileName) throws Exception {
    TextBuffer = new StringBuffer();

    //Unzip the openOffice Document
    ZipFile zipFile = new ZipFile(fileName);
    Enumeration entries = zipFile.entries();
    ZipEntry entry;

    while(entries.hasMoreElements()) {
        entry = (ZipEntry) entries.nextElement();

        if (entry.getName().equals("content.xml")) {

            TextBuffer = new StringBuffer();               
            SAXBuilder sax = new SAXBuilder();
            Document doc = sax.build(zipFile.getInputStream(entry));
            Element rootElement = doc.getRootElement();
            processElement(rootElement);
            break;
        }
    }    

    System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
    return TextBuffer.toString();
}
}

My problem comes up when I use the string returned by getText(). I ran the program on an .odt file, and this is a piece of the extracted text:

(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

So I tried this:

System.out.println( TextBuffer.toString().split("\\n"));

The output I get is:

substring: [Ljava.lang.String;@505bb829

I also tried this:

System.out.println( TextBuffer.toString().trim() );

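But the printed string is unchanged.

Why does this happen? What can I do to parse this string correctly? And what should I do if I want to store in array[i] each substring that ends with "\n\n"?

Edit: Sorry, I made a mistake in my example because I forgot that split() returns an array. The real problem is that it returns an array containing a single line, so what I am asking is why this: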

System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));

has no effect on the string I wrote in the example.

Also, this:

    System.out.println( TextBuffer.toString().trim() );

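has no effect on the original string; it just prints the original string.

To illustrate why I want to use split(): I want to parse that string and put each substring that ends with "\n" into its own array element. Here is an example.

My original string: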

    (no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

After parsing, I would print each element of the array, and the output should be:

line 1: (no hi virtual x oy)\
line 2: house cat
line 3: open it
line 4: trying to
and so on.....

Answer 1

If I understand your question, I would do something like this:

String str = "(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....";

List<String> al = new ArrayList<String>(Arrays.asList(str.split("\\n")));

al.removeAll(Arrays.asList("", null)); // remove empty or null string

for (int i = 0; i< al.size(); i++) {
    System.out.println("Line " + i + " : " + al.get(i).trim());
}

Output

Line 0 : (no hi virtual x oy)
Line 1 : house cat
Line 2 : open it
Line 3 : trying to....
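
One possibility worth checking, given that the question's parser calls TextBuffer.append("\\n"): in Java source, "\\n" is the two characters backslash and n, not a line break. If that is what ends up in the extracted text, split("\\n") (whose regex matches a real line feed) finds nothing to split on and returns a one-element array, and trim() has no whitespace to strip at either end. The sketch below assumes the string really contains literal backslash-n sequences; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LiteralNewlineSplit {

    // Splits on runs of the literal two-character sequence backslash + 'n',
    // trims each piece, and drops pieces that end up empty.
    static List<String> splitOnLiteralBackslashN(String text) {
        List<String> lines = new ArrayList<>();
        // The Java literal "(?:\\\\n)+" is the regex (?:\\n)+ :
        // an escaped backslash followed by 'n', repeated one or more times.
        for (String piece : text.split("(?:\\\\n)+")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: the extracted text holds backslash + 'n', not real line breaks.
        String extracted =
                "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to....";

        // The regex "\\n" matches a real line feed, which this string lacks,
        // so split returns the whole string as a single element.
        System.out.println(extracted.split("\\n").length);

        List<String> lines = splitOnLiteralBackslashN(extracted);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

Alternatively, changing the parser to append "\n" and "\t" (single backslash) would put real line breaks and tabs into the buffer, after which the split("\\n") approach above works as written.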

Parsing a string using default methods