Question

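Do you have any suggestions on how to preserve the segmentation when converting plain text from a text file into an XML file, so that it matches the segmentation of the original XML? I am using JAXP + SAX to convert a text file to XML, for example this text: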

 Hello world. I am happy to see you today.

into this XML:

 <trans-unit id="1">
            <target> Hello world</target>
        </trans-unit>
        <trans-unit id="2">
            <target> I am happy to see you today</target>
        </trans-unit>

But suppose the trans-unit with id="1" in my source XML contains three sentences, for example:

<trans-unit id="1">
            <source> Hello world. Sunny smile. Wake up early.</source>
        </trans-unit>
        <trans-unit id="2">
            <source> I am happy to see you today</source>
        </trans-unit>

and after extracting the text from this XML I end up with this plain text:

Hello world. Sunny smile. Wake up early.I am happy to see you today.

How can I convert this text back to XML so that the first unit of the target XML file again contains all three sentences? Like this:

<trans-unit id="1">
            <target> Hello world. Sunny smile. Wake up early.</target>
        </trans-unit>
        <trans-unit id="2">
            <target> I am happy to see you today</target>
        </trans-unit>

Here is my txt -> xml conversion:

public void doit() {
    try {

        in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), "UTF8"));
        out = new StreamResult(selectedDir);
        initXML();
        String str;
        while ((str = in.readLine()) != null) {

        elements = str.split("\n|((?<!\\d)\\.(?!\\d))");
        for (i = 0; i < elements.length; i++)
            process(str);

         }
        in.close();
        closeXML();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void initXML() throws ParserConfigurationException,SAXException, UnsupportedEncodingException, FileNotFoundException, TransformerException {
    // JAXP + SAX
    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    th = tf.newTransformerHandler();
    Transformer serializer = th.getTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    // XML output
    serializer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    th.setResult(out);
    th.startDocument();
    atts = new AttributesImpl();
    atts1 = new AttributesImpl();
    atts1.addAttribute("", "", "xlmns","CDATA", "urn:oasis:names:tc:xliff:document:1.2");    
    th.startElement("", "", "xliff", atts1);
    th.startElement("", "", "file",null);
    th.startElement("", "", "body", null);


}

public void process(String s) throws SAXException {
  try {

        atts.clear();
        k++;
        atts.addAttribute("", "", "id", "", "" + k);
        th.startElement("", "", "trans-unit", atts);
        th.startElement("", "", "target", null);
        th.characters(elements[i].toCharArray(), 0, elements[i].length());
        th.endElement("", "", "target");
        th.endElement("", "", "trans-unit");
     }
 catch (Exception e) {
        System.out.print("Out of bounds!");
    }
}
public void closeXML() throws SAXException {
    th.endElement("", "", "body");
    th.endElement("", "", "file");
    th.endElement("", "", "xliff");
    th.endDocument();
}

Answer 1

It looks like you mean:

String[] segs = elements[i].trim().split("[.!?]\\s+");
for (String seg : segs) {
    atts.clear();
    k++;
    atts.addAttribute("", "", "id", "", "" + k);
    th.startElement("", "", "trans-unit", atts);
    th.startElement("", "", "target", null);
    th.characters(seg.toCharArray(), 0, seg.length());
    th.endElement("", "", "target");
    th.endElement("", "", "trans-unit");
}

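This splits the text into segments wherever a sentence-ending punctuation mark is followed by at least some whitespace. Note that the matched punctuation mark is consumed by the split and does not appear in the segment.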
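To make the effect of the split pattern concrete, here is a minimal sketch comparing the pattern used above with a lookbehind variant that keeps the punctuation mark. The class and method names (`SplitDemo`, `splitDroppingPunctuation`, `splitKeepingPunctuation`) are made up for illustration and are not part of the question's code.

```java
import java.util.Arrays;

final class SplitDemo {
    // Splits on punctuation followed by whitespace; the punctuation is consumed.
    static String[] splitDroppingPunctuation(String text) {
        return text.trim().split("[.!?]\\s+");
    }

    // Splits on whitespace that follows punctuation; the lookbehind keeps the punctuation.
    static String[] splitKeepingPunctuation(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String text = " Hello world. Sunny smile. Wake up early.";
        // prints [Hello world, Sunny smile, Wake up early.]
        System.out.println(Arrays.toString(splitDroppingPunctuation(text)));
        // prints [Hello world., Sunny smile., Wake up early.]
        System.out.println(Arrays.toString(splitKeepingPunctuation(text)));
    }
}
```

Neither pattern splits where a sentence end is not followed by whitespace, as in `early.I am`, so text glued together that way would need separate handling.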

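After the comment, a new approach: somehow you need to transform the source XML into the target XML directly. This can be done in a very simple and crude way: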

    boolean inSource = false;
    StringBuilder source = null;
    String str;
    while ((str = in.readLine()) != null) {
        if (!inSource) {
            int pos = str.indexOf("<source>");
            if (pos != -1) {
                pos += "<source>".length();
                // Keep only the text that follows the opening tag.
                str = str.substring(pos);
                inSource = true;
                source = new StringBuilder();
            }
        }
        if (inSource) {
            int pos = str.indexOf("</source>");
            if (pos == -1) {
                // No closing tag on this line: take the whole line.
                pos = str.length();
            } else {
                inSource = false;
            }
            source.append(str.substring(0, pos));
            if (!inSource) {
                // The complete <source> text becomes one unit.
                process(source.toString().trim());
                source = null;
            }
        }
    }
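As an alternative to scanning lines for the literal tags, the `<source>` texts could also be collected with a SAX handler, since the question already uses JAXP + SAX. The following is only a sketch of that swapped-in technique: the class name `SourceExtractor` and the `extract` helper are hypothetical, and it assumes the elements are written without a namespace prefix so that the qualified name is exactly `source`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SourceExtractor extends DefaultHandler {
    private final List<String> sources = new ArrayList<>();
    private StringBuilder current; // non-null only while inside <source>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("source".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("source".equals(qName) && current != null) {
            sources.add(current.toString().trim());
            current = null;
        }
    }

    // Hypothetical helper: returns the trimmed text of each <source>, in document order.
    static List<String> extract(String xml) {
        SourceExtractor handler = new SourceExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.sources;
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<trans-unit id=\"1\"><source> Hello world. Sunny smile. Wake up early.</source></trans-unit>"
                + "<trans-unit id=\"2\"><source> I am happy to see you today</source></trans-unit>"
                + "</body>";
        for (String s : extract(xml)) {
            System.out.println(s);
        }
    }
}
```

Unlike the line-based scan, this copes with a `<source>` element that spans several lines or shares a line with other tags, at the cost of requiring well-formed XML input.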

Third attempt: in Java 7.

// Stub: should return the text of every <source> element, in document order.
List<String> readSourcesFromXML(Path sourceXML) throws IOException { }

String[] segments(String source) {
    // Split on whitespace that follows a sentence-ending mark.
    return source.split("(?<=[.!?])\\s+"); // Or so
}

List<String> readTranslatedSegments(Path txt) throws IOException {
    return Files.readAllLines(txt, StandardCharsets.UTF_8);
}

void writeTargetsToXML(Path targetXML, Path txt, Path sourceXML) throws IOException {
    List<String> sources = readSourcesFromXML(sourceXML);
    List<String> translatedSegments = readTranslatedSegments(txt);

    List<String> targets = new ArrayList<>(sources.size());
    int segmentIndex = 0;
    for (String source : sources) {
        String target = "";
        // Take as many translated segments as the source unit had sentences.
        int segmentsPerSource = segments(source).length;
        while (segmentsPerSource > 0) {
            --segmentsPerSource;
            if (!target.isEmpty()) {
                target += " ";
            }
            target += translatedSegments.get(segmentIndex);
            ++segmentIndex;
        }
        targets.add(target);
    }

    writeTargetsToXML(targetXML, targets);
}
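The regrouping idea of this third attempt can be tried in isolation, without any file or XML handling. Below is a minimal self-contained sketch; the names `Regroup`, `countSegments` and `regroup` are invented for illustration, it assumes one translated segment per list entry, and it uses `String.join`, which needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Regroup {
    // Number of sentences in a source unit, using the same lookbehind split idea.
    static int countSegments(String source) {
        return source.trim().split("(?<=[.!?])\\s+").length;
    }

    // Gives each source unit as many translated segments as it has sentences.
    static List<String> regroup(List<String> sources, List<String> translated) {
        List<String> targets = new ArrayList<>(sources.size());
        int index = 0;
        for (String source : sources) {
            int count = countSegments(source);
            // Clamp so that a short translation list does not throw.
            int end = Math.min(index + count, translated.size());
            targets.add(String.join(" ", translated.subList(index, end)));
            index = end;
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList(
                "Hello world. Sunny smile. Wake up early.",
                "I am happy to see you today");
        // Placeholder strings standing in for translated sentences.
        List<String> translated = Arrays.asList("T1.", "T2.", "T3.", "T4");
        // prints [T1. T2. T3., T4]
        System.out.println(regroup(sources, translated));
    }
}
```

This only works if the translation keeps the same number of sentences per unit as the source; if a translator merges or splits sentences, the counts drift and later units receive the wrong text.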

Segmenting text in an XLIFF file
