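Do you have any suggestions on how to keep the original segmentation when converting plain text from a text file into an XML file, so that it matches the segmentation the XML had before? To explain: I am converting a text file to XML with JAXP + SAX, turning text like this: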
Hello world. I am happy to see you today.
into this XML:
<trans-unit id="1">
<target> Hello world</target>
</trans-unit>
<trans-unit id="2">
<target> I am happy to see you today</target>
</trans-unit>
However, suppose my source XML has 3 sentences in the unit with id="1", for example:
<trans-unit id="1">
<source> Hello world. Sunny smile. Wake up early.</source>
</trans-unit>
<trans-unit id="2">
<source> I am happy to see you today</source>
</trans-unit>
and when I parse the text out of this XML, I get plain text:
Hello world. Sunny smile. Wake up early.I am happy to see you today.
How can I convert this text to XML so that the target XML file again contains the 3 sentences in a single unit? Like:
<trans-unit id="1">
<target> Hello world. Sunny smile. Wake up early.</target>
</trans-unit>
<trans-unit id="2">
<target> I am happy to see you today</target>
</trans-unit>
That is, the txt -> xml conversion:
public void doit() {
    try {
        in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), "UTF8"));
        out = new StreamResult(selectedDir);
        initXML();
        String str;
        while ((str = in.readLine()) != null) {
            elements = str.split("\n|((?<!\\d)\\.(?!\\d))");
            for (i = 0; i < elements.length; i++)
                process(str);
        }
        in.close();
        closeXML();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
public void initXML() throws ParserConfigurationException, SAXException, UnsupportedEncodingException, FileNotFoundException, TransformerException {
    // JAXP + SAX
    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    th = tf.newTransformerHandler();
    Transformer serializer = th.getTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    // XML output
    serializer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    th.setResult(out);
    th.startDocument();
    atts = new AttributesImpl();
    atts1 = new AttributesImpl();
    atts1.addAttribute("", "", "xmlns", "CDATA", "urn:oasis:names:tc:xliff:document:1.2");
    th.startElement("", "", "xliff", atts1);
    th.startElement("", "", "file", null);
    th.startElement("", "", "body", null);
}
public void process(String s) throws SAXException {
    try {
        atts.clear();
        k++;
        atts.addAttribute("", "", "id", "", "" + k);
        th.startElement("", "", "trans-unit", atts);
        th.startElement("", "", "target", null);
        th.characters(elements[i].toCharArray(), 0, elements[i].length());
        th.endElement("", "", "target");
        th.endElement("", "", "trans-unit");
    }
    catch (Exception e) {
        System.out.print("Out of bounds!");
    }
}
public void closeXML() throws SAXException {
    th.endElement("", "", "body");
    th.endElement("", "", "file");
    th.endElement("", "", "xliff");
    th.endDocument();
}
Answer 0 (score: 0)
It looks like you mean:
String[] segs = elements[i].trim().split("[.!?]\\s+");
for (String seg : segs) {
    atts.clear();
    k++;
    atts.addAttribute("", "", "id", "", "" + k);
    th.startElement("", "", "trans-unit", atts);
    th.startElement("", "", "target", null);
    th.characters(seg.toCharArray(), 0, seg.length());
    th.endElement("", "", "target");
    th.endElement("", "", "trans-unit");
}
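This yields segments delimited by a sentence-ending symbol followed by at least some whitespace. Note that the delimiter itself (the punctuation mark and the whitespace) is not part of the resulting segments.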
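To make the effect of the split pattern concrete, here is a small sketch comparing it with a lookbehind variant that keeps the punctuation. The class and method names (SplitDemo, splitDroppingPunctuation, splitKeepingPunctuation) are made up for this illustration and are not part of the asker's code.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits at punctuation followed by whitespace; the matched punctuation is consumed.
    static String[] splitDroppingPunctuation(String text) {
        return text.trim().split("[.!?]\\s+");
    }

    // Splits at whitespace that is preceded by punctuation; the lookbehind is
    // zero-width, so the punctuation stays attached to its segment.
    static String[] splitKeepingPunctuation(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String text = "Hello world. Sunny smile! Wake up early.";
        System.out.println(Arrays.toString(splitDroppingPunctuation(text)));
        System.out.println(Arrays.toString(splitKeepingPunctuation(text)));
    }
}
```

In both variants a final period with no whitespace after it is left in the last segment, since the pattern requires whitespace to match.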
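After the comment, a new attempt: somehow you need to convert the source XML to the target XML directly. This can be done in a very simple and primitive way: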
boolean inSource = false;
StringBuilder source = null;
String str;
while ((str = in.readLine()) != null) {
    if (!inSource) {
        int pos = str.indexOf("<source>");
        if (pos != -1) {
            pos += "<source>".length();
            str = str.substring(pos); // keep only the text after the opening tag
            inSource = true;
            source = new StringBuilder();
        }
    }
    if (inSource) {
        int pos = str.indexOf("</source>");
        if (pos == -1) {
            pos = str.length(); // no closing tag on this line: take the whole line
        } else {
            inSource = false;
        }
        source.append(str.substring(0, pos));
        if (!inSource) {
            process(source.toString().trim());
            source = null;
        }
    }
}
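If scanning lines for tags by hand feels too fragile, the same collection of source texts can be sketched with a regular SAX parser. This is only an illustration under assumptions: the class name SourceReader and the method readSources are invented here, the XML is passed in as a string, and the trans-unit elements are wrapped in a single root element so that the input is well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SourceReader {
    // Collects the trimmed text of every <source> element, in document order.
    static List<String> readSources(String xml) {
        final List<String> sources = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside <source>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("source".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so always append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("source".equals(qName) && current != null) {
                    sources.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the source XML", e);
        }
        return sources;
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<trans-unit id=\"1\"><source> Hello world. Sunny smile. Wake up early.</source></trans-unit>"
                + "<trans-unit id=\"2\"><source> I am happy to see you today</source></trans-unit>"
                + "</body>";
        System.out.println(readSources(xml));
    }
}
```

Each collected string corresponds to one trans-unit, so the number of sentences per unit can be recovered from it later.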
Third attempt: in Java 7.
List<String> readSourcesFromXML(Path sourceXML) throws IOException {
    // Left unimplemented in this answer: collect the text of each <source> element.
}

String[] segments(String source) {
    return source.split("(?<=[.!?])\\s+"); // Or so
}

List<String> readTranslatedSegments(Path txt) throws IOException {
    return Files.readAllLines(txt, StandardCharsets.UTF_8);
}

void writeTargetsToXML(Path targetXML, Path txt, Path sourceXML) throws IOException {
    List<String> sources = readSourcesFromXML(sourceXML);
    List<String> translatedSegments = readTranslatedSegments(txt);
    List<String> targets = new ArrayList<>(sources.size());
    int segmentIndex = 0;
    for (String source : sources) {
        String target = "";
        // Consume as many translated segments as the source unit had sentences.
        int segmentsPerSource = segments(source).length;
        while (segmentsPerSource > 0) {
            --segmentsPerSource;
            if (!target.isEmpty()) {
                target += " ";
            }
            target += translatedSegments.get(segmentIndex);
            ++segmentIndex;
        }
        targets.add(target);
    }
    writeTargetsToXML(targetXML, targets); // overload that writes the XML, not shown here
}
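The regrouping idea can be tried without any files. The following is a minimal in-memory sketch, assuming the translated text holds one segment per list entry and that sentences in the source end with ".", "!" or "?" followed by whitespace. The names TargetRegrouper, countSegments and regroup are hypothetical, and the bounds check on the segment list is an addition not present in the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TargetRegrouper {
    // Number of sentences in one source unit, using the lookbehind split.
    static int countSegments(String source) {
        return source.trim().split("(?<=[.!?])\\s+").length;
    }

    // Joins translated segments so that each target has as many sentences as its source.
    static List<String> regroup(List<String> sources, List<String> translatedSegments) {
        List<String> targets = new ArrayList<>(sources.size());
        int segmentIndex = 0;
        for (String source : sources) {
            StringBuilder target = new StringBuilder();
            int remaining = countSegments(source);
            // Stop early if the translation has fewer segments than expected.
            while (remaining > 0 && segmentIndex < translatedSegments.size()) {
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(translatedSegments.get(segmentIndex).trim());
                segmentIndex++;
                remaining--;
            }
            targets.add(target.toString());
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList(
                "Hello world. Sunny smile. Wake up early.",
                "I am happy to see you today");
        List<String> translated = Arrays.asList(
                "Hello world.", "Sunny smile.", "Wake up early.", "I am happy to see you today.");
        for (String target : regroup(sources, translated)) {
            System.out.println(target);
        }
    }
}
```

This only works if the translation preserves the number of sentences per unit. If a translator merges or splits sentences, the counts drift and later units receive the wrong segments.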