Question

I am trying to extract the text between XML tags. The text between the tags is multilingual. For example:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

I tried googling it and found some regular expressions, but none of them worked. Here is one that I tried:

String str = "<string xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">"+
    "तुम्हारा नाम क्या है"+"</string>";

final Pattern pattern = Pattern.compile("<String xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">(.+?)</string>");

final Matcher matcher = pattern.matcher(str);
matcher.find();
System.out.println(matcher.group(1));

The given String has this format:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

and the expected output is:

तुम्हारा नाम क्या है

It gives me an error.

Answer 1

This pattern matches the expected part, and $1 gives you the expected result:

/<string .*?>(.*?)<\\/string>/

Online Demo
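As a rough illustration of how the pattern above could be applied in Java, here is a minimal sketch. The class name RegexExtract and the helper extract are hypothetical, and the use of Pattern.DOTALL and trim() is an assumption made so that the line breaks and indentation around the text are handled. Note that the attempt in the question cannot match because its pattern starts with "<String" (capital S) while the input starts with "<string", and Java regexes are case-sensitive by default; matcher.find() then returns false and matcher.group(1) throws an IllegalStateException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Hypothetical helper: returns the trimmed text of the first <string> element,
    // or null when the pattern does not match.
    static String extract(String xml) {
        // DOTALL lets '.' match the line breaks that surround the text.
        Pattern pattern = Pattern.compile("<string .*?>(.*?)</string>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(xml);
        // Check the result of find() before calling group(), otherwise a failed
        // match surfaces as an IllegalStateException.
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // The inner quotes are escaped so the attribute value stays quoted.
        String str = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
                + "    तुम्हारा नाम क्या है\n"
                + "</string>";
        System.out.println(extract(str)); // prints: तुम्हारा नाम क्या है
    }
}
```

This only works for simple, predictable input such as the sample above; it does not handle nested elements, CDATA sections, or escaped entities.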

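But I strongly recommend that you stop using regular expressions for this! You should find an HTML (or XML) parser for Java and simply get the contents of the <string> tag.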
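One possible way to follow that advice without a third-party library is the DOM parser that ships with the JDK. The sketch below is only an illustration: the class name DomExtract and the helper extract are hypothetical, and it assumes that <string> is the root element of the input, as in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomExtract {
    // Hypothetical helper: parses the XML and returns the root element's text.
    static String extract(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The element carries a default namespace, so parse namespace-aware.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Assumes <string> is the document's root element.
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String str = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
                + "    तुम्हारा नाम क्या है\n"
                + "</string>";
        System.out.println(extract(str)); // prints: तुम्हारा नाम क्या है
    }
}
```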

Answer 2

Do not use regular expressions to parse XML. It will work in a few cases, but it will eventually fail. For a full explanation, see Can you provide some examples of why it is hard to parse XML and HTML with a regex?.

The simplest way to extract the string content of an element is to use XPath:

import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

String contents =
    XPathFactory.newInstance().newXPath().evaluate(
        "//*[local-name()='string']",
        new InputSource(new StringReader(str)));
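For completeness, the XPath call could be wrapped into a small runnable program along these lines. The class name XPathExtract and the helper extract are hypothetical, and the trim() call is an addition to drop the line breaks and indentation around the text. The expression uses local-name() because the element sits in a default namespace, so a plain //string would not select it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExtract {
    // Hypothetical helper: evaluates the XPath expression against the XML text
    // and returns the string value of the first matching element.
    static String extract(String xml) throws XPathExpressionException {
        String contents = XPathFactory.newInstance().newXPath().evaluate(
                "//*[local-name()='string']",
                new InputSource(new StringReader(xml)));
        // The string value keeps surrounding whitespace, so trim it.
        return contents.trim();
    }

    public static void main(String[] args) throws XPathExpressionException {
        String str = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
                + "    तुम्हारा नाम क्या है\n"
                + "</string>";
        System.out.println(extract(str)); // prints: तुम्हारा नाम क्या है
    }
}
```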

How to extract a multilingual string between XML tags
