Question

I have a bit of HTML:

<div class="content" itemprop="softwareVersion"> 2.3  </div>

(This is my app's version on the Play Store.) What I want to do is get the latest version using pattern matching.

The pattern matching I have so far is:

String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> [^ <]*</dd");
Matcher matcher = pattern.matcher(htmlString);
matcher.find();

How do I now extract 2.3 from htmlString?

Answer 1

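Using the JSoup HTML parser

It is well known that you should not parse XHTML with regular expressions unless you know exactly which set of HTML you will be parsing. You should use an HTML parser instead, such as JSoup. So you can use something like this: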

 String htmlString = "YOUR HTML HERE";
 Document document=Jsoup.parse(htmlString);
 Element element=document.select("div[itemprop=softwareVersion]").first();
 System.out.println(element.text());

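Regular expression approach

However, if you do want to use a regular expression, you have to use a capturing group and then read its contents.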

String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) *</div");
                                               //     ^------^ Here
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Answer 2

Try capturing it in a capture group?

("softwareVersion\">([^<]*)</div");

Then use matcher.group(1) to access the value.
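
One thing to watch with a `[^<]*` group is that it also captures the spaces around the version. A minimal sketch of trimming the capture follows, assuming the closing tag is `</div>` as in the HTML in the question; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class TrimmedCapture {
    // Assumption: the closing tag is </div>, as in the question's HTML.
    private static final Pattern VERSION =
            Pattern.compile("softwareVersion\">([^<]*)</div");

    // Returns the capture with surrounding whitespace removed,
    // or null when the pattern does not match.
    static String version(String html) {
        Matcher matcher = VERSION.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
        // Brackets make any leftover whitespace visible.
        System.out.println("[" + version(html) + "]");
    }
}
```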

Answer 3

I had to adjust a few things to make this work:

String htmlString = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) +</div");
Matcher matcher = pattern.matcher(htmlString);
if (matcher.find())
{
    System.out.println(matcher.group(1));
}
//else??

The () in the regex is what makes the value available through matcher.group(1).
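
The `//else??` branch is left open in the snippet. One possible way to handle the no-match case, sketched here with a hypothetical `VersionExtractor` class that is not part of the answer, is to return an `Optional` so the caller decides what to do:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class VersionExtractor {
    // Same pattern as in the snippet, compiled once and reused.
    private static final Pattern VERSION =
            Pattern.compile("softwareVersion\"> ([^ <]*) +</div");

    // Returns the captured version, or an empty Optional when nothing matches.
    static Optional<String> extractVersion(String html) {
        Matcher matcher = VERSION.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
        System.out.println(extractVersion(html).orElse("version not found"));
        System.out.println(extractVersion("<p>no version here</p>").orElse("version not found"));
    }
}
```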

Answer 4

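First, as the comments point out, you can't parse HTML with regular expressions (thanks to Jeff Burka for linking to the canonical answer).

Second, since you are looking at a very limited and specific case, you can match with a capture group to get the version.

Assuming the div in question doesn't span lines, my strategy is much like the attempt you posted: look for the string softwareVersion, the > character that closes the tag, optional whitespace, the version string, optional whitespace, and the start of the closing tag.

This gives a regex like softwareVersion[^>]*>\s*([0-9.]+)\s*</

From debuggex (which needs the leading .* to match the preceding part):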

.*softwareVersion[^>]*>\s*([0-9.]+)\s*</

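This gives you the version in the capture group, available as matcher.group(1)

As a Java string, that is softwareVersion[^>]*>\\s*([0-9.]+)\\s*</

I left out the div after </ because, although the value is in a div now, it might be in a span or something else in the future.
I used [0-9.] to keep it simple, so it matches 2.3 and also 3.0.1, but it would match ..382.1...33 as well. You could write one that matches a limited or arbitrary number of dotted numbers, if that matters.

[0-9]+(\.[0-9]+){0,3} matches a version number n followed by 0 to 3 dotted .n parts, so 3.0.2.1 but not 1.2.3.4.5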
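
To compare the two forms, here is a small sketch. It assumes the bounded version is written as `[0-9]+(\.[0-9]+){0,3}`, which is my reading of the "0 to 3 dotted parts" description; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
final class VersionPatterns {
    // Loose form from the answer: any run of digits and dots.
    static final Pattern LOOSE =
            Pattern.compile("softwareVersion[^>]*>\\s*([0-9.]+)\\s*</");
    // Stricter form (assumed spelling): a number plus at most three ".number" parts.
    static final Pattern STRICT =
            Pattern.compile("softwareVersion[^>]*>\\s*([0-9]+(?:\\.[0-9]+){0,3})\\s*</");

    // Returns group 1 of the first match, or null if the pattern does not match.
    static String firstVersion(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // A span instead of a div: neither pattern names the closing tag.
        String ok = "<span itemprop=\"softwareVersion\"> 3.0.2.1 </span>";
        String junk = "<div itemprop=\"softwareVersion\">..382.1...33</div>";
        System.out.println(firstVersion(LOOSE, ok));
        System.out.println(firstVersion(STRICT, ok));
        System.out.println(firstVersion(LOOSE, junk));
        System.out.println(firstVersion(STRICT, junk));
    }
}
```

Both patterns accept the well-formed version; only the loose one accepts the run of stray dots.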

Answer 5

Try this regex:

\"softwareVersion\">\s([0-9].?[0-9]?+)\s\s<\/div>

\" matches the character " literally
softwareVersion matches the characters softwareVersion literally (case sensitive)
\" matches the character " literally
> matches the characters > literally
\s match any white space character [\r\n\t\f ]
1st Capturing group ([0-9].?[0-9]?+)
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
.? matches any character (except newline)
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[0-9]?+ match a single character present in the list below
Quantifier: ?+ Between zero and one time, as many times as possible, without giving back [possessive]
0-9 a single character in the range between 0 and 9
\s match any white space character [\r\n\t\f ]
\s match any white space character [\r\n\t\f ]
< matches the characters < literally
\/ matches the character / literally
div> matches the characters div> literally (case sensitive)

https://regex101.com/r/kR7lC2/1
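
For reference, a rough sketch of how this regex might be written as a Java string and what it does and does not accept. The redundant escapes on `"` and `/` are dropped here, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class Answer5Regex {
    // Each regex backslash is doubled inside the Java string literal.
    static final Pattern PATTERN =
            Pattern.compile("\"softwareVersion\">\\s([0-9].?[0-9]?+)\\s\\s</div>");

    // Returns group 1 of the first match, or null if there is no match.
    static String capture(String html) {
        Matcher matcher = PATTERN.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Exactly one space before and two after, as in the question's HTML.
        System.out.println(capture("itemprop=\"softwareVersion\"> 2.3  </div>"));
        // The unescaped dot accepts any character between the digits.
        System.out.println(capture("itemprop=\"softwareVersion\"> 2x3  </div>"));
        // A three-part version does not fit the pattern.
        System.out.println(capture("itemprop=\"softwareVersion\"> 3.0.1  </div>"));
    }
}
```

The pattern depends on the exact whitespace in the page, so it stops matching if the spacing around the version changes.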

Extracting from HTML using a pattern matcher