Question

I want to find the <a> tags in a StringBuilder (result) and insert a word (INSERTED-WORD/) at the start of their href attribute value.

Code:

Pattern pattern = Pattern.compile("<a [a-zA-Z0-9=\":.;\\s&%_#/\\\\()\\-']*href=['\"]");
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
 int index2 = result.indexOf(matcher.group(0))+ matcher.group(0).length();
 result.insert(index2, "INSERTED-WORD/");
}

But some tags are found twice (or more), and INSERTED-WORD/ is inserted at the start of their href value two or more times.

For example, I want to find this tag:

<a class="link" href="www.example.com">link</a>

and change it to

<a class="link" href="INSERTED-WORD/www.example.com">link</a>

But this code changes it to

<a class="link" href="INSERTED-WORD/INSERTED-WORD/INSERTED-WORD/www.example.com">link</a>

How can I fix this?

Answer 1

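The behaviour you see is caused by using indexOf. When the same text is matched more than once, indexOf searches for that matched string again and always returns the index of its first occurrence.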
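A minimal sketch of that effect, under a few assumptions: the class name IndexOfPitfall, the method insertUsingIndexOf, and the simplified pattern are hypothetical and not from the question. The loop matches against an immutable snapshot of the input, so the only thing left to go wrong is the indexOf lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexOfPitfall {
    // Simplified, hypothetical pattern: the start of an <a> tag up to the opening href quote.
    private static final Pattern A_HREF = Pattern.compile("<a [^>]*?href=['\"]");

    // Mirrors the question's loop, but matches against the original String so that
    // the Matcher itself is not disturbed by the insertions.
    public static String insertUsingIndexOf(String html, String word) {
        StringBuilder result = new StringBuilder(html);
        Matcher matcher = A_HREF.matcher(html);
        while (matcher.find()) {
            String match = matcher.group(0);
            // indexOf returns the first occurrence of this text, not the position
            // of the match that was just found.
            int index = result.indexOf(match) + match.length();
            result.insert(index, word);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<a class=\"link\" href=\"a.html\">a</a>"
                    + "<a class=\"link\" href=\"b.html\">b</a>";
        System.out.println(insertUsingIndexOf(html, "INSERTED-WORD/"));
    }
}
```

Both tags begin with the same text, so both insertions land in the first tag: its href receives the word twice and the second tag is left unchanged.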

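That isn't the only problem with your code. You also modify result while matcher is still using it; Matcher is not designed to cope with that and cannot work properly. One obvious problem is that it will think result is shorter than it actually is, and there may be other problems.

The following will fix your code:

Pattern pattern = Pattern.compile("<a [a-zA-Z0-9=\":.;\\s&%_#/\\\\()\\-']*href=['\"]");
Matcher matcher = pattern.matcher(result.toString()); // Create new String instead of using result
int found = 0;
while (matcher.find()) {
 int index2 = matcher.end();
 result.insert(index2 + found++ * "INSERTED-WORD/".length(), "INSERTED-WORD/");
}

I'll leave it to you to figure out why found is needed: run the code without it and see what happens.

Notes

This is not a good way to solve the problem; anubhava provided a much simpler solution in the comments:

result = new StringBuilder(result.toString().replaceAll("<a [^>]*?href=\"(?!INSERTED-WORD/)", "$0INSERTED-WORD/"));

The recommended way to parse HTML is to use an HTML parser; https://jsoup.org/ is a good one.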
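To compare the two regex-based fixes side by side, here is a small sketch. The class name HrefPrefixer, the method names, and the simplified pattern in withOffset are assumptions made for illustration; the replaceAll expression is the one quoted from the comments, which only handles double-quoted href values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefPrefixer {
    static final String WORD = "INSERTED-WORD/";

    // Offset-tracking fix: match against an immutable snapshot and shift each
    // insertion point by the number of characters already inserted before it.
    public static String withOffset(String html) {
        StringBuilder result = new StringBuilder(html);
        // Simplified, hypothetical pattern (not the question's character class).
        Matcher matcher = Pattern.compile("<a [^>]*?href=['\"]").matcher(html);
        int found = 0;
        while (matcher.find()) {
            result.insert(matcher.end() + found * WORD.length(), WORD);
            found++;
        }
        return result.toString();
    }

    // Single-call variant from the comments: $0 re-emits the matched prefix, and the
    // negative lookahead skips hrefs that already start with the word.
    public static String withReplaceAll(String html) {
        return html.replaceAll("<a [^>]*?href=\"(?!INSERTED-WORD/)", "$0INSERTED-WORD/");
    }

    public static void main(String[] args) {
        String html = "<a class=\"link\" href=\"a.html\">a</a>"
                    + "<a class=\"link\" href=\"b.html\">b</a>";
        System.out.println(withOffset(html));
        System.out.println(withReplaceAll(html));
    }
}
```

On this input both methods prefix each href exactly once. Thanks to the lookahead, running withReplaceAll a second time on its own output changes nothing, whereas withOffset would insert the word again.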
