Question

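Upd: I am using Jsoup to parse the text.
While parsing one site I ran into a problem: when I fetch the HTML text, some links are broken by a space at a random position. For example: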

What a pretty flower! <a href="www.goo gle.com/...">here</a> and <a href="w ww.google.com...">here</a>

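As you may notice, the position of the space is completely random, but one thing is certain: it is always inside the href attribute. Of course, I could use the replace(" ", "") method, but there may be two or more links. How can I solve this?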

Answer 1

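This is an old-fashioned solution, but I would try using the old, retired Apache ECS to parse your HTML. Then, for the href links only, you can remove the spaces and recreate everything :-) If I remember correctly, there is a way to parse HTML into an ECS "DOM".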

http://svn.apache.org/repos/asf/jakarta/ecs/branches/ecs/src/java/org/apache/ecs/html2ecs/Html2Ecs.java

Another option is to use something like XPath to pick out your hrefs selectively, but then you have to deal with malformed HTML (you could give Tidy a chance: http://infohound.net/tidy/).

Answer 2

You can use regular expressions to find and "refine" the URLs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLRegex {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        final String INPUT = "Hello World <a href=\"http://ww w.google.com\">Google</a> Second " + 
                             "Hello World <a href=\"http://www.wiki pedia.org\">Wikipedia</a> Test" + 
                             "<a href=\"https://www.example.o rg\">Example</a> Test Test";
        System.out.println(INPUT);

        // This pattern matches a sequence of one or more spaces.
        // Precompile it here, so we don't have to do it in every iteration of the loop below.
        Pattern SPACES_PATTERN = Pattern.compile("\\u0020+");

        // The regular expression below is very primitive and does not check whether the URL is valid.
        // Moreover, only very simple URLs are matched. If a URL uses a different protocol, contains account credentials, etc., it is not matched.
        // For more sophisticated regular expressions have a look at: http://stackoverflow.com/questions/161738/
        Pattern PATTERN_A_HREF = Pattern.compile("https?://[A-Za-z0-9\\.\\-\\u0020\\?&\\=#/]+");
        Matcher m = PATTERN_A_HREF.matcher(INPUT);

        // Iterate through all matching strings:
        while (m.find()) {
            String urlThatMightContainSpaces = m.group();   // Get the current match
            Matcher spaceMatcher = SPACES_PATTERN.matcher(urlThatMightContainSpaces);
            System.out.println(spaceMatcher.replaceAll(""));  // Replaces all spaces by nothing.
        }

    }
}
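
Note that the code above only prints the cleaned URLs and needs an http:// or https:// prefix to match, while the links in the question have none. A possible variation, as a rough sketch only: match the href attribute itself and write the cleaned value back into the text, so the spaces in the visible text are left alone. The class name HrefSpaceFixer and the method fixHrefs are made up for this illustration, and the pattern assumes double-quoted attribute values with no escaped quotes inside. Regexes on HTML are fragile, so treat this as a starting point rather than a robust solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jsoup or any library.
class HrefSpaceFixer {

    // Group 1: href=" (with optional whitespace around '='),
    // group 2: the attribute value, group 3: the closing quote.
    // Assumption: values are double-quoted and contain no escaped quotes.
    private static final Pattern HREF =
            Pattern.compile("(href\\s*=\\s*\")([^\"]*)(\")");

    static String fixHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Remove spaces only inside the attribute value.
            String cleaned = m.group(2).replace(" ", "");
            String replacement = m.group(1) + cleaned + m.group(3);
            // quoteReplacement keeps '$' and '\' in the URL from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "What a pretty flower! <a href=\"www.goo gle.com/...\">here</a>"
                + " and <a href=\"w ww.google.com...\">here</a>";
        System.out.println(fixHrefs(input));
    }
}
```

Because every href is matched independently, it does not matter how many links the text contains or where the spaces fall inside them.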

Changing a difficult string with an unknown substring
