Question

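When JSoup parses HTML, a newline character inside a text string is treated as if it were not there. Consider this text, where a newline character sits between "wrap" and "here": This string of text will wrap here because of a new line character. When JSoup parses this string, it returns This string of text will wraphere because of a new line character. Note that the newline does not even become a space. I just want it returned as a space. This is the text inside a node. I have seen other solutions on Stack Overflow where people do or do not want a line break after a tag. That is not what I want. I just want to know whether I can modify the parse function so that it does not ignore newline characters.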

Answer 1

You could try getWholeText, following the answer here: Prevent Jsoup from discarding extra whitespace

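import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * Returns the cell's text, keeping the original whitespace when possible.
 *
 * @param cell element that contains whitespace formatting
 * @return the unnormalized text of the first child if it is a TextNode,
 *         otherwise the normalized result of cell.text()
 */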
public static String getText(Element cell) {
    String text = null;
    List<Node> childNodes = cell.childNodes();
    if (childNodes.size() > 0) {
        Node childNode = childNodes.get(0);
        if (childNode instanceof TextNode) {
            text = ((TextNode)childNode).getWholeText();
        }
    }
    if (text == null) {
        text = cell.text();
    }
    return text;
}

Answer 2

I figured it out. I made a mistake when fetching the HTML from the URL. I was using this method:

public static String getUrl(String url) {
    URL urlObj = null;
    try{
        urlObj = new URL(url);
    }
    catch(MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try{
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while((line = in.readLine()) != null){
            outputText += line; // readLine() drops the line terminator, so adjacent lines are fused
        }
        in.close();
    }
    catch(IOException e){
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}

when I should have been using the following:

public static String getUrl(String url) {
    URL urlObj = null;
    try{
        urlObj = new URL(url);
    }
    catch(MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try{
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while((line = in.readLine()) != null){
            outputText += line + "\n"; // re-append the line break that readLine() strips
        }
        in.close();
    }
    catch(IOException e){
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
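
The difference between the two versions can be reproduced without any network access or JSoup. The sketch below is only an illustration: the class name ReadLineDemo and both method names are made up for this example, and a StringReader stands in for the URL connection. It relies on the documented behavior that BufferedReader.readLine() returns each line without its terminator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the original answer.
class ReadLineDemo {
    // Mirrors the first loop: lines are appended with nothing between them.
    static String joinWithoutSeparator(String source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Mirrors the corrected loop: a newline is appended after every line.
    static String joinWithNewlines(String source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>will wrap\nhere</p>";
        System.out.println(joinWithoutSeparator(html)); // <p>will wraphere</p>
        System.out.println(joinWithNewlines(html));     // the line break survives
    }
}
```

With the first method the words on either side of the line break are already fused before the HTML ever reaches the parser, which matches the "wraphere" symptom described in the question.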

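This problem has nothing to do with JSoup. I thought I would note it here because I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial may run into the same problem.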
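
Since the fix belongs in the download step, another option is to skip line-by-line reading entirely so there is nothing to re-append. The following is a rough sketch, not the book's or the answer's code: the class StreamText and the helper readAll are hypothetical names, the content is assumed to be UTF-8, and InputStream.readAllBytes() requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class StreamText {
    // Reads the whole stream in one go, so every line terminator in the
    // source is kept exactly as it was sent. Assumes UTF-8 content.
    static String readAll(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A byte array stands in for urlCon.getInputStream() here.
        byte[] page = "<p>will wrap\nhere</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

In the getUrl method above, the stream returned by urlCon.getInputStream() could be passed to such a helper in place of the readLine() loop. If the page uses a different encoding, the charset would need to change accordingly.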

Newline character handling in Jsoup
