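When parsing HTML with JSoup, a newline character inside a text string is treated as if it were not there. Consider this:
This string of text will wrap
here because of a new line character
But when JSoup parses this string, it returns:
This string of text will wraphere because of a new line character
Note that the newline does not even become a space. I just want it returned as a space. This is the text inside a node. I have seen other solutions on Stack Overflow where people want, or don't want, a line break after a tag. That is not what I am after. I only want to know whether I can modify the parsing function so that it does not ignore newline characters.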
Answer 0 (score: 0)
You could try getWholeText, as suggested in the answer here: Prevent Jsoup from discarding extra whitespace
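import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * @param cell element that contains whitespace formatting
 * @return the unnormalized text of the first child if it is a text node,
 *         otherwise the element's normalized text
 */
public static String getText(Element cell) {
    String text = null;
    List<Node> childNodes = cell.childNodes();
    if (childNodes.size() > 0) {
        // Only the first child node is inspected.
        Node childNode = childNodes.get(0);
        if (childNode instanceof TextNode) {
            // getWholeText() keeps the original whitespace, including newlines.
            text = ((TextNode) childNode).getWholeText();
        }
    }
    if (text == null) {
        // Fall back to the normalized text of the element.
        text = cell.text();
    }
    return text;
}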
Answer 1 (score: 0)
I figured it out. I made a mistake when fetching the HTML from the URL. I was using this method:
public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            outputText += line;
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
when I should have been using the following:
public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // readLine() strips the line terminator, so add it back.
            outputText += line + "\n";
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
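The problem had nothing to do with JSoup. I thought I would note it here because I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial may run into the same problem.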
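To see why the line breaks were lost before JSoup ever saw the text: BufferedReader.readLine() returns each line without its line terminator, so concatenating the results glues the last word of one line to the first word of the next. Below is a minimal sketch of that behaviour, using a StringReader in place of the network connection. The class and method names (ReadLineDemo, joinWithoutNewlines, joinWithNewlines) are made up for illustration and are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; a StringReader stands in for the URL connection.
final class ReadLineDemo {

    // Mirrors the first getUrl version: lines are concatenated as-is.
    static String joinWithoutNewlines(String input) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line); // the line terminator is already gone here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Mirrors the corrected version: a newline is appended after every line.
    static String joinWithNewlines(String input) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "This string of text will wrap\nhere because of a new line character";
        // prints: This string of text will wraphere because of a new line character
        System.out.println(joinWithoutNewlines(text));
        // prints the text on two lines again
        System.out.print(joinWithNewlines(text));
    }
}
```

The sketch uses a StringBuilder rather than repeated String concatenation, which avoids copying the whole buffer on every line; the behaviour with respect to newlines is otherwise the same as in the two getUrl versions.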