Question

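So I've extracted the raw HTML from a website, but it all ends up in a single string. I'd like to split it into lines, the way Chrome's "View page source" shows it.

Here is my code: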

String url = "https://stratechery.com/2016/how-google-cloud-platform-is-challenging-aws/"; // scrape(url, "more complete Footwear.txt", 9000);

    System.out.println(br2nl(url));
    Document doc = Jsoup.connect(url)
            .data("query", "Java")
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(3000)
            .post();
    String rawhtml = doc.toString();
    String[] lines = rawhtml.split("\"" + " ");

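I tried splitting the "rawhtml" string on a double quote followed by a space, but that sequence appears all over every line, so the string got split everywhere.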

Answer 1

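I think you may be missing the point of Jsoup.

You don't need to parse the HTML line by line; Jsoup has methods for that. The HTML is already parsed into the Jsoup Document you created, and you can now access its elements individually or in groups. The possibilities are endless; see the official documentation: https://jsoup.org/cookbook/

That said, to answer your question: to split the whole HTML string on line breaks, you can do this: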
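As a rough illustration of accessing elements instead of splitting text, here is a minimal sketch. The class name ElementAccessSketch, its two helper methods, and the inline sample HTML are hypothetical and not part of the original question; it assumes jsoup is on the classpath and parses a string rather than fetching a URL.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class ElementAccessSketch {

  // Returns the text of the <title> element of the given HTML.
  static String titleOf(String html) {
    Document doc = Jsoup.parse(html);
    return doc.title();
  }

  // Collects the href attribute of every <a> element that has one.
  static List<String> linkTargets(String html) {
    Document doc = Jsoup.parse(html);
    List<String> targets = new ArrayList<>();
    for (Element link : doc.select("a[href]")) {
      targets.add(link.attr("href"));
    }
    return targets;
  }
}
```

With a Document obtained from Jsoup.connect(url).get(), the same select and attr calls apply directly, so no manual string splitting is needed to reach individual tags.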

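import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

  public static void main(String[] args) throws IOException {

    String url = "https://stratechery.com/2016/how-google-cloud-platform-is-challenging-aws/";

    // Fetch the page with a plain GET request and let Jsoup parse it.
    Document doc = Jsoup.connect(url)
        .userAgent("Mozilla")
        .get();

    // doc.toString() returns the serialized HTML; split it on newlines and print each line.
    for (String s : doc.toString().split("\\n")) {
      System.out.println(s);
    }
  }
}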
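Note that doc.toString() returns the HTML as Jsoup re-serializes it, so the lines may not match the browser's "View page source" exactly. If the string might also contain Windows-style line endings, a slightly more tolerant split could look like the sketch below. The class LineSplitSketch and its method splitLines are hypothetical names; the sketch uses only the Java standard library and swaps the "\\n" pattern for the "\\R" linebreak matcher (Java 8 and later).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class LineSplitSketch {

  // Splits text on any line break: \n, \r\n, or a lone \r.
  // \R treats \r\n as a single break, so no empty lines are produced for it.
  static List<String> splitLines(String html) {
    return Arrays.asList(html.split("\\R"));
  }
}
```

Calling LineSplitSketch.splitLines(doc.toString()) would then give the same lines as the loop above when the text contains only "\n" breaks.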
