Question

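I got a set of 1000 pages (links) by sending a query to Google. I am using JSoup. I want to get rid of images, links, menus, videos, etc. and extract only the main article from each page.

My problem is that every page has a different DOM tree, so I can't use the same command for every page! Do you know how to do this for all 1000 pages at once? I suppose I have to use regular expressions. Something like this, perhaps: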

textdoc.body().select("[id*=main]").text();    // elements whose id contains "main"
textdoc.body().select("[class*=main]").text(); // elements whose class contains "main"
textdoc.body().select("[id*=content]").text(); // elements whose id contains "content"

But I feel I will always miss something this way. Any better ideas?

Answer 1

Element main = doc.select("div.main").first(); // null when the page has no div.main
Elements links = main.select("a[href]");       // throws NullPointerException if main is null

Do all of the different pages use a main class for the main article?
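If the pages do not all share a div.main container, one possible approach is to try several candidate selectors in order and fall back to the whole body text. The following is only a rough sketch under that assumption: the class name MainTextExtractor, the method extractMainText, and the candidate list are made up for illustration and would need tuning for the actual sites. It uses only Jsoup.parse, select, first, remove, and text from JSoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MainTextExtractor {
    // Hypothetical candidate selectors, tried in order; adjust for the sites at hand.
    private static final String[] CANDIDATES = {
        "article", "[id*=main]", "[class*=main]", "[id*=content]", "[class*=content]"
    };

    static String extractMainText(Document doc) {
        // Drop elements that rarely hold article text (this modifies the document).
        doc.select("script, style, nav, header, footer, aside").remove();

        for (String selector : CANDIDATES) {
            Element match = doc.select(selector).first(); // null when nothing matches
            if (match != null && !match.text().isEmpty()) {
                return match.text();
            }
        }

        // No candidate matched: fall back to whatever text remains in the body.
        Element body = doc.body();
        return body == null ? "" : body.text();
    }

    public static void main(String[] args) {
        String html = "<html><body><nav><a href='/'>Home</a></nav>"
                + "<div id='main-content'><p>Hello article.</p></div></body></html>";
        System.out.println(extractMainText(Jsoup.parse(html)));
    }
}
```

The same method can be called in a loop over all 1000 documents. Substring selectors such as [class*=main] can also match unrelated wrappers, so the results are worth spot-checking on a sample of pages.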
