Question

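I have been trying to collect all the visible text strings from a given web page in Java, but I have not been able to get it working. I searched on Google and found four possible approaches: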

1. jsoup: Java HTML Parser
2. Apache Nutch
3. Chrome extension 
4. crawler4j (https://github.com/yasserg/crawler4j), a web crawler

Can someone guide me with some working code? For example:

Let's say the given URL is google.com.

Then the output should be:

Sign In
  Gmail
  Images
  Google Search
  I'm Feeling Lucky
  Google.co.in offered in
  हिन्दी
  ગુજરાતી
  About
  Privacy
  ...and, in the same way, every other string that I can see on the web page.

Answer 1

I was able to extract all the text using Node.js. Here is the script. Step 1: fetch the page and save its HTML to the file test.html.

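// Fetch the page and print its raw HTML.
// Redirect the output to a file, e.g. `node script.js > test.html`
// (script.js is a placeholder name for this file).
var request = require('request');
var cheerio = require('cheerio'); // loaded here but not used in this step

request('https://www.bajajallianz.com/Corp/new-index.jsp', function (error, response, html) {
  // Print the body only when the request succeeded with HTTP 200.
  if (!error && response.statusCode === 200) {
    console.log(html);
  }
});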

Step 2: convert the saved HTML to plain text.

cat test.html | html-to-text > test.txt
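
The two steps above could also be combined in one script with no extra packages. The following is only a minimal sketch under some assumptions: the function names `extractVisibleStrings` and `collectStrings` are made up for illustration, the tag stripping is a naive regex pass rather than a real HTML parser, and `fetch` is the one built into Node 18 and later. It only sees the static HTML, so text that the page renders with JavaScript will be missing.

```javascript
// Naive sketch: turn an HTML string into a list of visible text strings.
// Regex-based stripping is an assumption made for brevity; a real HTML
// parser handles malformed markup far more reliably.
function extractVisibleStrings(html) {
  const text = html
    // Drop blocks whose contents are never rendered as page text.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Collapse source whitespace so line breaks in the markup do not split a string.
    .replace(/\s+/g, ' ')
    // Turn every remaining tag into a line break, so each element's text
    // ends up on its own line (this also splits inline tags such as <b>).
    .replace(/<[^>]+>/g, '\n')
    // Decode a handful of common entities (not a complete list).
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&');

  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Fetch a page with the global fetch() of Node 18+ and return its strings.
async function collectStrings(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractVisibleStrings(await response.text());
}
```

A call such as `collectStrings('https://www.google.com').then((lines) => console.log(lines.join('\n')))` would then print one string per line, similar to the output format asked for in the question.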

How to collect all available strings on a given website
