Question

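I'm writing some Java code to run NLP tasks on text from Wikipedia. How can I use JSoup to extract all of the text of a Wikipedia article (for example, all of the text at http://en.wikipedia.org/wiki/Boston)?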

Answer 1

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
contentDiv.toString(); // The result

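Of course, this retrieves the formatted content, i.e. the HTML markup of the element. If you need the "raw" content, you can filter the result with Jsoup.clean or call contentDiv.text() instead.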
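To make the difference between the two options concrete, here is a minimal sketch that runs on an inline HTML string rather than fetching the live page. The class name WikiTextSketch, the helper names plainText and stripTags, and the sample HTML are made up for illustration. It also assumes jsoup 1.14.1 or newer, where the allow-list class is called Safelist (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, for illustration only.
class WikiTextSketch {

    // Returns the visible text of the element with id "content",
    // or an empty string if the page has no such element.
    static String plainText(String html) {
        Document doc = Jsoup.parse(html);
        Element contentDiv = doc.getElementById("content");
        return contentDiv == null ? "" : contentDiv.text();
    }

    // Removes every tag from an HTML fragment. Safelist.none() permits
    // no elements at all, so only text nodes survive. Note that the
    // result is still HTML-escaped (for example, "&" stays "&amp;").
    static String stripTags(String htmlFragment) {
        return Jsoup.clean(htmlFragment, Safelist.none());
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a downloaded article.
        String html = "<html><body><div id=\"content\"><h1>Boston</h1>"
                + "<p>Boston is the <b>capital</b> of Massachusetts.</p>"
                + "</div></body></html>";

        System.out.println(plainText(html));
        System.out.println(stripTags("<p>Boston is the <b>capital</b>.</p>"));
    }
}
```

For NLP input, text() is usually the more convenient of the two, since it returns unescaped text with whitespace normalized, while Jsoup.clean returns a cleaned HTML string.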

Answer 2

String url = "http://en.wikipedia.org/wiki/Boston";
Document doc = Jsoup.connect(url).get();

// Select every paragraph inside the article body
Elements paragraphs = doc.select(".mw-content-ltr p");

// Print the text of each paragraph in document order;
// the loop simply does nothing if no paragraph matched
for (Element p : paragraphs) {
    System.out.println(p.text());
}

Answer 3

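// timeout(...) only configures the connection; get() performs the request
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();

// select(...) returns Elements, so take the first match to get a single Element
// ("#iamID" is a placeholder for the id of the tag you want)
Element iamcontaningIDofintendedTAG = doc.select("#iamID").first();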

System.out.println(iamcontaningIDofintendedTAG.toString());

OR

Elements iamcontaningCLASSofintendedTAG= doc.select(".iamCLASS") ;

System.out.println(iamcontaningCLASSofintendedTAG.toString());

jsoup - Extracting text from a Wikipedia article

3 answers: