Java Jsoup question: how do I split text by word?

Time: 2018-12-02 16:12:57

Tags: java jsoup

I want to get the HTML content of a page without the tags. This is the code I tried:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.safety.Whitelist;

public class PreProcessing {

    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("filename.txt");
        URL url = new URL("https://en.wikipedia.org/wiki/Distributed_computing");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine = "";
        String input = "";

        while ((inputLine = in.readLine()) != null)
        {
            input += inputLine;
            //          System.out.println(inputLine);
        }

        //create Jsoup document from HTML
        Document jsoupDoc = Jsoup.parse(input);

        //set pretty print to false, so \n is not removed
        jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));

        //select all <br> tags and append \n after that
        //        jsoupDoc.select("br").after("\\n");

        //select all <p> tags and prepend \n before that
        //        jsoupDoc.select("p").before("\\n");

        //get the HTML from the document, and retaining original new lines
        String str = jsoupDoc.html().replaceAll(" ", "\n");
        //        str.replaceAll("\t", "");

        String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
        strWithNewLines.replaceAll("\t", "\n");
        strWithNewLines.replaceAll("\"", "");
        strWithNewLines.replaceAll(".", "");

        System.out.println(strWithNewLines);
        out.print(strWithNewLines);
    }
}

I ran the code above on the page en.wiki~ distributed_computin. It reads the page with a BufferedReader, parses it with Jsoup, and replaces each " " with "\n", because I want the output as word \n word \n word \n.

The text looks like this:

Distributed computing - Wikipedia Distributed computing From Wikipedia, the free encyclopedia Jump to navigation Jump to search "Distributed application" redirects here. For trustless applications, see

But I want a result like this:

Distributed
computing
-
Wikipedia
Distributed
computing
From
Wikipedia
the
free
encyclopedia
Jump
to
navigation
Jump
to
search
Distributed
application
redirects
here
For
trustless
applications
see

I tried:

strWithNewLines.replaceAll("\"", "");

strWithNewLines.replaceAll(".", "");

But this did not work. Why not? I searched Google but could not find a solution.

1 answer:

Answer 0: (score: 1)

Try this for the last few lines. It will get you closer to the result you want:

String strWithNewLines = Jsoup.clean ...;
String result = strWithNewLines.replaceAll("\t", "\n")
    .replaceAll("\"", "");
    //.replaceAll(".", "");

System.out.println(result);

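The problem in your code is that String is immutable. String.replaceAll does not change the original string; it returns a new string with the replacements applied. Your code never uses that returned value.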
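As a minimal sketch of that point, the following contrasts discarding the return value of replaceAll with keeping it. The class and method names are hypothetical, chosen only for illustration.

```java
public class ImmutableStringDemo {

    // Calls replaceAll but discards the returned string, as the question's code does.
    static String discardResult(String s) {
        s.replaceAll("\t", "\n"); // the new string is thrown away; s itself is unchanged
        return s;
    }

    // Keeps the string returned by each replaceAll call by chaining them.
    static String keepResult(String s) {
        return s.replaceAll("\t", "\n").replaceAll("\"", "");
    }

    public static void main(String[] args) {
        String text = "\"Distributed\tapplication\"";
        System.out.println(discardResult(text)); // still contains the tab and the quotes
        System.out.println(keepResult(text));    // tab replaced by a newline, quotes removed
    }
}
```

Assigning the result back (`s = s.replaceAll(...)`) would work equally well; chaining just avoids the intermediate variables.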

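.replaceAll(".", "") is also a problem. replaceAll treats its first argument as a regular expression, and in a regex . matches every character except line terminators, so this call replaces nearly the whole text with nothing. To remove literal periods, escape the dot (replaceAll("\\.", "")) or use replace(".", ""), which does not interpret its argument as a regex.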
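For the word-per-line output the question asks for, here is one possible sketch. It assumes the tags have already been stripped (for example by the Jsoup.clean call above) and that removing periods, commas and double quotes is enough for your purposes. The class name WordPerLine and the method toWordLines are hypothetical.

```java
public class WordPerLine {

    // Removes literal periods, commas and double quotes, then puts each word on its own line.
    static String toWordLines(String text) {
        String cleaned = text
                .replace(".", "")   // replace() treats "." literally, unlike replaceAll()
                .replace(",", "")
                .replace("\"", "");
        // Split on any run of whitespace (spaces, tabs, newlines) and rejoin with "\n".
        return String.join("\n", cleaned.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String sample = "\"Distributed application\" redirects here. For trustless applications, see";
        System.out.println(toWordLines(sample));
    }
}
```

Splitting on `\\s+` rather than replacing single spaces avoids blank lines when the text contains consecutive spaces or tabs.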