I want to get the content of an HTML page without the tags. This is the code I tried:
public class PreProcessing {
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("filename.txt");
        URL url = new URL("https://en.wikipedia.org/wiki/Distributed_computing");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine = "";
        String input = "";
        while ((inputLine = in.readLine()) != null)
        {
            input += inputLine;
            // System.out.println(inputLine);
        }
        //create Jsoup document from HTML
        Document jsoupDoc = Jsoup.parse(input);
        //set pretty print to false, so \n is not removed
        jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));
        //select all <br> tags and append \n after that
        // jsoupDoc.select("br").after("\\n");
        //select all <p> tags and prepend \n before that
        // jsoupDoc.select("p").before("\\n");
        //get the HTML from the document, and retaining original new lines
        String str = jsoupDoc.html().replaceAll(" ", "\n");
        // str.replaceAll("\t", "");
        String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
        strWithNewLines.replaceAll("\t", "\n");
        strWithNewLines.replaceAll("\"", "");
        strWithNewLines.replaceAll(".", "");
        System.out.println(strWithNewLines);
        out.print(strWithNewLines);
    }
}
The code above reads en.wiki~ distributed_computing with a BufferedReader, parses it into jsoupDoc, and replaces " " with "\n", because I want output of the form word \n word\n word\n.
The result is:
Distributed
computing
-
Wikipedia Distributed
computing From
Wikipedia,
the
free
encyclopedia Jump
to
navigation Jump
to
search "Distributed
application"
redirects
here.
For
trustless
applications,
see
But I want a result like this:
Distributed
computing
-
Wikipedia
Distributed
computing
From
Wikipedia
the
free
encyclopedia
Jump
to
navigation
Jump
to
search
Distributed
application
redirects
here
For
trustless
applications
see
To remove the quotes and punctuation, I tried:
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
But this did not work. Why not? I searched on Google but could not find a solution.
Answer 0 (score: 1)
Try this for the last few lines. It will get you closer to the desired result:
String strWithNewLines = Jsoup.clean ...;
String result = strWithNewLines.replaceAll("\t", "\n")
        .replaceAll("\"", "");
        //.replaceAll(".", "");
System.out.println(result);
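The problem in your code is that String is immutable. String.replaceAll does not change anything in the original String; it returns a new string with the replacements applied. Your code never uses that return value, so the replacements are lost.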
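The immutability point can be checked in isolation. The following is a minimal sketch that uses only the standard library; the class name ImmutableStringDemo and the helper stripTabsAndQuotes are hypothetical names introduced for illustration, not part of the question's code.

```java
public class ImmutableStringDemo {

    // Hypothetical helper: returns a copy of the input with tabs turned into
    // newlines and double quotes removed. The input itself is never modified.
    static String stripTabsAndQuotes(String text) {
        return text.replaceAll("\t", "\n").replaceAll("\"", "");
    }

    public static void main(String[] args) {
        String original = "\"Distributed\tapplication\"";

        // Return value discarded: 'original' is left untouched.
        original.replaceAll("\"", "");
        System.out.println(original.contains("\"")); // true

        // Return value kept: the replacements are visible in 'cleaned'.
        String cleaned = stripTabsAndQuotes(original);
        System.out.println(cleaned);
    }
}
```

Chaining the calls, as in the helper, keeps each intermediate result without needing a variable per step.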
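.replaceAll(".", "") is a separate problem. The first argument is a regular expression, and in a regex . matches any character (by default, any character except line terminators). The call therefore replaces practically every character with the empty string and leaves you with nothing but line breaks. To remove literal periods, escape the dot with .replaceAll("\\.", "") or use the non-regex .replace(".", "").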
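To see the difference between the regex dot and a literal period, here is a small sketch using only java.lang.String. The class name DotRegexDemo and the helper removePunctuation are hypothetical, and the set of characters it strips (double quote, comma, period) is an assumption based on the desired output shown in the question.

```java
public class DotRegexDemo {

    // Hypothetical helper: removes double quotes, commas and literal periods.
    // Inside a character class the dot is literal, so it needs no escaping.
    static String removePunctuation(String text) {
        return text.replaceAll("[\",.]", "");
    }

    public static void main(String[] args) {
        String text = "redirects\nhere.\nFor";

        // Unescaped '.' matches every character except line terminators,
        // so only the two '\n' characters survive.
        String wiped = text.replaceAll(".", "");
        System.out.println(wiped.equals("\n\n")); // true

        // Escaped dot (regex) and String.replace (no regex) both target
        // the literal period only.
        System.out.println(text.replaceAll("\\.", "").equals(text.replace(".", ""))); // true

        System.out.println(removePunctuation("\"Distributed\napplication\"\nhere."));
    }
}
```

A character class handles all three characters in one pass, which avoids one replaceAll call per character.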