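Extracting JSON from a JSP page as a String

Time: 2017-04-11 19:15:01

Tags: java json jsoup

I am parsing the page source of https://massive.ucsd.edu/ProteoSAFe/datasets.jsp. I want to parse the .jsp output and extract the JSON object embedded in it.

I am using Jsoup to fetch the page: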

Document doc = Jsoup.connect("https://massive.ucsd.edu/ProteoSAFe/datasets.jsp").maxBodySize(0).get();

Then I use a Java Pattern to extract the JSON as a string:

Pattern p = Pattern.compile(String.format("\"%s\":\\s*(.*),", "dataset","\"%s\":\\s*(.*),", "datasetNum","\"%s\":\\s*(.*),", "title","\"%s\":\\s*(.*),", "user","\"%s\":\\s*(.*),", "site","\"%s\":\\s*(.*),", "flowname","\"%s\":\\s*(.*),", "createdMillis","\"%s\":\\s*(.*),", "created","\"%s\":\\s*(.*),", "fileCount","\"%s\":\\s*(.*),", "fileSizeKB","\"%s\":\\s*(.*),", "psms","\"%s\":\\s*(.*),", "peptides","\"%s\":\\s*(.*),", "variants","\"%s\":\\s*(.*),", "proteins","\"%s\":\\s*(.*),", "species","\"%s\":\\s*(.*),", "instrument","\"%s\":\\s*(.*),", "modification","\"%s\":\\s*(.*),", "pi","\"%s\":\\s*(.*),", "complete","\"%s\":\\s*(.*),", "status","\"%s\":\\s*(.*),", "private","\"%s\":\\s*(.*),", "hash","\"%s\":\\s*(.*),", "px","\"%s\":\\s*(.*),", "task","\"%s\":\\s*(.*),", "id"));

Matcher m = p.matcher(script.html());

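Doing this gives me an error. The last line is not parsed correctly: the match is cut off at the end, so I get the error

"A JSONObject text must end with '}' at character 577".

Can anyone suggest a better way to parse this page and get the data?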

1 answer:

Answer 0: (score: 1)

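Parsing any HTML with regular expressions is generally a bad idea, though in this case it does the job.

This worked for me: Pattern.compile("(?s)var datasets = (\\[.*?\\]);")

(Tested in Python, since that is all I have available.)

Note that the captured text is a JSON array, so parse it as a JSONArray, not a JSONObject.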
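As a minimal sketch of how the pattern above could be applied in Java: the class name DatasetsExtractor, the helper extractDatasetsJson, and the sample page content are made up for illustration, and the real page is assumed to contain a `var datasets = [...];` assignment as the pattern implies. Only java.util.regex is used, so the snippet runs without Jsoup or a JSON library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DatasetsExtractor {
    // (?s) lets '.' match line breaks; the lazy .*? stops at the first "];"
    // that follows the opening bracket.
    private static final Pattern DATASETS =
            Pattern.compile("(?s)var datasets = (\\[.*?\\]);");

    /** Returns the JSON array text assigned to "var datasets", if present. */
    static Optional<String> extractDatasetsJson(String pageSource) {
        Matcher m = DATASETS.matcher(pageSource);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the page source; the real page is much larger.
        String sample = "<script>\nvar datasets = [\n"
                + "  {\"dataset\": \"EXAMPLE1\", \"title\": \"first\"},\n"
                + "  {\"dataset\": \"EXAMPLE2\", \"title\": \"second\"}\n"
                + "];\nvar other = 1;\n</script>";
        System.out.println(extractDatasetsJson(sample).orElse("no match"));
    }
}
```

With the Jsoup call from the question, the input would be doc.html(). If you are using org.json, which the error message in the question suggests, the extracted string can then be passed to new JSONArray(...). One caveat: because the match is lazy, a string value that itself contains "];" would truncate the result, so treat this as a workaround and not a general parser.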