I am parsing the page view-source:https://massive.ucsd.edu/ProteoSAFe/datasets.jsp. I want to parse the .jsp output and extract the JSON object embedded in it.
I am using Jsoup to fetch the page:
Document doc = Jsoup.connect("https://massive.ucsd.edu/ProteoSAFe/datasets.jsp").maxBodySize(0).get();
Then I use a Java Pattern to extract the JSON as a string:
Pattern p = Pattern.compile(String.format("\"%s\":\\s*(.*),", "dataset","\"%s\":\\s*(.*),", "datasetNum","\"%s\":\\s*(.*),", "title","\"%s\":\\s*(.*),", "user","\"%s\":\\s*(.*),", "site","\"%s\":\\s*(.*),", "flowname","\"%s\":\\s*(.*),", "createdMillis","\"%s\":\\s*(.*),", "created","\"%s\":\\s*(.*),", "fileCount","\"%s\":\\s*(.*),", "fileSizeKB","\"%s\":\\s*(.*),", "psms","\"%s\":\\s*(.*),", "peptides","\"%s\":\\s*(.*),", "variants","\"%s\":\\s*(.*),", "proteins","\"%s\":\\s*(.*),", "species","\"%s\":\\s*(.*),", "instrument","\"%s\":\\s*(.*),", "modification","\"%s\":\\s*(.*),", "pi","\"%s\":\\s*(.*),", "complete","\"%s\":\\s*(.*),", "status","\"%s\":\\s*(.*),", "private","\"%s\":\\s*(.*),", "hash","\"%s\":\\s*(.*),", "px","\"%s\":\\s*(.*),", "task","\"%s\":\\s*(.*),", "id"));
Matcher m = p.matcher(script.html());
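Doing this, I get an error. The last line is not parsed correctly: the extracted text ends up cut off, so I get a
"JSONObject text must end with '}' at character 577" error.
Can anyone suggest a better way to parse this page and get the data?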
Answer 0: (score: 1)
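Parsing HTML with regular expressions is generally a bad idea, but in this case
this worked for me: Pattern.compile("(?s)var datasets = (\\[.*?\\]);")
(tested in Python, because that is all I have available).
Note that the captured text is a JSON array, so parse it as a JSONArray, not a JSONObject.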
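For reference, here is a minimal Java sketch of the same pattern, using only java.util.regex. The class name DatasetExtractor, the method extractDatasetsJson, and the sample HTML string are made up for illustration; the real page content will differ. It assumes the page assigns the data with "var datasets = [...];" and that the sequence "];" does not occur inside any string value, since the non-greedy match stops at the first one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DatasetExtractor {
    // Pattern from the answer: (?s) lets '.' match newlines, and the
    // non-greedy .*? stops at the first "];" after "var datasets = ".
    private static final Pattern DATASETS =
            Pattern.compile("(?s)var datasets = (\\[.*?\\]);");

    /** Returns the JSON array text assigned to "var datasets", or null if it is not found. */
    public static String extractDatasetsJson(String html) {
        Matcher m = DATASETS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the page source; not the real page content.
        String html = "<script>\n"
                + "var datasets = [\n"
                + "{\"dataset\":\"example-1\",\"title\":\"a, b\"},\n"
                + "{\"dataset\":\"example-2\",\"title\":\"c\"}\n"
                + "];\n"
                + "</script>";
        System.out.println(extractDatasetsJson(html));
    }
}
```

With Jsoup, the input string could come from doc.html() or from the data() of the matching script element. The result can then be passed to new JSONArray(json) from org.json. As for the original attempt, String.format ignores arguments beyond its format specifiers, so only the "dataset" key ends up in the pattern, and the greedy (.*), cuts the match at a comma. That is a likely cause of the truncated text, though I have not verified it against the live page.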