Jsoup fails to parse elements in rare cases

Time: 2017-12-20 12:31:44

Tags: java xml rss jsoup

I recently migrated the RSS parsing in my application to Jsoup. When parsing a file from the feed, Jsoup fails to parse < and > correctly, which leaves &lt; and &gt; in the retrieved Document and causes further problems when using Document::select.

MCVE

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.Collection;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        Jsoup.connect("https://rss.packetstormsecurity.com/files/page18")
             .parser(Parser.xmlParser())
             .get()
             .select("item")
             .stream()
             .map(e -> e.select("pubDate"))
             .flatMap(Collection::stream)
             .map(Element::text)
             .forEach(System.out::println);
    }
}

The code above currently prints the following (the RSS feed is updated continuously, and the problem does not occur with local files):

Wed, 22 Nov 2017 15:29:54 GMT
Wed, 22 Nov 2017 15:29:43 GMT
Wed, 22 Nov 2017 15:29:36 GMT
Wed, 22 Nov 2017 15:29:28 GMT
Wed, 22 Nov 2017 15:29:22 GMT
Wed, 22 Nov 2017 15:27:23 GMT
Tue, 21 Nov 2017 23:23:23 GMT
Tue, 21 Nov 2017 19:21:38 GMT
Tue, 21 Nov 2017 19:20:12 GMT
Tue, 21 Nov 2017 19:18:15 GMT
Tue, 21 Nov 2017 19:16:17 GMT
Tue, 21 Nov 2017 19:14:37 GMT
Tue, 21 Nov 2017 19:13:34 GMT
Tue, 21 Nov 2017 19:11:33 GMT
Tue, 21 Nov 2017 19:07:49 GMT
Tue, 21 Nov 2017 19:06:56 GMT
Tue, 21 Nov 2017 19:04:19 GMT
Tue, 21 Nov 2017 19:03:57 GMT
Tue, 21 Nov 2017 10:11:11 GMT
Tue, 21 Nov 2017 04:54:00 GMT
Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
Mon, 20 Nov 2017 22:22:00 GMT
Mon, 20 Nov 2017 16:16:00 GMT
Mon, 20 Nov 2017 16:15:00 GMT
Mon, 20 Nov 2017 16:14:00 GMT

This is the fragment of the Document that Jsoup returned to me:

<item>
 <title>Ubuntu Security Notice USN-3483-2</title>
 <link> https://packetstormsecurity.com/files/145055/USN-3483-2.txt </link>
 <guid isPermaLink="true"> https://packetstormsecurity.com/files/145055/USN-3483-2.txt </guid>
 <comments> https://packetstormsecurity.com/files/145055/Ubuntu-Security-Notice-USN-3483-2.html </comments>
 <pubDate> Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt; <!-- the affected line -->
  <description> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed. </description>
  <category></category>
 </pubDate>
</item>

Here, some characters are parsed incorrectly, even though the XML on the website is well-formed.

When I use the same URL with a trailing slash (https://rss.packetstormsecurity.com/files/page18/), the problem does not occur on that page, but it does occur on other pages.

Because the feed is live, the page on which the problem occurs also changes. If the problem no longer reproduces on page 18, I will update the question with a new page. The problem also does not occur if the file is downloaded separately and then parsed with Jsoup::parse.

Jsoup version: 1.11.2

Additional MCVE

This MCVE shows that the problem occurs only when the response is parsed by Jsoup; the XML that is actually downloaded is fine.
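For illustration only, here is a rough sketch of one way to check the raw response independently of Jsoup, using nothing but the JDK. It is not the listing from the question: the class name RawFeedCheck and both helper methods are hypothetical, and the sketch assumes the feed is served as UTF-8. The idea is simply that if the raw body never contains an entity-escaped closing tag such as &lt;/pubDate&gt;, the escaping must have been introduced later, during parsing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class RawFeedCheck {

    // Hypothetical helper: returns true if the raw XML text contains an
    // entity-escaped closing tag (for example "&lt;/pubDate&gt;").
    public static boolean hasEscapedClosingTag(final String xml, final String tagName) {
        return xml.contains("&lt;/" + tagName + "&gt;");
    }

    // Hypothetical helper: downloads the response body as text.
    // Assumption: the feed is encoded as UTF-8.
    public static String download(final String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(final String[] args) throws IOException {
        final String xml = download("https://rss.packetstormsecurity.com/files/page18");
        // If this prints false while the Jsoup output still shows the escaped
        // tag, the raw download is fine and the damage happens during parsing.
        System.out.println("escaped </pubDate> in raw XML: "
                + hasEscapedClosingTag(xml, "pubDate"));
    }
}
```

A plain substring check is deliberately crude; it only answers whether the suspicious sequence is already present before any parser touches the data.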

1 answer:

Answer 0 (score: 0)

This appears to be a bug in org.jsoup.helper.HttpConnection::get or org.jsoup.helper.HttpConnection.Response::parse; here's my corresponding github issue, and here's a repo that reproduces the bug.

This will be fixed in Jsoup 1.11.3