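I recently migrated the RSS parsing in my application from rome to jsoup. When parsing a feed directly from its source, Jsoup fails to parse < and > correctly, which results in &lt; and &gt; in the retrieved Document and causes further problems when I try to use Document::select.
The code below currently prints the output that follows it (the RSS feed is constantly updated, and the problem does not occur with local files):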
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.Collection;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        Jsoup.connect("https://rss.packetstormsecurity.com/files/page18")
                .parser(Parser.xmlParser())
                .get()
                .select("item")
                .stream()
                .map(e -> e.select("pubDate"))
                .flatMap(Collection::stream)
                .map(Element::text)
                .forEach(System.out::println);
    }
}
Wed, 22 Nov 2017 15:29:54 GMT
Wed, 22 Nov 2017 15:29:43 GMT
Wed, 22 Nov 2017 15:29:36 GMT
Wed, 22 Nov 2017 15:29:28 GMT
Wed, 22 Nov 2017 15:29:22 GMT
Wed, 22 Nov 2017 15:27:23 GMT
Tue, 21 Nov 2017 23:23:23 GMT
Tue, 21 Nov 2017 19:21:38 GMT
Tue, 21 Nov 2017 19:20:12 GMT
Tue, 21 Nov 2017 19:18:15 GMT
Tue, 21 Nov 2017 19:16:17 GMT
Tue, 21 Nov 2017 19:14:37 GMT
Tue, 21 Nov 2017 19:13:34 GMT
Tue, 21 Nov 2017 19:11:33 GMT
Tue, 21 Nov 2017 19:07:49 GMT
Tue, 21 Nov 2017 19:06:56 GMT
Tue, 21 Nov 2017 19:04:19 GMT
Tue, 21 Nov 2017 19:03:57 GMT
Tue, 21 Nov 2017 10:11:11 GMT
Tue, 21 Nov 2017 04:54:00 GMT
Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
Mon, 20 Nov 2017 22:22:00 GMT
Mon, 20 Nov 2017 16:16:00 GMT
Mon, 20 Nov 2017 16:15:00 GMT
Mon, 20 Nov 2017 16:14:00 GMT
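The affected entry stands out in the output above because its text is no longer a bare date. Below is a minimal sketch of how such entries could be flagged automatically, assuming the feed uses RFC 1123 dates as in the output. The class and method names (PubDateCheck, isCleanPubDate, findAffected) are hypothetical and not part of jsoup, and the sample strings are shortened copies of the output lines.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PubDateCheck {
    /** Returns true if the text is nothing but an RFC 1123 date, e.g. "Wed, 22 Nov 2017 15:29:54 GMT". */
    public static boolean isCleanPubDate(final String text) {
        try {
            ZonedDateTime.parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return true;
        } catch (DateTimeParseException e) {
            // Trailing markup or description text makes the parse fail.
            return false;
        }
    }

    /** Returns the entries whose text is not a clean date, i.e. the ones hit by the problem. */
    public static List<String> findAffected(final List<String> pubDates) {
        return pubDates.stream()
                .filter(text -> !isCleanPubDate(text))
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        // Shortened sample lines taken from the output above (assumption: illustrative only).
        final List<String> sample = Arrays.asList(
                "Tue, 21 Nov 2017 04:54:00 GMT",
                "Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2",
                "Mon, 20 Nov 2017 22:22:00 GMT");
        findAffected(sample).forEach(text -> System.out.println("Affected: " + text));
    }
}
```

A check like this only detects the symptom; it does not explain why the closing tag ends up inside the text.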
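Here is the snippet of the Document that Jsoup gives back to me. Some characters in it are parsed incorrectly, even though the XML on the website is well formed:
<item>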
  <title>Ubuntu Security Notice USN-3483-2</title>
  <link>
    https://packetstormsecurity.com/files/145055/USN-3483-2.txt
  </link>
  <guid isPermaLink="true">
    https://packetstormsecurity.com/files/145055/USN-3483-2.txt
  </guid>
  <comments>
    https://packetstormsecurity.com/files/145055/Ubuntu-Security-Notice-USN-3483-2.html
  </comments>
  <pubDate>
    Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt; <!-- the affected line -->
    <description>
      Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
    </description>
    <category></category>
  </pubDate>
</item>
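The problem does not occur on the same page when the same URL is used with a trailing slash (https://rss.packetstormsecurity.com/files/page18/), but it does occur on other pages.
Because the feed is live, the page on which the problem occurs also changes. If the problem stops occurring on page 18, I will update the question with a new page. The problem also does not occur if the file is downloaded separately and then parsed using Jsoup::parse.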
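For illustration, downloading the feed separately and then handing the text to Jsoup::parse could look like the sketch below. It is not the code from the question: the class and helper names (DownloadThenParse, readAll, pubDates) are hypothetical, the feed is assumed to be UTF-8, and the server is assumed to accept a plain java.net request. Only Jsoup.parse(String, String, Parser) and Parser.xmlParser() are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class DownloadThenParse {
    /** Reads the whole stream into a String, assuming UTF-8 (an assumption about the feed). */
    public static String readAll(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Parses already-downloaded XML with jsoup's XML parser and returns each item's pubDate text. */
    public static List<String> pubDates(final String xml, final String baseUri) {
        final Document doc = Jsoup.parse(xml, baseUri, Parser.xmlParser());
        return doc.select("item pubDate").stream()
                .map(Element::text)
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) throws IOException {
        final String url = "https://rss.packetstormsecurity.com/files/page18";
        final String xml;
        // Download with java.net instead of Jsoup.connect, so jsoup only sees a finished String.
        try (InputStream in = new URL(url).openStream()) {
            xml = readAll(in);
        }
        pubDates(xml, url).forEach(System.out::println);
    }
}
```

This keeps jsoup's selectors while taking its HTTP layer out of the picture, which is the combination the question reports as unaffected.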
Jsoup version: 1.11.2.
An MCVE shows that the problem occurs only when the response is parsed by Jsoup; the XML that is actually downloaded is fine.
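One way to confirm independently that the downloaded XML is well formed is to run it through the JDK's own strict XML parser, which rejects malformed input. The sketch below is only an illustration under stated assumptions, not the code from the question: the class and method names (RawXmlCheck, pubDates) are hypothetical, and the server is assumed to accept a plain java.net request.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RawXmlCheck {
    /**
     * Parses the stream with the JDK DOM parser and returns the text of every pubDate element
     * (including a channel-level one, if present). Throws SAXException if the XML is not well formed.
     */
    public static List<String> pubDates(final InputStream xml)
            throws ParserConfigurationException, SAXException, IOException {
        final Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml);
        final NodeList nodes = doc.getElementsByTagName("pubDate");
        final List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(final String[] args) throws Exception {
        final String url = "https://rss.packetstormsecurity.com/files/page18";
        try (InputStream in = new URL(url).openStream()) {
            // Reaching the println calls at all means the raw response parsed as well-formed XML.
            pubDates(in).forEach(System.out::println);
        }
    }
}
```

If this prints only bare dates while the jsoup pipeline prints a date followed by description text for the same page, the difference lies in how jsoup handles the response rather than in the feed itself.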
Answer 0 (score: 0)
This appears to be a bug in org.jsoup.helper.HttpConnection::get and org.jsoup.helper.HttpConnection.Response::parse. Here's my corresponding GitHub issue, and here's a repo that reproduces the bug.