Question

When I try to parse a website's HTML page, the app crashes with this error:

java.io.IOException: Mark has been invalidated.

Part of my code:

String xml = xxxxxx;
try {
    Document document = Jsoup.connect(xml).maxBodySize(1024*1024*10)
            .timeout(0).ignoreContentType(true)
            .parser(Parser.xmlParser()).get();

    Elements elements = document.body().select("td.hotv_text:eq(0)");

    for (Element element : elements) {
        Element element1 = element.select("a[href].hotv_text").first();
        hashMap.put(element.text(), element1.attr("abs:href"));
    }
} catch (HttpStatusException ex) {
    Log.i("GyWueInetSvc", "Exception while JSoup connect:" + xml +" cause:"+ ex.getMessage());
} catch (IOException e) {
    e.printStackTrace();
    throw new RuntimeException("Socket timeout: " + e.getMessage(), e);
}

The site I want to parse is about 2 MB. While debugging, I saw this method in ConstrainableInputStream.java:

public void reset() throws IOException {
    super.reset();
    remaining = maxSize - markpos;
}

At that point markpos is -1, and execution then goes to the exception.

How can I solve this?
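
As a rough illustration of the failure described above, the following sketch reproduces an invalidated mark using only java.io.BufferedInputStream, the JDK class that ConstrainableInputStream extends (markpos is a field inherited from it). The class name MarkResetDemo, the helper resetSucceeds, and the buffer sizes are made up for this example. It does not use jsoup and is not jsoup's code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetDemo {
    /**
     * Marks the stream, reads bytesToRead bytes one at a time, then calls reset().
     * Returns true if reset() succeeded and false if it threw an IOException.
     */
    static boolean resetSucceeds(int bufferSize, int readLimit, int bytesToRead) {
        byte[] data = new byte[64]; // stand-in for a response body
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset(); // throws if the mark was dropped while refilling the buffer
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(resetSucceeds(4, 2, 2));    // true: still inside the buffer
        System.out.println(resetSucceeds(4, 2, 10));   // false: mark invalidated
        System.out.println(resetSucceeds(4, 100, 10)); // true: read limit large enough
    }
}
```

In the second call the stream is read past a full buffer while the read limit is smaller than the buffer, so the buffered stream drops the mark (markpos becomes -1) and the later reset() throws.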

Answer 1

This helped me:

GET: .execute().bufferUp().parse();
POST: .method(Connection.Method.POST).execute().bufferUp().parse();

Answer 2

I found a solution to my problem. The cause was an overloaded buffer. The following code fixed it:

{{1}}

Answer 3

I hit the same exception when upgrading from 1.11.3 to 1.12.2. Try downgrading your dependency.

Answer 4

Use ~.execute().parse(); instead of ~.get(); to fetch the document, and remove the parser, so that your code becomes:

Document document = Jsoup.connect(xml).maxBodySize(1024*1024*10)
            .timeout(0).ignoreContentType(true)
            .execute().parse();

This is a temporary fix while we wait for a new release that fixes the bug.

Answer 5

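Adding to @ulong's answer: the documentation in the jsoup source itself suggests using bufferUp() if the document needs to be parsed more than once. Calling bufferUp() before parsing means the InputStream is not drained, which is what leads to the invalid mark error (IOException).

The documentation for parse():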

    /**
     * Read and parse the body of the response as a Document. If you intend to parse the same response multiple
     * times, you should {@link #bufferUp()} first.
     * @return a parsed Document
     * @throws IOException on error
     */
    Document parse() throws IOException;

And for bufferUp():

    /**
     * Read the body of the response into a local buffer, so that {@link #parse()} may be called repeatedly on the
     * same connection response (otherwise, once the response is read, its InputStream will have been drained and
     * may not be re-read). Calling {@link #body() } or {@link #bodyAsBytes()} has the same effect.
     * @return this response, for chaining
     * @throws UncheckedIOException if an IO exception occurs during buffering.
     */
    Response bufferUp();
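
The idea behind bufferUp() can be sketched with plain JDK streams: a stream can only be consumed once, but a byte array filled from it can be re-read as often as needed. This is an analogy, not jsoup's implementation. BufferOnceDemo, drain, readTwiceDirectly and readTwiceBuffered are hypothetical names, and a ByteArrayInputStream stands in for the network response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BufferOnceDemo {
    /** Reads everything that is left in the stream into a byte array. */
    static byte[] drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Consumes the same stream twice; the second pass finds nothing left. */
    static String[] readTwiceDirectly(String body) {
        InputStream in = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        String first = new String(drain(in), StandardCharsets.UTF_8);
        String second = new String(drain(in), StandardCharsets.UTF_8);
        return new String[] { first, second };
    }

    /** Buffers the stream once, then reads the buffered bytes twice. */
    static String[] readTwiceBuffered(String body) {
        InputStream in = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        byte[] buffered = drain(in); // single pass over the original stream
        String first = new String(drain(new ByteArrayInputStream(buffered)), StandardCharsets.UTF_8);
        String second = new String(drain(new ByteArrayInputStream(buffered)), StandardCharsets.UTF_8);
        return new String[] { first, second };
    }

    public static void main(String[] args) {
        String[] direct = readTwiceDirectly("<p>hello</p>");
        String[] buffered = readTwiceBuffered("<p>hello</p>");
        System.out.println("direct second read empty: " + direct[1].isEmpty());
        System.out.println("buffered second read: " + buffered[1]);
    }
}
```

The trade-off is memory: buffering holds the whole body at once, which is why it is only suggested when the same response has to be read more than once.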

java.io.IOException: Mark has been invalidated when parsing a website using jsoup
