Jsoup response: every second character is garbage (encoding problem?)

Asked: 2014-11-28 21:46:02

Tags: java character-encoding jsoup

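As part of an Android app, I use a Java/Jsoup-based HTML crawler. Until a few weeks ago this worked really well, but now I get very strange results when I parse the response. This is the page I crawl (all errors occur before I log in): https://www.stine.uni-hamburg.de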

This is how I obtain the Jsoup Response object:

Connection connection = Jsoup
                .connect("https://stine.uni-hamburg.de/")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0")
                .referrer("https://www.stine.uni-hamburg.de/scripts/mgrqispi.dll")
                .method(postData == null ? Connection.Method.GET : Connection.Method.POST)
                .timeout(10000)
                .cookies(m_cookies);
Connection.Response response = connection.execute();

I also tried removing every connection parameter that is not strictly necessary, and I tried a simple

Jsoup.parse(new URL("https://www.stine.uni-hamburg.de"), 10000);

I print the response like this:

System.out.println("resp: " + response);
System.out.println("status: " + response.statusCode());
System.out.println("content-type: " + response.contentType());
System.out.println("header: " + response.headers().toString());
System.out.println("content: " + response.parse().body().toString());

With either approach, the result looks like this:

resp: org.jsoup.helper.HttpConnection$Response@756535fa
status: 200
content-type: text/html
header: {ETag="09cd9ca88eccf1:0", Date=Fri, 28 Nov 2014 20:53:49 GMT, Vary=Accept-Encoding, Content-Length=1210, Last-Modified=Mon, 20 Oct 2014 17:10:48 GMT, Content-Encoding=gzip, Accept-Ranges=bytes, Content-Type=text/html, X-Powered-By=ASP.NET, Server=Microsoft-IIS/7.5}
content: ��<��!��D��O��C��T��Y��P��E�� ��H��T��M��L�� ��P��U��B��L��I��C�� ��"��-��/��/��W��3��C��/��/��D��T��D�� ��H��T��M��L�� ��4��.��0��1��/��/��E��N��"�� ��"��h��t��t��p��:��/��/��w��w��w��.��w��3��.��o��r��g��/��T��R��/��h��t��m��l��4��/��s��t��r��i��c��t��.��d��t��d��"��>��
��<��h��t��m��l��>��
��  ��<��h��e��a��d��>��
��  ��
��  ��<��!��-��-��
��  ����� ��D��A��T��E��N��L��O��T��S��E��N�� ��I��N��F��O��R��M��A��T��I��O��N��S��S��Y��S��T��E��M��E�� ��G��M��B��H��
��  ��e��-��m��a��i��l��:�� ��  ��  ��i��n��f��o��@��d��a��t��e��n��l��o��t��s��e��n��.��d��e��
��  ��w��e��b��:�� ��   ��  ��  ��h��t��t��p��:��/��/��w��w��w��.��d��a��t��e��n��l��o��t��s��e��n��.��d��e��
��  ��
��  ��c��u��s��t��o��m��e��r��:�� ��    ��  ��u��h��h��
��  ��v��e��r��s��i��o��n��:�� ��   ��  ��5��.��4��0��.��0��0��8��
��  ��f��i��l��e��n��a��m��e��:��   ��  ��i��n��d��e��x��.��h��t��m��
��/��/��-��-��>��
��
��  ��
��  ��  ��<��t��i��t��l��e��>��U��n��i��v��e��r��s��i��t�����t�� ��H��a��m��b��u��r��g��<��/��t��i��t��l��e��>��
��  ��  ��  ��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��X��-��U��A��-��C��o��m��p��a��t��i��b��l��e��"�� ��c��o��n��t��e��n��t��=��"��I��E��=��E��m��u��l��a��t��e��I��E��9��"�� ��/��>�� ��<��!��-��-�� ��I��E��9�� ��d��o��c��u��m��e��n��t�� ��m��o��d��e�� ��o��n��l��y�� ��-��-��>��
��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��c��a��c��h��e��-��c��o��n��t��r��o��l��"��  �� ��   ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��e��x��p��i��r��e��s��"�� �� ��  ��  ��  ��c��o��n��t��e��n��t��=��"��-��1��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� ��    ��  ��  ��  ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� ��    ��  ��  ��  ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� ��    ��  ��  ��  ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��
��  ��  ��<��m��e��t��a�� ��n��a��m��e��=��"��v��i��e��w��p��o��r��t��"�� ��c��o��n��t��e��n��t��=��"��w��i��d��t��h��=��d��e��v��i��c��e��-��w��i��d��t��h��,�� ��i��n��i��t��i��a��l��-��s��c��a��l��e��=��1��,��u��s��e��r��-��s��c��a��l��a��b��l��e��=��0��"�� ��/��>��
��  ��
��  ��  ��  ��  ��
��  ��  ��<��l��i��n��k�� ��r��e��l��=��"��a��p��p��l��e��-��t��o��u��c��h��-��i��c��o��n��"�� ��h��r��e��f��=��"��/��g��f��x��/��u��h��h��/��i��c��o��n��s��/��i��p��h��o��n��e��_��t��o��u��c��h��_��i��c��o��n��.��p��n��g��"�� ��t��y��p��e��=��"��i��m��a��g��e��/��g��i��f��"�� ��/��>��
��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��r��e��f��r��e��s��h��"�� ��c��o��n��t��e��n��t��=��"��0��;�� ��U��R��L��=��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��1��"�� ��/��>��
��  ��  ��  ��  ��
��  ��  ��<��l��i��n��k�� �� �� ��h��r��e��f��=��"��/��c��s��s��/��_��d��e��f��a��u��l��t��/��d��l��.��s��t��a��r��t��p��a��g��e��.��c��s��s��"��   ��  ��  ��r��e��l��=��"��s��t��y��l��e��s��h��e��e��t��"�� �� ��t��y��p��e��=��"��t��e��x��t��/��c��s��s��"��   ��/��>��
��  ��  ��
��  ��  ��<��s��c��r��i��p��t�� ��t��y��p��e��=��"��t��e��x��t��/��j��a��v��a��s��c��r��i��p��t��"�� ��s��r��c��=��"��/��j��s��/��m��o��b��i��l��e��_��m��a��s��t��e��r��/��j��q��u��e��r��y��.��j��s��"��>��<��/��s��c��r��i��p��t��>��
��  ��  ��<��s��c��r��i��p��t�� ��t��y��p��e��=��"��t��e��x��t��/��j��a��v��a��s��c��r��i��p��t��"�� ��s��r��c��=��"��/��j��s��/��m��o��b��i��l��e��_��m��a��s��t��e��r��/��o��n��m��e��d��i��a��q��u��e��r��y��.��m��i��n��.��j��s��"��>��<��/��s��c��r��i��p��t��>��
��
��  ��<��/��h��e��a��d��>��
��  ��
��  ��<��b��o��d��y��>��    ��  ��
��  ��  ��<��d��i��v�� ��i��d��=��"��w��r��a��p��p��e��r��"��>��
��  ��  ��  ��<��a�� ��h��r��e��f��=��"��h��t��t��p��:��/��/��w��w��w��.��u��n��i��-��h��a��m��b��u��r��g��.��d��e��"�� ��t��i��t��l��e��=��"��e��x��t��e��r��n�� ��w��w��w��.��u��n��i��-��h��a��m��b��u��r��g��.��d��e��"��>��
��  ��  ��  ��  ��<��i��m��g�� ��b��o��r��d��e��r��=��"��0��"�� ��i��d��=��"��l��o��g��o��"�� ��s��r��c��=��"��/��g��f��x��/��u��h��h��/��l��o��g��o��.��p��n��g��"�� ��a��l��t��=��"��L��o��g��o�� ��U��n��i��v��e��r��s��i��t�����t�� ��H��a��m��b��u��r��g��"�� ��/��>��
��  ��  ��  ��<��/��a��>��
��  ��  ��  ��
��  ��  ��  ��<��u��l�� ��i��d��=��"��l��a��n��g��M��e��n��u��"��>��
��  ��  ��  ��  �� �� ��<��!��-��-�� ��/��/�� ��F��O��R��W��A��R��D��I��N��G�� ��0��0��1�� ��G��e��r��m��a��n�� ��/��/�� ��-��-��>��
��  ��  ��  ��  ��  ��<��l��i��>��<��a�� ��c��l��a��s��s��=��"��i��m��g�� ��i��m��g��_��L��a��n��g��G��e��r��m��a��n��"�� ��h��r��e��f��=��"��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��1��"��>��d��e��<��/��a��>��<��/��l��i��>��
��  ��  ��  ��  ��  ��  ��  ��  ��
��  ��  ��  ��  �� ��<��!��-��-�� ��/��/�� ��F��O��R��W��A��R��D��I��N��G�� ��0��0��2�� ��E��n��g��l��i��s��h�� ��/��/�� ��-��-��>��
��  ��  ��  ��  ��  ��<��l��i��>��<��a�� ��c��l��a��s��s��=��"��i��m��g�� ��i��m��g��_��L��a��n��g��E��n��g��l��i��s��h��"�� ��h��r��e��f��=��"��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��2��"��>��e��n��<��/��a��>��<��/��l��i��>��
��  ��  ��  ��  ��  ��  ��  ��  ��
��  ��  ��  ��  ��  ��  ��  ��  ��
��  ��  ��  ��  ��  ��  ��  ��<��/��u��l��>��
��  ��  ��  ��
��  ��  ��<��/��d��i��v��>��
��  ��<��/��b��o��d��y��>��
��<��/��h��t��m��l��>��
��

Thanks for your help.

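Edit: I noticed something else: this error only occurs when I fetch the base URL directly (i.e. "https://www.stine.uni-hamburg.de"). If, on the other hand, I call Jsoup on BASE_URL/scripts/mgrqispi.dll, I get a valid result (with the same settings). However, I also need the base URL to be handled correctly, because the page uses this annoying forwarding to create a session.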

2 answers:

Answer 0 (score: 0)

I think I ran into something similar before. If I can find the code, I will post it here.

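In any case, it is probably related to character encoding. Consider reading the page as an input stream and setting the encoding explicitly: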

// from the docs - http://jsoup.org/apidocs/
parse(InputStream inputstream, String charsetName, String baseUri)

// example
Jsoup.parse(in, "ISO-8859-2", url);

You would want to read the URL as an input stream. Perhaps something like this?

InputStream inputstream = new URL(url).openStream();
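If the right charset is not known in advance, one option is to look at the first bytes of the stream for a byte-order mark (BOM) before decoding. The following is only a minimal sketch of that idea using the standard library. The class name BomSniffer and its helper methods are hypothetical, and the sample bytes stand in for the real response body, whose actual encoding is not confirmed anywhere in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup.
public class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or the fallback if there is none. */
    public static Charset detectCharset(byte[] head, Charset fallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return fallback;
    }

    /** Decodes the bytes with the sniffed charset and drops a leading U+FEFF if present. */
    public static String decode(byte[] bytes, Charset fallback) {
        String text = new String(bytes, detectCharset(bytes, fallback));
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /** Reads a stream fully into memory (fine for small pages). */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in for new URL(url).openStream(); a BOM followed by UTF-16LE text.
        byte[] sample = "\uFEFF<html>".getBytes(StandardCharsets.UTF_16LE);
        byte[] body = readAll(new ByteArrayInputStream(sample));
        System.out.println(detectCharset(body, StandardCharsets.UTF_8));
        System.out.println(decode(body, StandardCharsets.UTF_8));
    }
}
```

The name of the detected charset could then be passed as the charsetName argument of the parse overload quoted above, instead of a hard-coded value such as "ISO-8859-2".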

Answer 1 (score: 0)

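I have now been able to work around the problem by using the first URL that the crawled page forwards to as the entry point for my script. I still don't understand how this strange response comes about. If I were using the wrong character encoding, I would expect some characters not to display, but not something like this... However, since my program is running now, I consider this thread closed. If you think you know a solution, feel free to post it and I will test it, so that it can help future readers.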
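One possible explanation for the pattern, offered as an assumption rather than something confirmed in this thread: a filler character after every real character is what typically appears when UTF-16 text is decoded with a single-byte or UTF-8 charset, because each ASCII character in UTF-16 is followed (or preceded) by a zero byte. The sketch below reproduces that effect with the standard library only. The class name Utf16Mojibake is hypothetical, and ISO-8859-1 is used as the wrong charset purely for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what UTF-16LE bytes look like under a wrong charset.
public class Utf16Mojibake {

    /** Encodes text as UTF-16LE, then decodes those bytes with a single-byte charset. */
    public static String misdecode(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String wrong = misdecode("<html>");
        // Make the NUL characters visible by printing them as '?'.
        System.out.println(wrong.replace('\0', '?')); // prints "<?h?t?m?l?>?"

        // Decoding the same bytes with the matching charset restores the text.
        byte[] utf16 = "<html>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(new String(utf16, StandardCharsets.UTF_16LE)); // prints "<html>"
    }
}
```

If this assumption holds for the start page, it would also fit the observation that the mgrqispi.dll URL parses fine: the two resources may simply be served in different encodings, while the Content-Type header shown in the question declares no charset for either.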