I need to process large gzip-compressed text files.
InputStream is = new GZIPInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation();
}
This code works as long as I do not do any lengthy computation inside the loop, which I have to do. Adding even a few milliseconds of sleep per line makes the program eventually crash with a java.util.zip.ZipException. The exception message is different every time ("invalid literal/length code", "invalid block type", "invalid stored block lengths").
So it looks as if the stream gets corrupted when I do not read it quickly enough.
I can decompress the file itself without any problems. I have also tried GzipCompressorInputStream from Apache Commons Compress, with the same result.
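For reference, the Commons Compress attempt looked roughly like this (a minimal sketch; it only swaps the decompressing stream, the rest mirrors the code above):

// Same loop as above, but decompressing with Apache Commons Compress
// (org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream,
// assumes the commons-compress dependency is on the classpath).
InputStream is = new GzipCompressorInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation(); // placeholder for the real per-line work
}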
What is going wrong here, and how can I fix it?
Update 1
I thought I had ruled this out, but after more testing I found that the problem only occurs when streaming the file from the internet.
Full example:
URL source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
String line;
while ((line = br.readLine()) != null) { // exception is thrown here
    Thread.sleep(5);
}
Interestingly, when I print line numbers, the crash always happens at one of the same four or five line numbers.
Update 2
Here is a complete example using an actual file:
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
public class TestGZIPStreaming {

    public static void main(String[] args) throws IOException, InterruptedException {
        URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
        HttpURLConnection connection = (HttpURLConnection) source.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "gzip, deflate");
        BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
        String line;
        int n = 0;
        while ((line = br.readLine()) != null) { // exception is thrown here
            Thread.sleep(10);
            System.out.println(++n);
        }
    }
}
With this file, the crash occurs at around line 90,000.
To rule out a timeout problem, I tried connection.setReadTimeout(0) - no effect.
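The timeout was set on the connection before reading, roughly like this (a zero value means an infinite read timeout):

HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setReadTimeout(0); // 0 = infinite read timeout, block indefinitely
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");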
This may well be a network problem. But since I can download the file in a browser, there must be a way to handle it.
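For comparison, the browser test corresponds roughly to saving the complete .gz to disk first and only then decompressing it locally. A minimal sketch of that (the temp-file name is a placeholder, not part of my actual setup; assumes the usual java.io, java.nio.file and java.util.zip imports):

// Download the whole file first, then decompress from disk, so the slow
// per-line work no longer competes with the network read.
URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
Path tmp = Files.createTempFile("wikidata-statements", ".nt.gz");
try (InputStream in = source.openStream()) {
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
}
try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Files.newInputStream(tmp))))) {
    String line;
    while ((line = br.readLine()) != null) {
        someComputation();
    }
}

This takes the slow consumer out of the download path, at the cost of storing the full dump locally.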
Update 3
I have tried making the connection with Apache HttpClient instead.
HttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
get.addHeader("Accept-Encoding", "gzip");
HttpResponse response = client.execute(get);
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(new BufferedInputStream(response.getEntity().getContent()))));
Now I get the following exception, which may be more helpful:
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 3850131; received: 1581056)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
Again, there must be a way to handle this, since I can download the file in a browser and decompress it without any problems.
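In case it is relevant: with HttpClient, buffering the entire entity before decompressing would take the slow loop out of the download path, e.g. with BufferedHttpEntity (just a sketch, and it holds the whole compressed download in memory):

// Sketch: read the whole response into memory first, then decompress.
// Only feasible if the compressed file fits comfortably in memory.
HttpResponse response = client.execute(get);
HttpEntity buffered = new BufferedHttpEntity(response.getEntity());
BufferedReader br = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(buffered.getContent())));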