I need to process large gzip-compressed text files.
InputStream is = new GZIPInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation();
}
This code works as long as I do not do any lengthy computation inside the loop, which I have to do. Adding even a few milliseconds of sleep per line makes the program eventually crash with a java.util.zip.ZipException. The exception message is different every time ("invalid literal/length code", "invalid block type", "invalid stored block lengths").
So it looks as if the stream gets corrupted when I do not read it quickly enough.
I can decompress the file itself without any problems. I have also tried GzipCompressorInputStream from Apache Commons Compress, with the same result.
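For reference, the Commons Compress attempt looked roughly like this (a minimal sketch; it only swaps the decompressing stream, the rest mirrors the code above):

// Same loop as above, but decompressing with Apache Commons Compress
// (org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream,
// assumes the commons-compress dependency is on the classpath).
InputStream is = new GzipCompressorInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation(); // placeholder for the real per-line work
}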
What is going wrong here, and how can I fix it?
Update 1
I thought I had ruled this out, but after more testing I found that the problem only occurs when streaming the file from the internet.
Full example:
URL source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
String line;
while ((line = br.readLine()) != null) { // exception is thrown here
    Thread.sleep(5);
}
Interestingly, when I print line numbers, the crash always happens at one of the same four or five line numbers.
Update 2
Here is a complete example using an actual file:
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
public class TestGZIPStreaming {

    public static void main(String[] args) throws IOException, InterruptedException {
        URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
        HttpURLConnection connection = (HttpURLConnection) source.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "gzip, deflate");
        BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
        String line;
        int n = 0;
        while ((line = br.readLine()) != null) { // exception is thrown here
            Thread.sleep(10);
            System.out.println(++n);
        }
    }
}
With this file, the crash occurs at around line 90,000.
To rule out a timeout problem, I tried connection.setReadTimeout(0) - no effect.
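The timeout was set on the connection before reading, roughly like this (a zero value means an infinite read timeout):

HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setReadTimeout(0); // 0 = infinite read timeout, block indefinitely
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");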
This may well be a network problem. But since I can download the file in a browser, there must be a way to handle it.
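For comparison, the browser test corresponds roughly to saving the complete .gz to disk first and only then decompressing it locally. A minimal sketch of that (the temp-file name is a placeholder, not part of my actual setup; assumes the usual java.io, java.nio.file and java.util.zip imports):

// Download the whole file first, then decompress from disk, so the slow
// per-line work no longer competes with the network read.
URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
Path tmp = Files.createTempFile("wikidata-statements", ".nt.gz");
try (InputStream in = source.openStream()) {
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
}
try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Files.newInputStream(tmp))))) {
    String line;
    while ((line = br.readLine()) != null) {
        someComputation();
    }
}

This takes the slow consumer out of the download path, at the cost of storing the full dump locally.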
Update 3
I have tried making the connection with Apache HttpClient instead.
HttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
get.addHeader("Accept-Encoding", "gzip");
HttpResponse response = client.execute(get);
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(new BufferedInputStream(response.getEntity().getContent()))));
Now I get the following exception, which may be more helpful:
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 3850131; received: 1581056)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
Again, there must be a way to handle this, since I can download the file in a browser and decompress it without any problems.
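In case it is relevant: with HttpClient, buffering the entire entity before decompressing would take the slow loop out of the download path, e.g. with BufferedHttpEntity (just a sketch, and it holds the whole compressed download in memory):

// Sketch: read the whole response into memory first, then decompress.
// Only feasible if the compressed file fits comfortably in memory.
HttpResponse response = client.execute(get);
HttpEntity buffered = new BufferedHttpEntity(response.getEntity());
BufferedReader br = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(buffered.getContent())));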