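I am trying to download a gzip-compressed file (~390 MB) from an FTP server using Java, but the program stops after reading only a small part of the file.
Here is a minimal program that reproduces the problem: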
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
public class Test
{
public static void main(String args[]) throws Exception
{
int count=0;
URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
String line;
BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));
while((line=in.readLine())!=null)
{
++count;
System.err.println("["+count+"] "+line);
}
in.close();
System.out.println("Done. nLines="+count);
}
}
Compile and run:
javac Test.java
java -Dftp.proxyHost=${MYPROXYHOST} -Dftp.proxyPort=${MYPROXYPORT} Test
The output stops prematurely after line 1012:
(...)
[999] 1 750138 rs61770171 G A . PASS DP=2189;AF=0.083;CB=UM,BI;EUR_R2=0.129;AFR_R2=0.164
[1000] 1 750153 . T C . PASS DP=2555;AF=0.016;CB=UM,BI,BC;EUR_R2=0.167;AFR_R2=0.281
[1001] 1 750190 . C T . PASS DP=3515;AF=0.003;CB=UM,BI;EUR_R2=0.581;AFR_R2=0.575
[1002] 1 750235 . G A . PASS DP=3914;AF=0.019;CB=UM,BI,BC;EUR_R2=0.719;AFR_R2=0.733
[1003] 1 750436 . C T . PASS DP=598;AF=0.020;CB=BI,BC;EUR_R2=0.144;AFR_R2=0.355
[1004] 1 750511 . G A . PASS DP=806;AF=0.010;CB=BI,BC;AFR_R2=0.352
[1005] 1 750718 . G A . PASS DP=2751;AF=0.003;CB=UM,BI,BC;EUR_R2=0.54;AFR_R2=0.545
[1006] 1 750897 . G A . PASS DP=744;AF=0.010;CB=BI,BC;AFR_R2=0.479
[1007] 1 750946 . A G . PASS DP=873;AF=0.010;CB=BI,BC;AFR_R2=0.414
[1008] 1 751043 . G A . PASS DP=1522;AF=0.000;CB=BI,BC;EUR_R2=0.273
[1009] 1 751281 . T C . PASS DP=403;AF=0.010;CB=BI,BC;AFR_R2=0.178
[1010] 1 751343 . T A . PASS DP=1912;AF=0.117;CB=UM,BI;EUR_R2=0.683;AFR_R2=0.582
[1011] 1 751456 . T C . PASS DP=1775;AF=0.008;CB=UM,BI;EUR_R2=0.515;AFR_R2=0.332
[1012] 1
Done. nLines=1012
Why? What is happening?
Thanks for your help.
Pierre
EDIT: I also tried using an InputStream instead of a Reader. It does not work either:
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
public class Test
{
public static void main(String args[]) throws Exception
{
URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
String line;
byte array[]=new byte[10];
int nRead=0;
InputStream in= new GZIPInputStream(url.openStream());
while((nRead=in.read(array))!=-1)
{
System.out.write(array,0,nRead);
}
in.close();
System.out.println("Done.");
}
}
Answer 0 (score: 2)
ftp.1000genomes.ebi.ac.uk uses a special form of gzip compression that GZIPInputStream cannot handle (see http://biostar.stackexchange.com/questions/6112/i-cant-download-a-file-from-the-1k-genomes-ftp-site/6114#6114).
Using net.sf.samtools.util.BlockCompressedInputStream instead of GZIPInputStream solves the problem:
import java.net.*;
import java.io.*;
import net.sf.samtools.util.BlockCompressedInputStream;
public class Test
{
public static void main(String args[]) throws Exception
{
URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
String line;
BufferedReader in= new BufferedReader(new InputStreamReader(new BlockCompressedInputStream(url.openStream())));
while((line=in.readLine())!=null)
{
System.out.println(line);
}
in.close();
System.out.println("Done.");
}
}
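To make "special form of gzip" a little more concrete: BlockCompressedInputStream reads the BGZF format, which, as far as I know from the SAM/BAM specification, is a series of small gzip members, each carrying a "BC" extra subfield with the block size. The sketch below only inspects the first 18 bytes of a buffer to guess whether it is BGZF. The class name BgzfSniffer and its methods are made up for illustration, and it assumes the "BC" subfield comes first in the extra field, which is what BGZF writers normally produce.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the JDK or of samtools.
final class BgzfSniffer {

    // True if the bytes start with a gzip member header whose first extra
    // subfield is the "BC" subfield that BGZF uses to store the block size.
    static boolean looksLikeBgzf(byte[] h) {
        if (h.length < 18) {
            return false;
        }
        boolean gzipMagic = (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b;
        boolean deflate = h[2] == 8;                        // CM = 8 (deflate)
        boolean hasExtra = (h[3] & 0x04) != 0;              // FLG.FEXTRA is set
        int xlen = (h[10] & 0xff) | ((h[11] & 0xff) << 8);  // little-endian XLEN
        boolean bcSubfield = h[12] == 'B' && h[13] == 'C' && h[14] == 2 && h[15] == 0;
        return gzipMagic && deflate && hasExtra && xlen >= 6 && bcSubfield;
    }

    // Total compressed size of the first block; the BSIZE field stores size - 1.
    static int blockSize(byte[] h) {
        if (!looksLikeBgzf(h)) {
            throw new IllegalArgumentException("not a BGZF header");
        }
        return ((h[16] & 0xff) | ((h[17] & 0xff) << 8)) + 1;
    }

    // Compresses data with the JDK's ordinary gzip writer (no extra field).
    static byte[] plainGzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // The 28-byte empty block that BGZF writers append as an end-of-file marker.
        byte[] eof = {0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, (byte) 0xff, 6, 0,
                      'B', 'C', 2, 0, 0x1b, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(looksLikeBgzf(eof));   // true
        System.out.println(blockSize(eof));       // 28
        byte[] plain = plainGzip("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(looksLikeBgzf(plain)); // false
    }
}
```

Running the same check on the first bytes fetched from the FTP URL would be one way to tell, before decompressing, whether a plain GZIPInputStream is likely to be the wrong tool.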
Answer 1 (score: 1)
Don't use readLine; read the file into a byte[] instead! Otherwise you will run into a lot of nasty String conversion errors!
byte[] buf = new byte[4096];
int bytesRead;
while( (bytesRead = in.read(buf)) >= 0 ) {
outFile.write(buf,0,bytesRead);
}
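A slightly fuller sketch of the same loop, with the streams closed automatically. The class name RawCopy and the method names are invented for this example, and the caller is assumed to pass in a stream that already decompresses (a GZIPInputStream, or a BlockCompressedInputStream for the file in the question). Note that copying bytes only avoids character-set conversion; as the edit in the question shows, it does not by itself stop the stream from ending early.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class RawCopy {

    // Copies every byte the stream delivers into target and returns the byte count.
    // No String or charset conversion is involved.
    static long copy(InputStream decompressed, Path target) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        try (InputStream in = decompressed;
             OutputStream outFile = Files.newOutputStream(target)) {
            int bytesRead;
            while ((bytesRead = in.read(buf)) >= 0) {
                outFile.write(buf, 0, bytesRead);
                total += bytesRead;
            }
        }
        return total;
    }

    // Gzips payload in memory, copies the decompressed bytes to a temporary
    // file, and returns what ended up in that file.
    static byte[] demoRoundTrip(byte[] payload) {
        try {
            ByteArrayOutputStream packed = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(packed)) {
                gz.write(payload);
            }
            Path tmp = Files.createTempFile("rawcopy", ".bin");
            try {
                copy(new GZIPInputStream(new ByteArrayInputStream(packed.toByteArray())), tmp);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "1\t750138\trs61770171\tG\tA\n".getBytes(StandardCharsets.UTF_8);
        // Compares the length of the copied file with the original payload.
        System.out.println(demoRoundTrip(payload).length == payload.length);
    }
}
```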
Answer 2 (score: 0)
OK, this might help: the output above (up to line 1012) is exactly 65,536 bytes long. A strange coincidence, isn't it? Try this code on the .vcf file:
FileInputStream in = new FileInputStream("ALL.2of4intersection.20100804.sites.vcf");
in.skip(65534);
for (int i=0; i<10; i++) {
System.out.println("byte [" + (65534 + i) + "] = " + in.read());
}
I get the following output:
byte [65534] = 49
byte [65535] = 9
byte [65536] = -1
byte [65537] = -1
byte [65538] = -1
byte [65539] = -1
byte [65540] = -1
byte [65541] = -1
byte [65542] = -1
byte [65543] = -1
Also, if you try the following command:
head -2070 ALL.2of4intersection.20100804.sites.vcf >test.vcf
you get only the first 65,536 bytes.
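The number 65,536 is 2^16, i.e. exactly 64 KiB, which would fit a decompressor that stops after one fixed-size block (to my knowledge, 64 KiB is the maximum uncompressed size of a BGZF block, but treat that as an assumption rather than something shown in this thread). A rough way to check where a stream ends, without counting lines, is to count the bytes it delivers. The class StreamLength and its methods are hypothetical names for this sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class StreamLength {

    // Reads the stream to its end and returns how many bytes it delivered.
    static long count(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    // Number of bytes a GZIPInputStream delivers for the given compressed data.
    static long decompressedLength(byte[] gzipBytes) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
            return count(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses raw as a single ordinary gzip member.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[1 << 16];  // 65,536 zero bytes as stand-in data
        System.out.println(decompressedLength(gzip(raw)) == (1L << 16));
    }
}
```

Applying count to the GZIPInputStream from the question and comparing the result with 65,536 would confirm whether the truncation always falls on that boundary.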