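I am trying to read the source of a web page into a string so I can parse it. The result vaguely resembles the site's HTML, but the text is gibberish. I am following a tutorial, and the instructor's source code gives me the same problem. It also happens with every site I try. Could something be wrong with my computer or internet connection?
Log output: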
07-26 17:29:49.143 10863-10863/org.andrewedgar.downloadwebcontent I/Result: !otp tl
<-[fl E7> hm ls=n-sl-e ti8l-e"ln=" !edf-><-[fI ] hm ls=n-sl-e ti8 ag"><[ni]-
!-i E8> <tlcas"oj ti9 ag"><[ni]-
!-i tI ]<-><tlcas"oj"ln=e" !-!edf-> <ed
mt hre=uf8> <eanm=vepr"cnet"it=eiewdh nta-cl="
mt ae"ecito"cnet"omi ooulaplnigpg hr nld wsm adn aedms"
mt ae"uhr otn=Wwhmz>
tteZpyoe/il>
ln e=sotu cn ye"mg/-cn rf"sai/m/aio.n"
<- otAeoeCS-> <ikrl"tlset rf"sai/s/otaeoemncs> <- hmf cn S -
ln e=syehe"he=/ttccsteiyioscs> <- lgn otIosCS-> <ikrl"tlset rf"sai/s/lgn-otioscs> <- lgn ieIosCS-> <ikrl"tlset rf"sai/s/lgn-ieioscs> <- otta S -
ln e=syehe"he=/ttccsbosrpmncs> <- lcnvCS-> <ikrl"tlset rf"sai/s/lcnvmncs> <- nmt S -
ln e=syehe"he=/ttccsaiaemncs> <- eoo S -
ln e=syehe"he=/ttccsvnbxvnbxcs> <- W-aoslCS-> <ikrl"tlset rf"sai/s/w.aoslcs> <- anCS-> <ikrl"tlset rf"sai/s/ancs> <- epnieCS-> <ikrl"tlset rf"sai/s/epniecs>
srp r=/ttcj/edrmdrir283rsod142mnj"<srp> <ha> <oydt-p=srl"dt-agt"nveu aaofe=7"
!-i tI ]
pcas"rweugae>o r sn n<togotae<srn>bosr lae< rf"tp/boshpycm"ugaeyu rwe<a oipoeyu xeine<p
!edf->
dvi=peodr
dvcas'odr
dvcas"atr"<dv
/i> <dv<- rlae -
<edri=hae"cas"edrscin> <i ls=cnanr> <a ls=nva"
ahe=# ls=nva-rn"<m d"rnLg"sc"sai/m/apCdLgWtTx.n"at"apcd"<a
dvcas"-lxmn-rp> dvi=nveu ls=mimn"
<lcas"a"
l < aasrl ls=nvln cie rf"hm"Hm sa ls=s-ny>cret<sa>/>/i
/l
<dv
dvcas"eubn> < rf"tp:/er.apcd.o"cas"utn1>er<a
/i> <dv
/a> <dv
/edr !-Hae -
<eto d"oe ls=hr_eto rdat1pdig> <i ls=dslytbe> <i ls=tbecl"
dvcas"otie"
dvcas"eocnet> <1Lancd h<rfnwy/1
pPormigdenthv ob oigtdosadfutan.b>oehv oefnadlanhwt oe<p
ahe=hts/lanzpyoecm ls=bto_"LanNw/> <dv
/i>
/i> <dv
/eto>!-Hr eto -
<- QeyLb-> <citsc"sai/svno/qey11..i.s>/cit
!-BosrpJ -
srp r=/ttcj/edrbosrpmnj"<srp> <- ehrJ -
srp r=/ttcj/edrtte.i.s>/cit
!-wyonsj -
srp r=/ttcj/edrjur.apit.203mnj"<srp> <- lcnvJ -
srp r=/ttcj/edrjur.lcnvmnj"<srp> <- W-aoslJ -
srp r=/ttcj/edrolcrue.i.s>/cit
!-CutrpJ -
srp r=/ttcj/edrjur.oneu.i.s>/cit
!-Sot colJ -
srp r=/ttcj/edrsot-colmnj"<srp> <- edrJ -
srp r=/ttcj/edrvnbxmnj"<srp> <- jxhm S-> <citsc"sai/svno/qeyaacipmnj"<srp> <- o S-> <citsc"sai/svno/o.i.s>/cit
!-Mi S-> <citsc"sai/smi.s>/cit
<bd><hm>
Code:
public class DownloadTask extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... urls) {
        String result = "";
        URL url;
        HttpURLConnection urlConnection = null;
        try {
            url = new URL(urls[0]);
            urlConnection = (HttpURLConnection) url.openConnection();
            InputStream in = urlConnection.getInputStream();
            InputStreamReader reader = new InputStreamReader(in);
            int data = reader.read();
            while (data != -1) {
                data = reader.read();
                char current = (char) data;
                result += current;
                data = reader.read();
            }
            return result;
        } catch (Exception e) {
            e.printStackTrace();
            return "Failed";
        }
    }
}
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    DownloadTask task = new DownloadTask();
    String result = null;
    try {
        result = task.execute("http://www.zappycode.com").get();
    } catch (Exception e) {
        e.printStackTrace();
    }
    Log.i("Result", result);
}
}
Answer 0 (score: 1)
You are reading from the stream twice in each iteration:
while (data != -1) {
    data = reader.read(); // <<- here
    char current = (char) data;
    result += current;
    data = reader.read(); // <<- and here
}
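But you append to the result only once. Together with the read before the loop, this means every other character is thrown away, so you end up with only the 2nd, 4th, 6th, ... characters of the page (which is why `<!doctype html>` shows up in your log as `!otp tl`). Something like this should work:
int data;
while ((data = reader.read()) != -1) result += (char) data;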
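To make the effect easy to see without a network connection, here is a minimal sketch that runs both loops over an in-memory string. The class and method names (`SkippedCharsDemo`, `readBuggy`, `readFixed`) are made up for this illustration, and a `StringReader` stands in for the `InputStreamReader` from the question; a `StringBuilder` replaces the `String +=` purely for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SkippedCharsDemo {
    // Same read pattern as the loop in the question:
    // one read before the loop, two reads per iteration, one append.
    static String readBuggy(Reader reader) throws IOException {
        StringBuilder result = new StringBuilder();
        int data = reader.read();
        while (data != -1) {
            data = reader.read();        // overwrites the character read previously
            result.append((char) data);
            data = reader.read();        // this character is only tested, never appended
        }
        return result.toString();
    }

    // One read per iteration, and every character read is appended.
    static String readFixed(Reader reader) throws IOException {
        StringBuilder result = new StringBuilder();
        int data;
        while ((data = reader.read()) != -1) {
            result.append((char) data);
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<!doctype html>\n";
        System.out.print(readBuggy(new StringReader(html))); // prints "!otp tl"
        System.out.print(readFixed(new StringReader(html))); // prints "<!doctype html>"
    }
}
```

The buggy output matches the first line of the log in the question, which suggests the computer and the connection are fine and the loop is the only problem.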
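In general, though, pulling the input in one unit at a time and casting each value to a char is not a good idea. (Strictly speaking, `InputStreamReader` already decodes bytes into characters, using the platform default charset unless you pass one explicitly; the bigger costs here are the unbuffered reads and the repeated `String` concatenation.) Something like this is more robust:
BufferedReader br = new BufferedReader(reader);
StringBuilder accumulator = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    accumulator.append(line).append(System.lineSeparator());
}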
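Put together as a self-contained helper, the buffered approach might look like the sketch below. `StreamToString` and `readAll` are hypothetical names, UTF-8 is an assumption about the page's encoding, and a `ByteArrayInputStream` stands in for `urlConnection.getInputStream()` so the example runs without a network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamToString {
    // Reads the whole stream line by line and closes it when done.
    // Assumes the bytes are UTF-8; every line is terminated with '\n'.
    static String readAll(InputStream in) throws IOException {
        StringBuilder accumulator = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                accumulator.append(line).append('\n');
            }
        }
        return accumulator.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html>\n<body>hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page)));
    }
}
```

In the question's `doInBackground`, the call would be `readAll(urlConnection.getInputStream())`. Note that `readLine()` strips line terminators, so the result normalizes all line endings to `\n` and always ends with one.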
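Answer 1 (score: 0)
It looks like your code is reading raw 8-bit ASCII characters and displaying them. The site may be using a different character encoding (see this Wikipedia article on encoding). Instead of reading byte by byte, use a buffered reader and let Java convert the sequence of encoded bytes into a String. @xtratic pointed to another answer on StackOverflow whose code sample will work here: How to read an http input stream.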
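The log in the question is explained by the skipped reads rather than by encoding, but choosing the charset explicitly is still worthwhile. The sketch below shows what a mismatch looks like and one way to pick a charset from a `Content-Type` header value such as the one returned by `HttpURLConnection.getContentType()`. `CharsetDemo` and `charsetFrom` are hypothetical names, and falling back to UTF-8 is an assumption, not something the HTTP response guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetDemo {
    // Hypothetical helper: pulls "charset=..." out of a Content-Type value,
    // e.g. "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
        byte[] utf8Bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset turns one character into two.
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong.length()); // 2
        System.out.println(right.length()); // 1
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1")); // ISO-8859-1
    }
}
```

The resulting `Charset` can be passed as the second argument to `new InputStreamReader(in, charset)` before wrapping it in a `BufferedReader`.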