I am trying to build a small web crawler that downloads a web page and searches for the links in a particular section. But when I run this code, the links from the "href" attributes come out shortened. For example:
Original link: "/kids-toys-action-figures-accessories/b/ref=toys_hp_catblock_actnfigs?ie=UTF8&node=165993011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-4&pf_rd_r=267646F4BB25430BAD0D&pf_rd_t=101&pf_rd_p=1582921042&pf_rd_i=165793011"
becomes: "/kids-toys-action-figures-accessories/b?ie=UTF8&node=165993011"
Can anyone please help me? Below is my code:

package test;

import java.io.*;
import java.net.MalformedURLException;
import java.net.URL; // java.net.URL must be imported explicitly; it was missing
import java.util.*;

public class myFirstWebCrawler {

    public static void main(String[] args) {
        String strTemp = "";
        String dir = "d:/files/";
        String filename = "hello.txt";
        String fullname = dir + filename;
        try {
            URL my_url = new URL("http://www.amazon.com/s/ref=lp_165993011_ex_n_1?rh=n%3A165793011&bbn=165793011&ie=UTF8&qid=1376550433");
            BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream(), "utf-8"));
            createdir(dir);
            // dump every line of the page to a local file, echoing it to the console
            while (null != (strTemp = br.readLine())) {
                writetofile(fullname, strTemp);
                System.out.println(strTemp);
            }
            System.out.println("index of feature category : " + readfromfile(fullname, "Featured Categories"));
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static void createdir(String dirname) {
        File d = new File(dirname);
        d.mkdirs();
    }

    public static void writetofile(String path, String bbyte) {
        try {
            FileWriter filewriter = new FileWriter(path, true); // append mode
            BufferedWriter bufferedWriter = new BufferedWriter(filewriter);
            bufferedWriter.write(bbyte);
            bufferedWriter.newLine();
            bufferedWriter.close();
        } catch (IOException e) {
            System.out.println("Error");
        }
    }

    public static int readfromfile(String path, String key) {
        String dir = "d:/files/";
        String filename = "hello1.txt";
        String fullname = dir + filename;
        int index = -1;
        BufferedReader bf = null;
        try {
            bf = new BufferedReader(new FileReader(path));
            String currentLine;
            while ((currentLine = bf.readLine()) != null) {
                index = currentLine.indexOf(key);
                if (index >= 0) { // >= 0, not > 0: a match at column 0 is still a match
                    writetofile(fullname, currentLine);
                    int count = 0;
                    int lastIndex = 0;
                    // walk the line, printing every href="..." value found on it
                    while (lastIndex != -1) {
                        lastIndex = currentLine.indexOf("href=\"", lastIndex);
                        if (lastIndex != -1) {
                            lastIndex += "href=\"".length();
                            StringBuilder sb = new StringBuilder();
                            // collect characters up to the closing quote, guarding against running off the line
                            while (lastIndex < currentLine.length() && currentLine.charAt(lastIndex) != '\"') {
                                sb.append(currentLine.charAt(lastIndex));
                                lastIndex++;
                            }
                            count++;
                            System.out.println(sb);
                        }
                    }
                    System.out.println("\n count : " + count);
                    return index;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (bf != null) {
                try {
                    bf.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
        return index;
    }
}
Answer 0 (score: 0)
This makes me think that the server application is responding differently to requests from a desktop browser than to requests from your Java-based crawler. That may be because your browser passes cookies with its requests (for example a session-persistence cookie) that your Java-based crawler does not, or because your desktop browser sends a different User-Agent header than the crawler does, or because some other request header differs between the desktop browser and the Java crawler.
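If headers are the cause, a quick experiment is to have the crawler send browser-like headers and see whether the full links come back. A minimal sketch using HttpURLConnection (the User-Agent string below is just an illustrative desktop-browser value, and the commented-out Cookie line is a hypothetical placeholder, not something the crawler necessarily needs):

    URL my_url = new URL("http://www.amazon.com/s/ref=lp_165993011_ex_n_1?rh=n%3A165793011&bbn=165793011&ie=UTF8&qid=1376550433");
    java.net.HttpURLConnection conn = (java.net.HttpURLConnection) my_url.openConnection();
    // present a desktop-browser identity instead of Java's default "Java/1.x" agent
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36");
    // conn.setRequestProperty("Cookie", "session-id=...");  // only if a session cookie turns out to matter
    BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));

The rest of the program can then read from this reader exactly as it reads from my_url.openStream() today.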
This is one of the biggest gotchas you run into when writing crawling applications: it is easy to forget that the same URL requested by different clients is not always answered with the same content. Not sure that is what is happening here, but it is very common.
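As an aside (not something the answer above prescribes), if pulling in a third-party library is an option, jsoup addresses both concerns at once: it can send a browser-like User-Agent, and it extracts anchors with a real HTML parser instead of scanning for href=" by hand. A rough sketch, assuming the jsoup jar is on the classpath:

    org.jsoup.nodes.Document doc = org.jsoup.Jsoup
            .connect("http://www.amazon.com/s/ref=lp_165993011_ex_n_1?rh=n%3A165793011&bbn=165793011&ie=UTF8&qid=1376550433")
            .userAgent("Mozilla/5.0")  // browser-like agent, as discussed above
            .get();
    for (org.jsoup.nodes.Element link : doc.select("a[href]")) {
        System.out.println(link.attr("href"));  // full attribute value; quoting is handled by the parser
    }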