Java screen scraping using sockets?

Time: 2013-01-21 03:48:33

Tags: java sockets networking screen scrape

I'm trying to collect the HTML from this website: http://movies.about.com/od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm

I open a socket and try to read and print each line of the HTML page. When I run it, the only output I get is "EOF is false" followed by "1".

I'm not sure what is going wrong, since I know this approach worked in another example... Thank you very much for your help!

import java.net.*;
import java.io.*;
import java.util.*;

public class Twitter {

    static final int DEFAULT_PORT = 80;

    protected DataInputStream reply = null;
    protected PrintStream send = null;
    protected Socket sock = null;

    // ***********************************************************
    // *** The constructors create the socket and set up the input
    // *** and output channels on that socket.

    public Twitter() throws UnknownHostException, IOException {
        this(DEFAULT_PORT);
    }

    public Twitter(int port) throws UnknownHostException, IOException {
        sock = new Socket("movies.about.com", port);
        System.out.println(sock);
        reply = new DataInputStream(sock.getInputStream());
        System.out.println();
        send = new PrintStream(sock.getOutputStream());
    }

    // ***********************************************************
    // *** forecast uses the socket that has already been created
    // *** to carry on a conversation with the Web server that it
    // *** has been contacted through the socket.

    public void forecast() {
        int i;
        String HTMLline;
        boolean eof, gotone;

        // *** This issues the same query that a Web browser would issue
        // *** to the Web server.

        try {
            send.println("GET /od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm HTTP/1.1");
        } catch (Exception e) {
            System.out.println("about.com server is down.");
        }

        // *** This section parses the response from the Web server.
        // *** NOTE THAT "real" EOF does not occur until the Web server
        // *** has closed the connection.

        eof = false;
        gotone = false;
        while (!eof) {
            System.out.println("EOF is false");
            try {
                System.out.println("1");
                HTMLline = reply.readLine();
                System.out.println("2");
                System.out.println(HTMLline);
                System.out.println("Here?");
                if (HTMLline != null) {
                    System.out.println("its not null");
                }
                if (HTMLline == null) {
                    System.out.println("WTFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF");
                } else {
                    eof = true;
                    System.out.println("is it?");
                }
            } catch (Exception e) {
                System.out.println("this exception happend");
                e.printStackTrace();
                eof = true;
            }
        }
    }

    // ***********************************************************
    // *** We need to close the socket when this class is destroyed.

    protected void finalize() throws Throwable {
        sock.close();
    }

    // ***********************************************************
    // *** The main program creates a new Twitter class and
    // *** sends that class the command line args (via findNumber).

    public static void main(String[] args) {
        Twitter aboutCom;
        DataInputStream cin = new DataInputStream(System.in);

        try {
            aboutCom = new Twitter();
            aboutCom.forecast();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

1 answer:

Answer 0 (score: 1)

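You haven't sent a valid HTTP request, so the server is still waiting for you to finish it. The GET line must be terminated by \r\n, and then you need a second \r\n as a blank line to terminate the request headers.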
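To illustrate the point about line terminators, here is a minimal sketch of a raw-socket request. The class name RawHttpGet and the helper buildRequest are hypothetical, not part of the question's code. Beyond what is described above, it assumes two extra headers: Host, which HTTP/1.1 requires, and Connection: close, so that the server closes the socket and readLine() eventually returns null. The host in main is the one from the question and may no longer serve this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpGet {

    // Builds a minimal HTTP/1.1 GET request (hypothetical helper).
    // Every line ends in CRLF, and a final empty line ends the headers.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) throws IOException {
        String host = "movies.about.com";
        String path = "/od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm";

        try (Socket sock = new Socket(host, 80)) {
            // Write the request bytes explicitly and flush, instead of
            // relying on println(), which uses the platform line separator.
            OutputStream out = sock.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Print the raw response (status line, headers, then body)
            // until the server closes the connection.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note that this prints the response exactly as received, so the HTML is preceded by the status line and headers, and the body may arrive in chunked encoding.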

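But you should be using URL, openConnection(), getInputStream(), etc. for this, rather than redundantly trying to reimplement HTTP yourself. All you gain by doing it the way you are is more opportunities for error.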
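A minimal sketch of that approach might look like the following. The class name UrlFetch and the helper fetch are hypothetical, and decoding the page as UTF-8 is an assumption; the actual charset should come from the response's Content-Type header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UrlFetch {

    // Reads the whole resource behind the URL and returns it as one string
    // (hypothetical helper). UTF-8 is assumed here for simplicity.
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question; the page may no longer exist.
        URL url = new URL("http://movies.about.com/od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm");
        System.out.print(fetch(url));
    }
}
```

Here the library builds the request, parses the status line and headers, and hands back only the body, so none of the request-formatting details have to be handled by hand.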