Socket.Receive hangs

Time: 2011-07-21 11:08:06

Tags: c# .net windows sockets screen-scraping

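I am trying to download Bing and Ask search result pages. I decided to use raw sockets instead of WebClient.

socket.Receive() hangs after a few loop iterations for Bing, Yahoo, and Google, but it works for Ask. For Google, the loop receives 4-5 times and then the call freezes.

I can't figure out why.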

public string Get(string url)
{
    Uri requestedUri = new Uri(url);
    string fulladdress = requestedUri.Host;
    IPHostEntry entry = Dns.GetHostEntry(fulladdress);
    StringBuilder sb = new StringBuilder();

    try
    {
        using (Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.IP))
        {
            socket.Connect(entry.AddressList[0], 80);

            NetworkStream ns = new NetworkStream(socket);

            string part_request = string.Empty;
            string build_request = string.Empty;
            if (jar.Count != 0)
            {
                part_request = "GET {0} HTTP/1.1\r\nHost: {1} \r\nUser-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nCookie: {2}\r\nConnection: keep-alive\r\n\r\n";
                build_request = string.Format(part_request, requestedUri.PathAndQuery, requestedUri.Host, GetCookies(requestedUri));
            }
            else
            {
                part_request = "GET {0} HTTP/1.1\r\nHost: {1} \r\nUser-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nConnection: keep-alive\r\n\r\n";
                build_request = string.Format(part_request, requestedUri.PathAndQuery, requestedUri.Host);
            }

            byte[] data = Encoding.UTF8.GetBytes(build_request);
            socket.Send(data, data.Length, 0);

            byte[] bytesReceived = new byte[102400];
            int bytes = 0;

            do
            {
                bytes = socket.Receive(bytesReceived, bytesReceived.Length, 0);
                sb.Append(Encoding.ASCII.GetString(bytesReceived, 0, bytes));
            }
            while (bytes > 0);

            List<String> CookieHeaders = new List<string>();
            foreach (string header in sb.ToString().Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
            {
                if (header.StartsWith("Set-Cookie"))
                {
                    CookieHeaders.Add(header.Replace("Set-Cookie: ", ""));
                }
            }

            this.AddCookies(CookieHeaders, requestedUri);

            socket.Close();
        }
    }
    catch (Exception ex)
    {
        string errorMessage = ex.Message;
    }

    return sb.ToString();
}

CookieContainer jar = new CookieContainer();

public string GetCookies(Uri _uri)
{
    StringBuilder sb = new StringBuilder();
    CookieCollection collection = jar.GetCookies(_uri);

    if (collection.Count != 0)
    {
        foreach (Cookie item in collection)
        {
            sb.Append(item.Name + "=" + item.Value + ";");
        }
    }
    return sb.ToString();
}

3 answers:

Answer 0 (score: 8)

Because you have reached the end of the content, but you are still asking for more...

do
{
   bytes = socket.Receive(bytesReceived, bytesReceived.Length, 0);
   sb.Append(Encoding.ASCII.GetString(bytesReceived, 0, bytes));
}
while (bytes > 0);

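This assumes there is more to read as long as the last call returned more than 0 bytes. In fact, when the network stream reaches its end, the last loop iteration has most likely only partially filled the buffer (i.e. bytes > 0, but nothing more is coming). The next call to Receive then waits for data that never arrives, until the server closes the connection.

Try something like this...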

do
{
   bytes = socket.Receive(bytesReceived, bytesReceived.Length, 0);
   sb.Append(Encoding.ASCII.GetString(bytesReceived, 0, bytes));
}
while (bytes == bytesReceived.Length);

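Some servers (Ask may be one of them) apparently behave differently when it comes to closing the connection, which would explain why it doesn't always fail.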
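
Another way to avoid guessing from the size of the last read is to let the response say how long it is. The following is a minimal sketch, not part of the original answer, assuming the server sends a `Content-Length` header. The `HttpFraming` class and `IsComplete` method are hypothetical names chosen for illustration. Responses that use chunked transfer encoding carry no such header, so the sketch returns false for them and the caller would need a different stopping rule.

```csharp
using System;
using System.Text;

static class HttpFraming
{
    // Hypothetical helper: returns true once the first `count` bytes of
    // `received` hold the full header block plus Content-Length bytes of body.
    // Returns false if the headers are incomplete or no usable
    // Content-Length header is present (for example, a chunked response).
    public static bool IsComplete(byte[] received, int count)
    {
        // ASCII decoding yields one char per byte, so string offsets
        // equal byte offsets.
        string text = Encoding.ASCII.GetString(received, 0, count);

        int headerEnd = text.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        if (headerEnd < 0)
        {
            return false; // headers not fully received yet
        }

        int bodyStart = headerEnd + 4;
        string[] lines = text.Substring(0, headerEnd)
                             .Split(new[] { "\r\n" }, StringSplitOptions.None);

        const string name = "Content-Length:";
        foreach (string line in lines)
        {
            if (line.StartsWith(name, StringComparison.OrdinalIgnoreCase))
            {
                int length;
                if (int.TryParse(line.Substring(name.Length).Trim(), out length))
                {
                    return count - bodyStart >= length;
                }
            }
        }

        return false; // length unknown
    }
}
```

In a receive loop, the bytes read so far could be accumulated in one buffer, and the loop could exit as soon as `IsComplete` returns true instead of calling Receive again.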

::: EDIT :::

My test sample:

Open Visual Studio, create a new console application, and paste the following into the generated Program class file (replacing all existing code):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Net.Sockets;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string test = Get("http://www.google.co.uk/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a");
            Console.Read();
        }

        public static string Get(string url)
        {
            Uri requestedUri = new Uri(url);
            string fulladdress = requestedUri.Host;
            IPHostEntry entry = Dns.GetHostEntry(fulladdress);
            StringBuilder sb = new StringBuilder();

            try
            {
                using (Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.IP))
                {
                    socket.Connect(entry.AddressList[0], 80);

                    NetworkStream ns = new NetworkStream(socket);

                    string part_request = string.Empty;
                    string build_request = string.Empty;
                    if (jar.Count != 0)
                    {
                        part_request = "GET {0} HTTP/1.1\r\nHost: {1} \r\nUser-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nCookie: {2}\r\nConnection: keep-alive\r\n\r\n";
                        build_request = string.Format(part_request, requestedUri.PathAndQuery, requestedUri.Host, GetCookies(requestedUri));
                    }
                    else
                    {
                        part_request = "GET {0} HTTP/1.1\r\nHost: {1} \r\nUser-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nConnection: keep-alive\r\n\r\n";
                        build_request = string.Format(part_request, requestedUri.PathAndQuery, requestedUri.Host);
                    }

                    byte[] data = Encoding.UTF8.GetBytes(build_request);
                    socket.Send(data, data.Length, 0);

                    byte[] bytesReceived = new byte[4096];
                    int bytes = 0;
                    string currentBatch = "";

                    do
                    {
                        bytes = socket.Receive(bytesReceived);
                        currentBatch = Encoding.ASCII.GetString(bytesReceived, 0, bytes);
                        Console.Write(currentBatch);
                        sb.Append(currentBatch);
                    }
                    while (bytes == bytesReceived.Length);

                    List<String> CookieHeaders = new List<string>();
                    foreach (string header in sb.ToString().Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
                    {
                        if (header.StartsWith("Set-Cookie"))
                        {
                            CookieHeaders.Add(header.Replace("Set-Cookie: ", ""));
                        }
                    }

                    //this.AddCookies(CookieHeaders, requestedUri);

                    socket.Close();
                }
            }
            catch (Exception ex)
            {
                string errorMessage = ex.Message;
            }

            return sb.ToString();
        }

        static CookieContainer jar = new CookieContainer();

        public static string GetCookies(Uri _uri)
        {
            StringBuilder sb = new StringBuilder();
            CookieCollection collection = jar.GetCookies(_uri);

            if (collection.Count != 0)
            {
                foreach (Cookie item in collection)
                {
                    sb.Append(item.Name + "=" + item.Value + ";");
                }
            }
            return sb.ToString();
        }
    }
}

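I reduced the buffer size to make sure it gets filled more than once... judging from my results, it seems to work fine. This post comes with the typical "works on my PC" guarantee :)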

Answer 1 (score: 0)

You are reading more than you are given.

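  1. So, you open a connection to Google and ask for the home page.
  2. Google gives you the home page, which is 10 KB.
  3. You allocate a buffer of 102400 bytes (that is, 100 KB), 10 times more than you need.
  4. Now, this is where the problem occurs.

    1. You keep reading the home page, a few bytes at a time, and you have now reached the 10 KB mark. Google has already given you the entire home page, but you keep trying to read, asking for more data! What happens now is that you are waiting for more data, and no more data is coming. You just keep waiting until your timeout is reached. Because your code says to keep going until you have read 100 KB, but you were only given 10 KB, you never get there and appear to hang in that loop!
    2. The solution?

      Check whether you received any bytes.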

      bytes = socket.Receive(...);
      if (bytes == 0)
      {
          // no more data, exit loop. you can `break;` or use a while loop, as demonstrated below
      }
      

      This is probably how you would implement it cleanly:

      do
      {
         bytes = socket.Receive(...);
         // Process your data
      }
      while (bytes > 0);
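
A loop that stops on `bytes == 0` only terminates once the server closes the connection, because Receive returns 0 only after the peer has shut down its side. The request in the question sends `Connection: keep-alive`, which asks the server to leave the connection open. A minimal sketch, assuming the server honours `Connection: close`, is shown here; the `CloseDelimited` class and its method names are hypothetical and not from the original post.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

static class CloseDelimited
{
    // Builds a minimal GET request that asks the server to close the
    // connection once the response has been sent.
    public static string BuildRequest(Uri uri)
    {
        return "GET " + uri.PathAndQuery + " HTTP/1.1\r\n" +
               "Host: " + uri.Host + "\r\n" +
               "Connection: close\r\n\r\n";
    }

    // Reads until Receive returns 0, which signals that the peer has
    // closed its side of the connection.
    public static byte[] ReadUntilClosed(Socket socket)
    {
        byte[] buffer = new byte[4096];
        using (MemoryStream all = new MemoryStream())
        {
            int read;
            while ((read = socket.Receive(buffer)) > 0)
            {
                all.Write(buffer, 0, read);
            }
            return all.ToArray();
        }
    }
}
```

With a request built this way, the `do ... while (bytes > 0)` pattern above has a well-defined end, provided the server actually closes the connection after responding.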
      

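Answer 2 (score: 0)

Testing the number of bytes received should work most of the time, but what happens if the last chunk of data exactly matches the buffer length?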

byte[] requestBuffer = new byte[100];
int bytesRead;
do
{
    bytesRead = socket.Receive(requestBuffer);
    //do something
}
while (bytesRead == requestBuffer.Length);

I suggest using the Available property to make sure the program stops reading when no more data is available:

byte[] requestBuffer = new byte[100];
int bytesRead;
while (socket.Available > 0)    
{
    bytesRead = socket.Receive(requestBuffer);
    //do something
}
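
One caveat: Available reports only the bytes that have already arrived, so it can be 0 before the server has sent anything or between two packets, and the loop may then exit early. As a hedged alternative, the sketch below relies on the socket's ReceiveTimeout property and treats a timeout as the end of the response. The `TimedReader` class, the `ReadWithTimeout` method, and the idea of using a timeout here are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

static class TimedReader
{
    // Reads until the peer closes the connection or until no data
    // arrives within timeoutMs milliseconds.
    public static byte[] ReadWithTimeout(Socket socket, int timeoutMs)
    {
        socket.ReceiveTimeout = timeoutMs; // applies to blocking Receive calls
        byte[] buffer = new byte[4096];

        using (MemoryStream all = new MemoryStream())
        {
            try
            {
                int read;
                while ((read = socket.Receive(buffer)) > 0)
                {
                    all.Write(buffer, 0, read);
                }
            }
            catch (SocketException ex)
            {
                // Rethrow anything that is not a receive timeout.
                if (ex.SocketErrorCode != SocketError.TimedOut)
                {
                    throw;
                }
                // Timed out: treat the silence as the end of the response.
            }
            return all.ToArray();
        }
    }
}
```

The trade-off is that every request pays the timeout as extra latency when the server keeps the connection open, and a slow server can be cut off mid-response, so the timeout value needs to be chosen with that in mind.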