Parsing a lazy-loaded results table (JSON)

Time: 2012-06-19 17:47:22

Tags: asp.net html-parsing

I am trying to parse this link: http://agent.bronni.ru/Result.aspx?id=c7a6a33a-174e-426d-b127-828ee612c36e&account=27178&page=1&pageSize=50&mr=true

But I cannot get the results table, because in Fiddler I can see that it is filled by a lazy-loading method that returns JSON.

My code is:

    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load("http://agent.bronni.ru/Result.aspx?id=c7a6a33a-174e-426d-b127-828ee612c36e&account=27178&page=1&pageSize=50&mr=true");

    // Get all tables in the document
    HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");

    // Iterate all rows in the first table
    HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");

    var data = rows.Skip(1).ToList().Take(10).ToList().Select(x => new TableRow()
    {
        Price = x.SelectNodes(".//td").ToList()[4].InnerText,
        Operator = x.SelectNodes(".//td").ToList()[15].InnerText,
        DepartureDate = x.SelectNodes(".//td").ToList()[6].InnerText,
        DestinationRegion = x.SelectNodes(".//td").ToList()[7].InnerText
    }).ToList();

Update, second site. Code:

 WebClient wc = new WebClient();
        wc.Headers.Add("Referer", "http://sletat.ru/");//MUST BE THIS HEADER
        string result = wc.DownloadString("http://module.sletat.ru/Main.svc/GetTours?cityFromId=832&countryId=35&cities=&meals=&stars=&hotels=&s_adults=1&s_kids=0&s_kids_ages=&s_nightsMin=6&s_nightsMax=16&s_priceMin=0&s_priceMax=&currencyAlias=RUB&s_departFrom=25%2F06%2F2012&s_departTo=31%2F07%2F2012&visibleOperators=&s_hotelIsNotInStop=true&s_hasTickets=true&s_ticketsIncluded=true&debug=0&filter=0&f_to_id=&requestId=19198631&pageSize=20&pageNumber=1&updateResult=1&includeDescriptions=1&includeOilTaxesAndVisa=1&userId=&jskey=1&callback=_jqjsp&_1340633427022=");
        result = result.Substring(result.IndexOf("{"), result.LastIndexOf("}") - result.IndexOf("{") + 1);
        JavaScriptSerializer js = new JavaScriptSerializer();
        dynamic json = js.DeserializeObject(result);
        var prices = json["GetToursResult"]["Data"]["aaData"] as object[];
        // var operators = ((object[])json["result"]["prices"]).Cast<Dictionary<string, object>>();
        var temp = prices.ToList().Take(20).Select(x => new TableRow
        {
            Operator = (x as object[])[40].ToString(),
            //Price = x["operatorPrice"].ToString(),
            //DepartureDate = x["checkinDate"].ToString(),
            //DestinationRegion = ((Dictionary<string, object>)x["country"])["englishName"].ToString()
        }).ToList();

        string str = "";

        foreach (var tableRow in temp)
        {
            str += tableRow.Operator + "<br />";
        }
        Response.Write(str);

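Everything I try this way works fine, but the problem is that this link only works for about 30 minutes, and then I have to put in a new link. Is there any way to solve this? (Only the second site has this problem.) Thanks again,

1 Answer:

Answer 0: (score: 0)

The data actually comes from here: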

http://beta.remote.bronni.ru/LazyLoading.ashx/getResult?jsonp=jQuery17207647891761735082_1340131755603&id=c7a6a33a-174e-426d-b127-828ee612c36e&page=3&pageSize=50&_=1340131756631

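You can adjust page=# and pageSize=# dynamically.

So instead of parsing the HTML, you just need to fetch the JSON data from that URL and parse it. For example: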

// Namespaces needed: System.Net, System.Linq, System.Collections.Generic,
// System.Web.Script.Serialization (assembly System.Web.Extensions)
WebClient wc = new WebClient();
string result = wc.DownloadString("http://beta.remote.bronni.ru/LazyLoading.ashx/getResult?jsonp=jQuery17207647891761735082_1340131755603&id=c7a6a33a-174e-426d-b127-828ee612c36e&page=1&pageSize=1000&_=1340131756631");
// Cut off the JSONP wrapper so that only the JSON object remains
result = result.Substring(result.IndexOf("{"),result.LastIndexOf("}")-result.IndexOf("{")+1);
JavaScriptSerializer js = new JavaScriptSerializer();
dynamic json = js.DeserializeObject(result);
// "prices" is an array; each element deserializes to a Dictionary<string,object>
var prices = ((object[])json["result"]["prices"]).Cast<Dictionary<string,object>>();
var data = from p in prices
           select new
           {
               OperatorID = p["operatorID"],
               Price = p["operatorPrice"],
               Country = ((Dictionary<string,object>)p["country"])["englishName"],
               CheckinDate = p["checkinDate"]
           };

Console.WriteLine(data);

In my LinqPad program, this produces something like:

OperatorID Price Country CheckinDate 
0          1,27  Greece  2012-06-28 
0          55,90 Greece  2012-06-28 
0          67,34 Greece  2012-06-28 

There are more rows, depending on how many you request...
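Since the number of rows depends on page and pageSize, it may be convenient to build the request URL from those values instead of editing the string by hand. The following is only a sketch: the class name LazyLoadingUrl and the method Build are made up for illustration, the callback name "callback" is an arbitrary placeholder, and it is an assumption that the handler accepts a request without the trailing _= cache-busting parameter.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of any library.
public static class LazyLoadingUrl
{
    // Endpoint taken from the URL shown above.
    private const string Endpoint =
        "http://beta.remote.bronni.ru/LazyLoading.ashx/getResult";

    // Builds the request URL for one page of results.
    // "callback" is an arbitrary JSONP callback name (assumption).
    public static string Build(string searchId, int page, int pageSize)
    {
        if (searchId == null) throw new ArgumentNullException("searchId");
        if (page < 1) throw new ArgumentOutOfRangeException("page");
        if (pageSize < 1) throw new ArgumentOutOfRangeException("pageSize");

        return string.Format(
            CultureInfo.InvariantCulture,
            "{0}?jsonp=callback&id={1}&page={2}&pageSize={3}",
            Endpoint,
            Uri.EscapeDataString(searchId),
            page,
            pageSize);
    }
}
```

A call such as LazyLoadingUrl.Build("c7a6a33a-174e-426d-b127-828ee612c36e", 1, 50) would then give the string to pass to wc.DownloadString.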

Note that the reason for the line result = result.Substring(result.IndexOf("{"),result.LastIndexOf("}")-result.IndexOf("{")+1); is that the JSONP result has this junk at the beginning:

jQuery17207647891761735082_1340131755603({"

and ends with }), which makes JavaScriptSerializer choke when it tries to parse it; therefore it has to be removed.
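The same stripping step can be wrapped in a small helper so it does not have to be repeated for every request. This is only a sketch: JsonpHelper and StripPadding are hypothetical names, and it assumes the response is either plain JSON or a single callback(...) wrapper. Instead of searching for braces, it removes everything up to the first ( and from the last ) onward.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class JsonpHelper
{
    // Returns the JSON payload of a JSONP response such as callback({...});
    // Input that already looks like plain JSON is returned unchanged (trimmed).
    public static string StripPadding(string response)
    {
        if (response == null) throw new ArgumentNullException("response");

        string text = response.Trim();
        if (text.StartsWith("{") || text.StartsWith("["))
        {
            return text; // no wrapper present
        }

        int open = text.IndexOf('(');
        int close = text.LastIndexOf(')');
        if (open < 0 || close <= open)
        {
            return text; // not a recognizable JSONP wrapper; leave as is
        }

        // Keep only what sits between the outermost parentheses.
        return text.Substring(open + 1, close - open - 1).Trim();
    }
}
```

With this in place, the Substring line in the code above could be replaced by result = JsonpHelper.StripPadding(result);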

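Update:

Interestingly, the ASHX handler that returns the data seems to require a Referer header in the request; otherwise, the response will not include the operator information. The required Referer cannot be just anything you like; it seems to actually be looking for http://agent.bronni.ru

Basically, all you need to do is: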

WebClient wc = new WebClient();
wc.Headers.Add("Referer", "http://agent.bronni.ru"); // MUST BE THIS HEADER
string result = wc.DownloadString("http://beta.remote.bronni.ru/LazyLoading.ashx/getResult?jsonp=jQuery17207647891761735082_1340131755603&id=c7a6a33a-174e-426d-b127-828ee612c36e&page=1&pageSize=1000&_=1340131756631");
result = result.Substring(result.IndexOf("{"),result.LastIndexOf("}")-result.IndexOf("{")+1);
JavaScriptSerializer js = new JavaScriptSerializer();
dynamic json = js.DeserializeObject(result);
var prices = ((object[])json["result"]["prices"]).Cast<Dictionary<string,object>>();
var data = from p in prices
           select new
           {
               OperatorID = p["operatorID"],
               Price = p["operatorPrice"],
               Country = ((Dictionary<string,object>)p["country"])["englishName"],
               Hotel = ((Dictionary<string,object>)p["hotel"])["englishName"],
               Operator = ((Dictionary<string,object>)p["operator"])["englishName"], // OPERATOR
               CheckinDate = p["checkinDate"]
           };

OperatorID Price Country Hotel                           Operator          CheckinDate 
19681      1,27  Greece  Julia Hotel                     Mouzenidis Travel 2012-06-28 
19681      1,27  Greece  Forest Park                     Mouzenidis Travel 2012-06-28 
19681      1,27  Greece  Kassandra Mare (ï-îâ Êàññàíäðà) Mouzenidis Travel 2012-06-28 

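Update 2:

I decided to compare the performance of the out-of-the-box JavaScriptSerializer with the JSON.NET serializer. In all my tests with different record counts (50, 1000, 3000), JSON.NET was at least twice as fast as JavaScriptSerializer, and in some cases, for the smaller record sets, even 10 times faster.

If you decide to use the JSON.NET library, here is code that will give you the same result as the code above: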

// Namespaces needed: System.Net, System.Linq, Newtonsoft.Json.Linq (JSON.NET)
WebClient wc = new WebClient();
wc.Headers.Add("Referer", "http://agent.bronni.ru");
string result = wc.DownloadString("http://beta.remote.bronni.ru/LazyLoading.ashx/getResult?jsonp=jQuery17207647891761735082_1340131755603&id=c7a6a33a-174e-426d-b127-828ee612c36e&page=1&pageSize=50&_=1340131756631");
// Cut off the JSONP wrapper so that only the JSON object remains
result = result.Substring(result.IndexOf("{"),result.LastIndexOf("}")-result.IndexOf("{")+1);
JObject o = JObject.Parse(result);
var data = from x in o["result"]["prices"]
           select new
           {
               OperatorID = x["operatorID"],
               Price = x["operatorPrice"],
               Country = x["country"]["englishName"],
               Hotel = x["hotel"]["englishName"],
               Operator = x["operator"]["englishName"],
               CheckinDate = x["checkinDate"]
           };

Console.WriteLine(data);