Question

I have been struggling to scrape the following site: http://www.fightingillini.com/schedule.aspx?path=softball

I have plenty of experience scraping both static and dynamic content with Node / cheerio / scraperjs, but I have had no luck cracking this site.

        scraperjs.DynamicScraper.create('http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9')
            .scrape(function() {
              return $('item').map(function() {
                return $(this).children('title').text();
              }).get();
            }, function(list) {
              console.log(list);
            });

Any help, feedback, or library suggestions would be greatly appreciated. Thanks!

Answer 1

ASP.NET Web Forms pages can be very hard to scrape because of their complex ViewState hidden form inputs. Sometimes that is even considered a feature ;)

In this case, I would go for the RSS feed, which is linked from the page you are trying to scrape.

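That feed gives you the same content, but in a friendlier, standard XML format. Code that parses it is likely to be easier to get right. Most importantly, the feed format is meant to stay stable, whereas on the regular page even a small tweak to the site theme could break your parsing code.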

The point is that RSS links are, in a sense, meant for scraping, so look there first.

Here is a sample of one of the current entries:

<item>
<title>2/6 11:30 AM [L] Softball vs  Winthrop</title>
<description>L 1-5 http://www.fightingillini.com/calendar.aspx?id=8670</description>
<link>http://www.fightingillini.com/calendar.aspx?id=8670</link>
<guid isPermaLink="true">http://www.fightingillini.com/calendar.aspx?id=8670</guid>
<ev:gameid>8670</ev:gameid>
<ev:location>Athens, Ga.</ev:location>
<ev:startdate>2015-02-06T17:30:00.0000000Z</ev:startdate>
<ev:enddate>2015-02-06T20:30:00.0000000Z</ev:enddate>
<s:localstartdate>2015-02-06T11:30:00.0000000</s:localstartdate>
<s:localenddate>2015-02-06T14:30:00.0000000</s:localenddate>
<s:teamlogo>http://www.fightingillini.com/images/logos/site/site.png</s:teamlogo>
<s:opponentlogo>http://www.fightingillini.com/images/logos/z16.png</s:opponentlogo>
<s:links>
</s:links>
</item>
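
As a rough illustration of how little code an entry like the one above needs, here is a minimal, dependency-free sketch in Node. The helper names `tagText`, `parseItem`, and `parseFeed` are hypothetical, the sample feed is trimmed from the entry above, and the regex-based extraction is an assumption made to keep the example self-contained; a real XML parser (for example cheerio loaded with `xmlMode: true`) would be the more robust choice.

```javascript
// Trimmed copy of the feed entry shown above, wrapped in a minimal channel.
const sampleFeed = `
<rss><channel>
<item>
<title>2/6 11:30 AM [L] Softball vs  Winthrop</title>
<link>http://www.fightingillini.com/calendar.aspx?id=8670</link>
<ev:location>Athens, Ga.</ev:location>
<ev:startdate>2015-02-06T17:30:00.0000000Z</ev:startdate>
</item>
</channel></rss>`;

// Hypothetical helper: text of the first <tag>...</tag> in xml, or null.
function tagText(xml, tag) {
  const pattern = new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`);
  const match = xml.match(pattern);
  return match ? match[1].trim() : null;
}

// Hypothetical helper: turn one <item> block into a plain object.
function parseItem(itemXml) {
  return {
    title: tagText(itemXml, 'title'),
    link: tagText(itemXml, 'link'),
    location: tagText(itemXml, 'ev:location'),
    startUtc: tagText(itemXml, 'ev:startdate'), // kept as the raw string
  };
}

// Hypothetical helper: split a feed into <item> blocks and parse each one.
function parseFeed(feedXml) {
  const items = feedXml.match(/<item(?:\s[^>]*)?>[\s\S]*?<\/item>/g) || [];
  return items.map(parseItem);
}

console.log(parseFeed(sampleFeed));
```

Fetching the feed itself is left out here; any HTTP client that returns the response body as a string can feed `parseFeed`.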

The page also provides http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9, if that works for you.

Scraping .aspx pages in Node
