Web Scraper Using Scrapy

Date: 2014-05-15 13:07:11

Tags: python scrapy

I just need to parse the positions and points from this link. The link has 21 listings (I don't actually know what they are called), each with 40 players on it, except the last one. So far I have written code like this,

from bs4 import BeautifulSoup
import urllib2

def overall_standing():
    url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#", 
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]

    gDictPlayerPointsInfo = {}
    for url in url_list:
        print url
        header = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url,headers=header)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page)
        table = soup.find("table", { "class" : "ladder zebra player-ratings" })

        lCount = 1
        for row in table.find_all("tr"):
            lPlayerName = ""
            lTeamName = ""
            lPosition = ""
            lPoint = ""
            for cell in row.find_all("td"):
                if lCount == 2:
                    lPlayerName = str(cell.get_text()).strip().upper()
                elif lCount == 3:
                    lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
                elif lCount == 4:
                    lPosition = str(cell.get_text().strip())
                elif lCount == 6:
                    lPoint = str(cell.get_text().strip())

                lCount += 1

            if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
                print lTeamName, lPlayerName, lPoint
            if lPlayerName != "" and lTeamName != "":
                lStr = lPosition + "," + lPoint

#                 if gDictPlayerPointsInfo.has_key(lTeamName):
#                     gDictPlayerPointsInfo[lTeamName].append({lPlayerName:lStr})
#                 else:
                gDictPlayerPointsInfo[lTeamName+","+lPlayerName] = lStr
            lCount = 1


    lfp = open("a.txt","w")
    for key in gDictPlayerPointsInfo:
        if key.find("RICHMOND") != -1:
            lfp.write(str(gDictPlayerPointsInfo[key]))

    lfp.close()
    return gDictPlayerPointsInfo


# overall_standing()

But the problem is that it always gives me the points and positions of the first listing and ignores the other 20. How can I get the positions and points for all 21? Now, I hear scrapy can do this type of thing quite easily, but I am not fully familiar with it. Is there any other way besides using scrapy?

1 Answer:

Answer 0 (score: 5):

This is happening because those links are not handled by the server: the portion of the link after the # symbol, called the fragment identifier, is processed by the browser and refers to some link or javascript behavior, i.e. loading a different set of results.

I would suggest two approaches: either find a way to use links which the server can evaluate, so you can continue with scrapy, or use a webdriver like selenium.

Scrapy

Your first step is to identify the ajax load calls and use those links to pull your information. These are calls to the site's database. They can be found by opening the web inspector and watching the network traffic as you click through to the next page of search results:

Before clicking: (screenshot)

After clicking: (screenshot)

we can see that there is a new call to this url:

http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40

This url returns a json file which can be parsed, and it even looks like you can control more of what information gets returned to you, shortening your steps.
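As an illustration, here is a minimal sketch of flattening such a payload into the same kind of dictionary the question builds. Note that the field names (playerRatings, playerName, ratingPoints, etc.) are invented for the example, since the actual response shape is not shown here:

```python
import json

# Hypothetical payload shaped like what the playerRatings endpoint might
# return -- the field names below are assumptions for illustration only.
sample = '''
{
  "playerRatings": [
    {"player": {"playerName": {"givenName": "A", "surname": "PLAYER"}},
     "team": "RICHMOND", "position": "MID", "ratingPoints": 512},
    {"player": {"playerName": {"givenName": "B", "surname": "SOMEONE"}},
     "team": "GEELONG", "position": "FWD", "ratingPoints": 498}
  ]
}
'''

def parse_ratings(raw):
    """Flatten the assumed payload into {"TEAM,NAME": "position,points"}."""
    data = json.loads(raw)
    result = {}
    for entry in data["playerRatings"]:
        name = entry["player"]["playerName"]
        full_name = (name["givenName"] + " " + name["surname"]).upper()
        result[entry["team"] + "," + full_name] = \
            entry["position"] + "," + str(entry["ratingPoints"])
    return result
```

Calling `parse_ratings(sample)` would give entries like `"RICHMOND,A PLAYER": "MID,512"`; the real keys would of course depend on what the API actually returns.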

You could write a method to generate a series of links for you:

def gen_url(page_no):
    return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"

and then, for example, use scrapy with the seed list:

seed = [gen_url(i) for i in range(1, 22)]

Or you can try tweaking the url parameters and see what you get; maybe you can fetch multiple pages at a time:

http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200

I changed the pageSize parameter at the end to 200, since it seems to correspond directly to the number of results returned.
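If you would rather not build that query string by hand, the standard library can assemble it for you. A small sketch (the parameter names are exactly those appearing in the URL above):

```python
try:
    from urllib import urlencode           # Python 2
except ImportError:
    from urllib.parse import urlencode     # Python 3

BASE = "http://www.afl.com.au/api/cfs/afl/playerRatings"

def ratings_url(round_id, page_num, page_size=40):
    # Build the query string from (key, value) pairs instead of
    # concatenating strings; order is preserved by using a list.
    params = urlencode([("roundId", round_id),
                        ("pageNum", page_num),
                        ("pageSize", page_size)])
    return BASE + "?" + params
```

For example, `ratings_url("CD_R201401408", 1, page_size=200)` reproduces the tweaked URL above, and bumping `page_num` walks through the pages.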

NOTE: there is a chance this approach will not work, as sites sometimes block their data APIs from outside usage by screening the IP that requests come from.

If that is the case, you should go with the following approach.

Selenium (or another webdriver)

Using a webdriver like selenium, you can use what is loaded into the browser to evaluate data that is loaded after the server returns the webpage.

There is some initial setup to do before selenium is usable, but once you have it, it is a very powerful tool.

A simple example of this would be:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

You will see a python-controlled Firefox browser open on your screen (this can be done with other browsers too), load the url you provide, and then follow the commands you give it. The commands can also be issued from a shell (useful for debugging), and you can search and parse the html in the same way you would with scrapy.

If you want to perform an action like clicking the next page button:

driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

That expression may need some tweaking, but it is intended to find all the li elements with class='page' inside the div with class='pagination'. The // means a shortened path between elements; your alternative would be something like /html/body/div/div/..... until you reach the element in question, which is why //div/... is so useful and appealing.
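You can experiment with the selector idea offline, too. The standard library's xml.etree.ElementTree understands a small subset of XPath (far less than what selenium passes to the browser, but enough for attribute predicates). The markup below is an invented stand-in for the pagination block, not the real afl.com.au structure:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the pagination markup, for illustration only.
html = """
<div class="pagination">
  <ul>
    <li class="page">1</li>
    <li class="page">2</li>
    <li class="next">next</li>
  </ul>
</div>
"""

root = ET.fromstring(html)
# Matches the li elements with class='page', but not the 'next' item.
pages = root.findall(".//li[@class='page']")
```

Here `[li.text for li in pages]` picks out only the numbered page items, mirroring what the selenium expression above does in the live browser.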

For specific help and reference on locating elements, see the selenium documentation on locating elements.

My usual method is trial and error, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy: after setting up the driver as above, I build up my expression step by step. Say you have an html structure like:

<html>
    <head></head>
    <body>
        <div id="container">
            <div id="info-i-want">
                treasure chest
            </div>
        </div>
    </body>
</html>

I would start out with something like:

>>> print driver.find_element_by_xpath("//body").get_attribute("outerHTML")
<body> <div id="container"> <div id="info-i-want"> treasure chest </div> </div> </body>
>>> print driver.find_element_by_xpath("//div[@id='container']").get_attribute("outerHTML")
<div id="container"> <div id="info-i-want"> treasure chest </div> </div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").get_attribute("outerHTML")
<div id="info-i-want"> treasure chest </div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").text
treasure chest
>>> # BOOM TREASURE!

Back to your case: you can save the page links into an array, then click them one by one, scraping the new data after each click. A full sketch:

import time
from selenium import webdriver

driver = None
try:
    driver = webdriver.Firefox()
    driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

    #
    # Scrape the first page
    #

    links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

    for link in links:
        link.click()
        #
        # scrape the next page
        #
        time.sleep(1) # pause for a time period to let the data load
finally:
    if driver:
        driver.close()

Usually it will be more complex than this, but trial and error in the shell is a good and often necessary debugging tactic.

Note how the example wraps everything in a try...finally type block: this makes sure the driver instance gets closed even if the scraping code fails.

Should you decide to delve deeper into the selenium approach, refer to their docs, which are excellent and include very explicit documentation and examples.

Happy scraping!