Optimizing web scraping

Time: 2017-07-28 18:47:25

Tags: python-3.x web-scraping beautifulsoup web-crawler

The following scrape, although very short, is painfully slow. I mean, "take in a feature-length film while you wait" slow.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def bestActressDOB():
    # create empty bday list
    bdays = []
    # for every actress link scraped from the base url
    for actress in getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress"):
        # use the relative link to build the unique actress url
        URL = "http://en.wikipedia.org" + actress
        # fetch the html
        html = urlopen(URL)
        # create soup object
        bsObj = BeautifulSoup(html, "lxml")
        # get text from <span class="bday">
        try:
            bday = bsObj.find("span", {"class": "bday"}).get_text()
        except AttributeError:
            # no bday span on this page: report the url and skip it,
            # rather than appending a stale value from the previous loop
            print(URL)
            continue
        bdays.append(bday)
        print(bday)
    return bdays

It grabs the name of every actress nominated for an Academy Award from a table on one Wikipedia page, converts those names into a list, uses them to build the URL for each actress's wiki page, and scrapes her date of birth there. The data will be used to calculate the age at which each actress was nominated for, or won, the Academy Award for Best Actress. Big O aside, is there a way to speed this up in real (wall-clock) time? I have little experience with this sort of thing, so I'm not sure how normal this is. Thoughts?
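Since each iteration is an independent HTTP request, the wall-clock time is dominated by network round trips rather than anything Big O measures, so the usual remedy is to overlap those waits by fetching the pages concurrently. Below is a minimal sketch using the standard-library thread pool; it reuses the getBestActresses helper shown in the edit below, while fetch_bday and MAX_WORKERS are illustrative names that are not part of the original code.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from bs4 import BeautifulSoup

MAX_WORKERS = 8  # illustrative: tune with care and stay polite to the server

def fetch_bday(actress):
    # hypothetical helper: fetch one actress page and pull out her bday span
    url = "http://en.wikipedia.org" + actress
    soup = BeautifulSoup(urlopen(url), "lxml")
    span = soup.find("span", {"class": "bday"})
    return span.get_text() if span is not None else None

def bestActressDOBConcurrent():
    links = getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")
    # the work is I/O-bound, so threads overlap the network waits despite
    # the GIL; pool.map returns results in the same order as links
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(fetch_bday, links))

Because the loop spends nearly all of its time waiting on sockets, even a handful of worker threads can cut the total run time roughly in proportion to the pool size.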

Edit: the requested subroutine

def getBestActresses(URL):
    bestActressNomineeLinks = []
    html = urlopen(URL)
    try:
        soup = BeautifulSoup(html, "lxml")
        table = soup.find("table", {"class": "wikitable sortable"})
        # if the table is missing, table is None and find_all raises here,
        # inside the try block rather than after it
        table_rows = table.find_all("tr")
    except AttributeError:
        print("Error creating/navigating soup object")
        return bestActressNomineeLinks
    for row in table_rows:
        # only the first data cell of each row names the actress
        first_data_cell = row.find_all("td")[0:1]
        for datum in first_data_cell:
            actress_name = datum.find("a")
            # skip cells that carry no link
            if actress_name is not None:
                bestActressNomineeLinks.append(actress_name.attrs['href'])
    #print(bestActressNomineeLinks)
    return bestActressNomineeLinks

1 answer:

Answer 0: (score: 1)

I would suggest trying a faster computer, or even running it on a service such as Google Cloud Platform, Microsoft Azure, or Amazon Web Services. No code is going to make it faster.
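A quick way to sanity-check that advice is to time one iteration and see how much goes to the network versus local parsing, since a faster machine or cloud VM only shrinks the latter. A minimal sketch; the sample URL is just an illustration, and any nominee link returned by getBestActresses would do.

import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

# hypothetical sample page standing in for any nominee link
URL = "http://en.wikipedia.org/wiki/Katharine_Hepburn"

t0 = time.perf_counter()
html = urlopen(URL).read()   # network: dominated by round-trip latency
t1 = time.perf_counter()
BeautifulSoup(html, "lxml")  # local work: parsing the downloaded document
t2 = time.perf_counter()

print("fetch: %.2fs  parse: %.2fs" % (t1 - t0, t2 - t1))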