Python: waiting for a page to finish loading before parsing it with BeautifulSoup and urllib

Date: 2014-01-30 00:50:34

Tags: web-scraping beautifulsoup urllib

I am trying to fetch the current world population in real time, but when the page first loads it takes a few seconds to retrieve the data. When I run my program I get "loading..." instead of the population number. Is there a way to wait until the page has fully loaded before retrieving the information? Thanks in advance!

Here is the code:

import urllib.request
from bs4 import BeautifulSoup

htmlfile = urllib.request.urlopen("http://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live")

htmltext = htmlfile.read()

# Pass an explicit parser to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(htmltext, "html.parser")
body = soup.find(text="World population").find_previous('p')

print(body.text)

2 Answers:

Answer 0 (score: 0)

import time

import requests
from bs4 import BeautifulSoup

URL = "http://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live"

while True:
    html = requests.get(URL).text
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(text="World population").find_previous('p')
    # The counter reads "loading..." until the page's JavaScript fills it in;
    # keep re-fetching until the placeholder is gone.
    if 'loading...' not in body.text:
        print(body.text)
        break
    time.sleep(30)
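The retry loop above can be factored into a small reusable helper. This is only a sketch; `poll_until_ready`, its `fetch` callable, and the readiness predicate are names introduced here for illustration, not part of any library:

```python
import time

def poll_until_ready(fetch, is_ready, attempts=10, delay=30):
    """Call fetch() repeatedly until is_ready(result) is True.

    fetch: zero-argument callable returning the scraped text.
    is_ready: predicate rejecting placeholder values like 'loading...'.
    Returns the first ready result, or None if attempts run out.
    """
    for attempt in range(attempts):
        result = fetch()
        if is_ready(result):
            return result
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

You would then pass in a function that performs the `requests.get` and BeautifulSoup extraction, with `lambda t: 'loading...' not in t` as the predicate. Keeping the loop separate from the scraping makes the waiting logic easy to test without hitting the network.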

Answer 1 (score: 0)

You need an engine that can interpret the JavaScript on the page you are downloading.

A better solution would be to find a static version of the site, or another site that has this kind of information (I suspect this site doesn't actually measure anything — it just extrapolates the data).

But if you really want to use dryscrape, you can use this approach:


The function that waits for the page to finish loading is:

   import dryscrape

   session = dryscrape.Session()

   # visit the desired site
   session.set_html("<html></html>")
   session.visit(link)
   # wait until the page has rendered; `watToWait` is a user-supplied
   # predicate that returns True once the content is available
   session.driver.wait_for(lambda: watToWait(session))