Extracting text from a chart with Beautiful Soup

Asked: 2015-04-14 20:42:18

Tags: python beautifulsoup

I'm fairly new to BeautifulSoup, and I'm trying to extract data from this web page: http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg==#

I want to grab the numbers under the headings "Program Completers", "Employed Second Quarter", and so on. The relevant part of the HTML is:

<ul class="listbox">              
 <li class="li1">
  <p style="cursor:help" class="listtop" title="WIA Adult 
  completers are those individuals who have exited a WIA Adult program from 
  which the individual received a core staff-assisted service (such as job 
  search or placement assistance) or an intensive service (such as
  counseling, career planning, or job training). Those individuals who 
  participated in WIA through self-service, like OhioMeansJobs.com, or other 
  less intensive programs are not included in the dashboard.">Program 
  Completers</p>
  <p id="programcompleters1">18</p></li>

I want the strings "Program Completers" and "18". I have tried the solutions from several related answers, but without much luck. One version of my code is:

from bs4 import BeautifulSoup
import urllib2

url="http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg=="
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36',
       'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

req = urllib2.Request(url, headers=hdr)
page = urllib2.urlopen(req)

soup = BeautifulSoup(page)
for tag in soup.find_all('ul'):
    print tag.text, tag.next_sibling

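This returns text, but other parts of the page are also marked up with 'ul'. I have had no success grabbing any text from inside the chart area. How can I retrieve the text I want?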

Thanks for your help!

2 answers:

Answer 0 (score: 0)

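The elements you want are inside an iframe, so they are not part of the outer page's HTML. Try extracting them from the iframe's own page instead:

http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=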

So, this should work:

from bs4 import BeautifulSoup
import urllib2

# URL of the iframe's own page, not the outer report page
url="http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8="
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

# select every <div class="chartcontain"> element
chartcontainers = soup.find_all('div', {"class": "chartcontain"})
for container in chartcontainers:
    print(container)
    # then pull whatever you need out of each container
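
If you would rather find the iframe address from the outer page than copy it by hand, something like the sketch below might help. It is written for Python 3, and the OUTER_HTML string is a made-up stand-in for the outer page (the real markup may differ, and the iframe could also be injected by JavaScript, in which case this would find nothing). The helper name iframe_urls is hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the outer page; the real markup may differ.
OUTER_HTML = """
<html><body>
  <h1>County report</h1>
  <iframe id="report" src="WIAReports/WIA_COUNTY.ASPX?level=county"></iframe>
</body></html>
"""


def iframe_urls(html, base_url):
    """Return the absolute URL of every iframe that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only <iframe> tags that actually carry a src attribute
    return [urljoin(base_url, frame["src"])
            for frame in soup.find_all("iframe", src=True)]


base = "http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx"
urls = iframe_urls(OUTER_HTML, base)
print(urls)
```

Each URL returned this way can then be fetched and parsed on its own, exactly as in the snippet above.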

Answer 1 (score: 0)

As @chosen_codex already pointed out, the data you are looking for is inside an iframe, located at:

http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=

Once that page is parsed into soup, you can access the fields you are interested in:

data = {}
for tag in soup.find_all('p'):
    if tag.get('id'):
        data[tag.get('id')] = tag.text

print(data)

>>> print(data.get('programcompleters1'))
18
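
Since the question asks for both the heading and the number, here is a rough Python 3 sketch that pairs them up. It parses a shortened copy of the markup quoted in the question rather than the live page, and it assumes every li in the listbox holds a p with class listtop followed by a p with an id. The helper name label_value_pairs is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup quoted in the question
# (the long title attribute is left out for brevity).
HTML = """
<ul class="listbox">
 <li class="li1">
  <p style="cursor:help" class="listtop">Program
  Completers</p>
  <p id="programcompleters1">18</p></li>
</ul>
"""


def label_value_pairs(html):
    """Map each heading in a ul.listbox to the value shown beneath it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for box in soup.find_all("ul", class_="listbox"):
        for item in box.find_all("li"):
            label = item.find("p", class_="listtop")
            value = item.find("p", id=True)  # first <p> that has an id
            if label is None or value is None:
                continue  # skip items that do not follow the assumed layout
            # Collapse the line break and extra spaces inside the label text.
            name = " ".join(label.get_text().split())
            pairs[name] = value.get_text(strip=True)
    return pairs


pairs = label_value_pairs(HTML)
print(pairs)
```

Filtering on class_="listbox" also sidesteps the problem from the question, where a plain find_all('ul') picked up unrelated lists elsewhere on the page.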