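Web scraping: can't retrieve what I want

Time: 2019-10-04 18:45:13

Tags: python web-scraping beautifulsoup

I'm trying to do some web scraping. The goal is a function that returns the population of each country. I'm trying to scrape the U.S. Census Bureau site, but I can't get the right information back.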

https://www.census.gov/popclock/world/af

<div id="basic-facts" class="data-cell">
<div class="data-contianer">
   <div class="data-cell" style="background-image: url.....">
      <p>population</p>
      <h2 data-population="">35.8M</h2>

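This is basically the markup I'm trying to scrape. What I want is the "35.8M".

I've tried several approaches, but all I get back is the heading with "data-population" and no data.

Someone mentioned to me that the website may serve the value in a form that can't be scraped. In my experience, when scraping is blocked the markup looks different: the value sits in an image or a dynamic element, or is otherwise made harder to scrape. Does anyone have any thoughts on this?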
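
One way to test that hypothesis is to look at what the server actually sends, before any JavaScript runs. The following is only a minimal sketch: it uses the standard library's html.parser in place of BeautifulSoup, and the class and function names (PopulationHeadingParser, population_headings) are made up for illustration.

```python
from html.parser import HTMLParser


class PopulationHeadingParser(HTMLParser):
    """Collects the text inside every <h2> that carries a data-population attribute."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside a matching <h2>
        self.texts = []       # one entry per matching <h2>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lower-cased names
        if tag == "h2" and any(name == "data-population" for name, _ in attrs):
            self._inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data


def population_headings(html):
    """Return the stripped text of each <h2 data-population> found in the HTML."""
    parser = PopulationHeadingParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]
```

Passing the raw response body (for example requests.get(url).text) to population_headings() shows what a scraper sees. If every entry comes back as an empty string, the number is not in the static HTML and is presumably filled in later by a script, which would match the symptom described above.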

# -*- coding: utf-8 -*-

# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup

### Country naming issue: in the URLs on the website, countries are identified
### by a two-letter code ("au" = australia, "in" = india). To search by country
### name we need something that relates the country name to its two-letter
### code.

country = 'albania'
countrycode = {'albania': 'al', 'afghanistan': 'af'}
### Writing the whole mapping out by hand would take a long time;
### maybe it's possible to scrape these names?
# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/' + countrycode[country]
# Send request to retrieve the web-page using the 
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)

# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# html.parser is used since our page is in HTML format

soup = BeautifulSoup(page.content, "html.parser")
################################################################ Start of the part I'm unsure about
# Locate element on page to be scraped
# find() locates the element in the BeautifulSoup object

# 1. First method
# (class is a reserved word in Python, so BeautifulSoup takes class_ instead)
population = soup.find(id="basic-facts", class_="data-cell")
# I tried some methods like this. Got only errors

# 2. Second method
population = soup.find_all("h2", {"data-population": ""})
for i in population:
    print(i)

# This returns the headings for the table but no data

### Here we need to take out the population data
### It is listed as '<h2 data-population="">35.8M</h2>'
################################################################ End
# Extract text from the first matched element using .text
population = population[0].text if population else ''

# Final output
# Return scraped info

print('The population of ' + country + ' is ' + population)

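I've marked off the relevant part of the code with #######. I tried several approaches; I've listed two of them.

Overall I'm pretty new to coding, so please forgive me if I haven't described this well. Thanks, everyone, for any suggestions.

1 Answer:

Answer 0: (score: 1)

The value is retrieved dynamically from an API call, which you can find in the browser's Network tab. Since you are not using a browser, you will need to make that request directly.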

import requests

r = requests.get('https://www.census.gov/popclock/apiData_pop.php?get=POP,MPOP0_4,MPOP5_9,MPOP10_14,MPOP15_19,MPOP20_24,MPOP25_29,MPOP30_34,MPOP35_39,MPOP40_44,MPOP45_49,MPOP50_54,MPOP55_59,MPOP60_64,MPOP65_69,MPOP70_74,MPOP75_79,MPOP80_84,MPOP85_89,MPOP90_94,MPOP95_99,MPOP100_,FPOP0_4,FPOP5_9,FPOP10_14,FPOP15_19,FPOP20_24,FPOP25_29,FPOP30_34,FPOP35_39,FPOP40_44,FPOP45_49,FPOP50_54,FPOP55_59,FPOP60_64,FPOP65_69,FPOP70_74,FPOP75_79,FPOP80_84,FPOP85_89,FPOP90_94,FPOP95_99,FPOP100_&key=&YR=2019&FIPS=af').json()

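# r[0] holds the column names and r[1] the matching values; pair them up
data = list(zip(r[0], r[1]))
# The first pair is expected to be POP (the first field requested);
# convert it to millions, rounded to one decimal place
print(round(int(data[0][1]) / 1_000_000, 1))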
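
To get back to the original goal of a per-country function, the same request can be parameterised. This is a sketch under stated assumptions, not a definitive implementation: it assumes the endpoint also accepts get=POP on its own, that the response keeps the [column names, values] shape used above, and that the two-letter code in the page URL is the one expected by the FIPS parameter. It uses the standard library's urllib instead of requests, and the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.census.gov/popclock/apiData_pop.php"


def build_url(fips, year=2019):
    """Build the API URL for one country, requesting only the POP column."""
    # Assumption: the endpoint accepts a single field in "get"
    query = urlencode({"get": "POP", "key": "", "YR": year, "FIPS": fips})
    return API_URL + "?" + query


def parse_population(payload):
    """Read POP from a [column_names, values] payload and return it as an int."""
    header, values = payload[0], payload[1]
    return int(dict(zip(header, values))["POP"])


def format_millions(count):
    """Format a head count in millions with one decimal place, e.g. 35.8M."""
    return f"{count / 1_000_000:.1f}M"


def fetch_population(fips, year=2019):
    """Fetch and parse the population for one country code (needs network access)."""
    with urlopen(build_url(fips, year)) as response:
        return parse_population(json.load(response))
```

With network access, format_millions(fetch_population('af')) would be the call for Afghanistan. Looking the value up by column name rather than by position means the code does not depend on the order in which the API returns the fields.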