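Attempting web scraping: trying to write a function that spits out each country's population. I am trying to scrape the US Census Bureau website, but I cannot get the correct information back.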
https://www.census.gov/popclock/world/af
<div id ="basic-facts" class = "data-cell">
<div class = "data-contianer">
<div class="data-cell" style = "background-image: url.....">
<p>population</p>
<h2 data-population="">35.8M</h2>
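This is basically the markup I am trying to scrape. What I want is the "35.8M".
I have tried a few approaches, but all I get back is the heading itself, "data-population", with no data in it.
Someone mentioned to me that the website may serve the value in a form that cannot be scraped. In my experience, when scraping is blocked the markup looks different: the value sits in an image or a dynamic item, or is otherwise made harder to scrape. Does anyone have any ideas about this?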
# -*- coding: utf-8 -*-
# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup
### country naming issues: In the URLs on the website the countries are coded with
### a two-letter code # "au" = australia, "in" = india. If we were to search for a
### country name or something like that we would need to have something to relate
### the country name to the two-letter code so it can search for it.
country = 'albania'
# Map each country name to its two-letter code
countrycode = {'albania': 'al', 'afghanistan': 'af'}
### this would take long to write
### it all out, maybe it's possible to scrape these names?
# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/' + countrycode[country]
# Send request to retrieve the web-page using the
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)
# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# html.parser is used since our page is in HTML format
soup=BeautifulSoup(page.content,"html.parser")
################################################################ start of the part I'm unsure about
# Locate element on page to be scraped
# find() locates the element in the BeautifulSoup object
# 1. First method
population = soup.find(id="basic-facts", class_="data-cell")
# I tried some methods like this and got only errors; class= is a syntax
# error because class is a reserved word, so the keyword is spelled class_
# 2. Second method
population = soup.find_all("h2", {"data-population": ""})
for i in population:
    print(i)
# this returns the headings for the table but no data
### here we need to take out the population data
### it is listed as "<h2 data-population = "" >35.8</h2>"
################################################################ end
# Extract text from the selected BeautifulSoup object using .text
# find_all() returns a list, so take the text of the first match (if any)
population = population[0].text if population else ''
#Final Output
#Return Scraped info
print('The population of ' + country + ' is ' + population)
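I marked off the part of the code in question with #######. I tried several approaches and have listed two of them.
Overall I am fairly new to coding, so please forgive me if I have not described this well. Thanks in advance for any advice.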
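Answer 0 (score: 1)
The value is retrieved dynamically through an API call, which you can find in the browser's Network tab. Because you are not using a browser, you will need to make that request directly.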
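To illustrate why parsing the page finds the heading but no number, here is a minimal sketch. It assumes the server sends the `<h2>` empty and JavaScript fills it in later; the `STATIC_HTML` string is a hypothetical stand-in for that response, and the standard library's `HTMLParser` is used in place of BeautifulSoup only so the snippet runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the HTML the server sends before any JavaScript runs.
# Assumption: the <h2> arrives empty and the browser fills in "35.8M" afterwards.
STATIC_HTML = """
<div id="basic-facts" class="data-cell">
  <p>population</p>
  <h2 data-population=""></h2>
</div>
"""


class PopulationHeadingParser(HTMLParser):
    """Collects the text inside every <h2> that has a data-population attribute."""

    def __init__(self):
        super().__init__()
        self.inside_heading = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and "data-population" in dict(attrs):
            self.inside_heading = True
            self.texts.append("")  # one entry per matching heading

    def handle_endtag(self, tag):
        if tag == "h2":
            self.inside_heading = False

    def handle_data(self, data):
        if self.inside_heading:
            self.texts[-1] += data


parser = PopulationHeadingParser()
parser.feed(STATIC_HTML)
# One heading is found, but its text is empty because nothing has filled it in
print(parser.texts)
```

If the real response looks like this, no choice of selector will recover the number from the HTML alone, which is consistent with getting the heading back with no data.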
import requests
r = requests.get('https://www.census.gov/popclock/apiData_pop.php?get=POP,MPOP0_4,MPOP5_9,MPOP10_14,MPOP15_19,MPOP20_24,MPOP25_29,MPOP30_34,MPOP35_39,MPOP40_44,MPOP45_49,MPOP50_54,MPOP55_59,MPOP60_64,MPOP65_69,MPOP70_74,MPOP75_79,MPOP80_84,MPOP85_89,MPOP90_94,MPOP95_99,MPOP100_,FPOP0_4,FPOP5_9,FPOP10_14,FPOP15_19,FPOP20_24,FPOP25_29,FPOP30_34,FPOP35_39,FPOP40_44,FPOP45_49,FPOP50_54,FPOP55_59,FPOP60_64,FPOP65_69,FPOP70_74,FPOP75_79,FPOP80_84,FPOP85_89,FPOP90_94,FPOP95_99,FPOP100_&key=&YR=2019&FIPS=af').json()
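# Pair each field name in r[0] with its value in r[1]
data = list(zip(r[0],r[1]))
# The first pair is POP; convert the raw count to millions with one decimal place
print(round(int(data[0][1])/1_000_000,1))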
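Since the question asks for a function per country, the request above could be wrapped along these lines. This is only a sketch: `build_url`, `rows_to_dict` and `format_millions` are hypothetical helper names, the query parameters are copied from the URL above, and whether the endpoint accepts a shorter `get` list such as just `POP` is an assumption. The sample payload is made up so the snippet runs offline.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above
API_URL = "https://www.census.gov/popclock/apiData_pop.php"


def build_url(fips_code, year=2019, fields=("POP",)):
    """Build the query URL for one country code (parameter names from the URL above)."""
    query = {"get": ",".join(fields), "key": "", "YR": year, "FIPS": fips_code}
    # safe="," keeps the commas in the field list unescaped
    return API_URL + "?" + urlencode(query, safe=",")


def rows_to_dict(payload):
    """Pair the first row (field names) with the second row (values)."""
    header, values = payload[0], payload[1]
    return dict(zip(header, values))


def format_millions(raw_population):
    """Turn a raw count such as '35800000' into a string such as '35.8M'."""
    return f"{round(int(raw_population) / 1_000_000, 1)}M"


# Hypothetical sample shaped like the response used above: a row of field
# names followed by one row of string values. The numbers are made up.
sample = [["POP", "MPOP0_4", "FPOP0_4"], ["35800000", "2900000", "2800000"]]

record = rows_to_dict(sample)
print(build_url("af"))
print(format_millions(record["POP"]))
```

With a live connection, `requests.get(build_url("al")).json()` could be passed to `rows_to_dict` in place of `sample`, provided the response has the same two-row shape.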