Web scraping census data

Date: 2018-12-15 21:45:16

Tags: python pandas web-scraping beautifulsoup census

I am trying to scrape census-based data from the first table in the Educational Attainment section of the Statistical Atlas website. Essentially, I want to scrape the percentages from that table and add them to a data frame that has zip codes in the leftmost column and separate columns for HS, no HS, and higher degrees. I am trying to do this for all New York City zip codes.
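Concretely, the frame I am after would look something like this (the column names and percentages here are made up purely for illustration):

```python
import pandas as pd

# Hypothetical example of the target data frame: zip codes in the
# leftmost column, one column per education category.
rows = [
    ("10001", 38.0, 9.0, 53.0),   # illustrative numbers only
    ("10002", 40.0, 25.0, 35.0),
]
df = pd.DataFrame(rows, columns=["zipcode", "HS", "NoHS", "HD"])
print(df)
```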

Here is the code I have so far. Could you help me optimize it so that I can loop over all New York City zip codes and build a data frame with a column for each education category from the first table for every zip code?

Here is the link to the Statistical Atlas: https://statisticalatlas.com/place/New-York/New-York/Overview

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Spreadsheet containing the target zip codes
file_name = ('C:/Users/Nicholas_G/Desktop/Google Drive/Work/Free '
             'Lance/Political Targeting/Census Data.xlsx')
sheet_name = 'NYC Zip Target'
Census_Data = pd.read_excel(file_name, sheet_name=sheet_name)

zip_list = list(Census_Data['RESIDENTIAL_ZIP'])

url = "https://statisticalatlas.com/place/New-York/New-York/Overview"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
l = []

for a in zip_list:
    r = requests.get(f"https://statisticalatlas.com/zip/{a}/Educational-Attainment")
    s = BeautifulSoup(r.text, 'lxml')
    data = s.find('svg', {'viewBox': '0 0 400 79'})
    value = data.find('svg', {'fill': '#000'})
    l.append(value)
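Whichever selector ends up working, the percentage labels come out of the SVG as strings such as `'85.2%'`. A small helper (the name `pct_to_float` is my own, hypothetical) can convert them to floats before they go into the data frame:

```python
def pct_to_float(text):
    # Strip surrounding whitespace, the trailing '%', and any
    # thousands separators, then parse the remainder as a float.
    return float(text.strip().rstrip('%').replace(',', ''))

print(pct_to_float('85.2%'))  # 85.2
```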

1 Answer:

Answer 0: (score: 0)

I'm not very familiar with multiprocessing, otherwise I would have gone that route, but here is my version using a Session:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urlMain = 'https://statisticalatlas.com/place/New-York/New-York/Overview'
urlAttainment = 'https://statisticalatlas.com/zip/{}/Educational-Attainment'

def getPercentages(session, url):
    res = session.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.content, "lxml")
        percentages = soup.select('[id="figure/educational-attainment"] rect title')
        percentages = [percentages[0].text, percentages[2].text, percentages[4].text]
        return percentages
    else:
        print(res.status_code, url)
        return []

def getCodes(session, url):
    res = session.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    codes = [code.text for code in soup.select('.info-table-contents-div a[href*=zip]')]
    return codes

results = []

# Pass the session into the helpers so every request reuses
# the same connection instead of opening a new one each time.
with requests.Session() as s:
    zipcodes = getCodes(s, urlMain)

    for zipcode in zipcodes:
        try:
            row = getPercentages(s, urlAttainment.format(zipcode))
            row.insert(0, zipcode)
            results.append(row)
        except IndexError as ex:
            print(ex, urlAttainment.format(zipcode))

df = pd.DataFrame(results, columns=['zipcode', 'HD', 'HS', 'NoHS'])
print(df)
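On the multiprocessing point: since this workload is I/O-bound (waiting on HTTP responses), a thread pool is usually enough. Here is a minimal sketch of the idea, with the network call replaced by a placeholder `fetch` function (a real version would call something like the `getPercentages` helper above for each zip code):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(zipcode):
    # Placeholder standing in for an HTTP request that scrapes the
    # percentages for one zip code; values here are dummies.
    return [zipcode, "10%", "30%", "60%"]

zipcodes = ["10001", "10002", "10003"]

# map() preserves input order, so each row lines up with its zip code.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, zipcodes))
print(results)
```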