Python bs4 web scraping only returns empty values

Time: 2020-07-31 17:35:42

Tags: python web-scraping beautifulsoup python-requests

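I'm trying to scrape this website, which contains information about the candidates in an upcoming election.

I'm trying to get the candidate statement and the profile picture, both of which sit inside the "votewa-candidate-page" tag, but whenever I scrape the page I only get empty values.

Here is some of my code: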

import requests
from bs4 import BeautifulSoup

url = 'https://voter.votewa.gov/GenericVoterGuide.aspx?e=865&c=17#/candidates/57369/45923'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')  

statement = soup.find('votewa-candidate-page')

Thanks for your help.

2 answers:

Answer 0: (score: 0)

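Inspecting the site shows that the candidate data is loaded by an AJAX call after the page itself has loaded, so the HTML that requests receives does not contain it.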
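A short sketch of one reason the original request comes back empty: everything after `#` in a URL is a fragment, which the client keeps to itself and never sends to the server. The candidate IDs in the question's URL live in that fragment, so the server only sees `e` and `c`. The snippet below uses only the standard library and makes no network request.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://voter.votewa.gov/GenericVoterGuide.aspx?e=865&c=17#/candidates/57369/45923'
parts = urlsplit(url)

# The query string is the part that is sent to the server.
query = parse_qs(parts.query)

# The fragment stays on the client; the page's JavaScript reads it
# and then requests the candidate data separately.
fragment = parts.fragment

print(query)     # {'e': ['865'], 'c': ['17']}
print(fragment)  # /candidates/57369/45923
```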

The following script prints the information you need:

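import requests

# Endpoint the page calls via AJAX; e, r, b and c match the IDs in the original URL
res = requests.get("https://voter.votewa.gov/elections/candidate.ashx?e=865&r=57369&b=45923&la=&c=17")
res.raise_for_status()  # stop early on an HTTP error

data = res.json()

photo = data[0]['statement']['Photo']          # base64-encoded image
statement = data[0]['statement']['Statement']  # HTML fragment

print(statement)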

I print only the statement, because the photo is a base64-encoded image.

Output:
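If the photo is needed as well, the base64 string can be turned back into raw bytes. The following is a minimal sketch, assuming the `Photo` field holds plain base64 text, possibly with a `data:...;base64,` prefix; the exact format of the field is not shown in this thread. `decode_photo` is a hypothetical helper, and the sample value is a stand-in rather than real data from the site.

```python
import base64


def decode_photo(photo_b64: str) -> bytes:
    """Decode a base64 string to raw bytes, dropping an optional data-URI prefix."""
    if photo_b64.startswith('data:') and ',' in photo_b64:
        photo_b64 = photo_b64.split(',', 1)[1]
    return base64.b64decode(photo_b64)


# Stand-in value for illustration only; not real data from the site.
sample = base64.b64encode(b'not a real image').decode('ascii')
image_bytes = decode_photo(sample)
print(image_bytes)

# With the real response, the bytes could be written to disk, for example:
# with open('candidate_photo.jpg', 'wb') as f:  # the file extension is a guess
#     f.write(decode_photo(photo))
```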

"<p><strong>Elected Experience\n</strong><br />\nUnited States Representative, 2012-Present. Ways and Means Committee and Select Committee to Modernize Congress.\n</p>\n<p><strong>Other Professional Experience</strong><br />\nSuccessful career as a businesswoman and entrepreneur. Former Microsoft executive, led local high-tech startups. Former Director of Washington&rsquo;s Department of Revenue, where I led efforts to simplify the tax system and help small businesses.\n</p>\n<p><strong>Education\n</strong><br />\nB.A., Biology, Reed College; M.B.A., University of Washington.\n</p>\n<p><strong>Community Service\n</strong><br />\nI&rsquo;ve mentored students at UW Business School; been active in my church, serving as a board member. Volunteered with the PTA, Girl Scouts and YWCA, supporting transitional housing, job training and services to help families get back on their feet.\n</p>\n<p><strong>Statement\n</strong><br />\nDuring this pandemic, families across the 1st Congressional District are struggling and concerned about the future. Now more than ever, we need strong leadership that&rsquo;s focused on helping those in need, protecting health and safety and restoring our economy. As your Congresswoman, I am determined that we come through this difficult stretch stronger than ever. My focus is on putting partisanship aside and delivering results.\n</p>\n<p>The first known COVID-19 case struck in Washington State before anywhere else. President Trump denied the threat and wasted precious time. By contrast, I moved quickly, securing funds to backfill state and local public health accounts. Washington State immediately received over $11 million, with continued ongoing support. I also advocated for employee retention tax credits to keep an estimated 60 million people employed with benefits. They were incorporated into the pandemic relief Heroes Act.\n</p>\n<p>As the economy reopens we face an uncertain recovery. We need smart, decisive action to restore our economy. 
My background as a successful businesswoman and entrepreneur means I understand how to bring businesses back and create jobs.\n</p>\n<p>My proposal to expand child tax credits, which could reduce child poverty 38 percent, has been incorporated into pandemic relief legislation. I also developed provisions, adopted last year, to make the process of applying for financial aid easier for students. I&rsquo;ve pushed to expand farmers&rsquo; access to markets, improve access to broadband, and to increase the supply of affordable housing.\n</p>\n<p>My core values remain unchanged. As I have done from the day I took office, I'll protect Social Security, Medicare and a woman&rsquo;s right to choose. I have endorsements from Democratic groups, labor, local leaders and many others.\n</p>\n<p>The fallout from this pandemic is challenging, but I'm committed to putting people back to work and preserving the middle-class. I ask for your support.</p>"

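Answer 1: (score: 0)

I assume you don't want to look up the JSON URL by hand every time, so here is code that takes the URL from your example, builds the JSON endpoint from it automatically, and prints the statement.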

import requests
from bs4 import BeautifulSoup

url = 'https://voter.votewa.gov/GenericVoterGuide.aspx?e=865&c=17#/candidates/57369/45923'

# Build the AJAX endpoint from the page URL: e and c come from the query string,
# r and b are the last two segments of the fragment after '#'
converted = 'https://voter.votewa.gov/elections/candidate.ashx?e=' \
            f'{url.split("?e=")[1].split("&")[0]}&r={url.split("/")[-2]}&b={url.split("/")[-1]}&la=&c={url.split("&c=")[1].split("#")[0]}'

page = requests.get(converted)

data = page.json()

statement = data[0]['statement']['Statement']

soup = BeautifulSoup(statement, 'html.parser')

print(*[p.text for p in soup.select('p')])

Prints:

Elected Experience

United States Representative, 2012-Present. Ways and Means Committee and Select Committee to Modernize Congress.
 Other Professional Experience
Successful career as a businesswoman and entrepreneur. Former Microsoft executive, led local high-tech startups. Former Director of Washington’s Department of Revenue, where I led efforts to simplify the tax system and help small businesses.
 Education

B.A., Biology, Reed College; M.B.A., University of Washington.
 Community Service

And so on...
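The chained `split` calls in this answer depend on the exact order of the query parameters. As a hedged alternative, the same conversion can be sketched with `urllib.parse`. `build_candidate_url` is a hypothetical helper name; the endpoint path and the parameter names `e`, `r`, `b`, `la` and `c` are taken from the answers above, and the sketch assumes the fragment always ends in two ID segments.

```python
from urllib.parse import urlsplit, parse_qs, urlencode


def build_candidate_url(page_url: str) -> str:
    """Map a voter-guide page URL to the JSON endpoint used in the answers."""
    parts = urlsplit(page_url)
    query = parse_qs(parts.query)

    # The fragment looks like '/candidates/57369/45923';
    # its last two segments become the r and b parameters.
    r_value, b_value = parts.fragment.strip('/').split('/')[-2:]

    params = {
        'e': query['e'][0],
        'r': r_value,
        'b': b_value,
        'la': '',
        'c': query['c'][0],
    }
    return f'{parts.scheme}://{parts.netloc}/elections/candidate.ashx?{urlencode(params)}'


page_url = 'https://voter.votewa.gov/GenericVoterGuide.aspx?e=865&c=17#/candidates/57369/45923'
endpoint = build_candidate_url(page_url)
print(endpoint)
```

For the URL in the question this yields the same endpoint that Answer 0 requests directly, and it keeps working if the order of `e` and `c` in the query string changes.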