Web Scraping the WSJ Archive with BS4

Date: 2020-08-29 04:48:21

Tags: python web-scraping beautifulsoup

I'm fairly new to web scraping in general. I've used lxml a bit in the past, and now I'm trying to get more familiar with bs4. Here is what I'm doing:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Website to be scraped:
url = 'https://www.wsj.com/news/archive/2020/08/28'

# HTTP Request:
response = requests.get(url)

# Extract text from response:
html_content = response.text

# Make some soup:
soup = BeautifulSoup(html_content, 'html')

# Extract Data:
for i in soup.find_all("article", {"class":"WSJTheme--story--XB4V2mLz WSJTheme--padding-top-large--2v7uyj-o styles--padding-top-large--3rrHKJPO WSJTheme--padding-bottom-large--2lt6ga_1 styles--padding-bottom-large--2vWCTk2s WSJTheme--border-bottom--s4hYCt0s "}):
  print(i)

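I used those class names in find_all() because that is what I found when inspecting the page on the WSJ site. The page looks very simple: just a bunch of containers holding a topic, a headline, and a date. That is all I need, but when I run the code, it finds nothing.

Any feedback on this would be greatly appreciated.

Thanks!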

1 Answer:

Answer 0 (score: 2)

To get the information from the page, specify the User-Agent HTTP header. Without it, the server returns different HTML.

import requests
from bs4 import BeautifulSoup


url = 'https://www.wsj.com/news/archive/2020/08/28'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for article in soup.select('article'):
    print(article.span.text)
    print(article.h2.text)
    print(article.p.text)
    print('-' * 80)

Prints:

Slideshow
Chadwick Boseman Played Black Icons, Found Fame With ‘Black Panther’
11:20 PM ET
--------------------------------------------------------------------------------
U.S.
George Floyd’s Death Likely Caused by Drug Overdose, Argue Derek Chauvin’s Lawyers
10:59 PM ET
--------------------------------------------------------------------------------
U.S.
Chadwick Boseman, Star of ‘Black Panther,’ Dies of Cancer at 43 
10:39 PM ET
--------------------------------------------------------------------------------
Japan
Abe Will Resign as Japan’s Prime Minister, Citing His Health
10:17 PM ET
--------------------------------------------------------------------------------
Politics
Thousands March on National Mall, Continuing Racial-Justice Push
10:11 PM ET
--------------------------------------------------------------------------------

...and so on.
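
One caveat with the loop above: `article.span`, `article.h2`, or `article.p` is `None` when an article lacks that tag, so `.text` would raise an `AttributeError`. Below is a minimal sketch of a more defensive version that collects the fields into a list of dicts. The helper names (`text_or_none`, `extract_articles`) and the `SAMPLE_HTML` markup are hypothetical, made up for illustration so the snippet runs without a network call; the real WSJ markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real WSJ response.
SAMPLE_HTML = """
<article><span>U.S.</span><h2>First headline</h2><p>10:59 PM ET</p></article>
<article><h2>Headline without a topic</h2><p>10:39 PM ET</p></article>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_articles(html):
    """Collect topic, headline and time from every <article> element."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for article in soup.select('article'):
        rows.append({
            'topic': text_or_none(article.span),   # first <span> in the article
            'headline': text_or_none(article.h2),  # first <h2> in the article
            'time': text_or_none(article.p),       # first <p> in the article
        })
    return rows


rows = extract_articles(SAMPLE_HTML)
for row in rows:
    print(row)
```

With the live page, the same function could presumably be fed `requests.get(url, headers=headers).content` instead of the sample string, and since the question imports pandas, `pd.DataFrame(rows)` would turn the result into a table.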