Overall, I'm fairly new to web scraping. I've used lxml a bit in the past, and now I'm trying to get more comfortable with bs4. Here is what I'm doing:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Website to be scraped:
url = 'https://www.wsj.com/news/archive/2020/08/28'
# HTTP Request:
response = requests.get(url)
# Extract text from response:
html_content = response.text
# Make some soup:
soup = BeautifulSoup(html_content, 'html')
# Extract Data:
for i in soup.find_all("article", {"class":"WSJTheme--story--XB4V2mLz WSJTheme--padding-top-large--2v7uyj-o styles--padding-top-large--3rrHKJPO WSJTheme--padding-bottom-large--2lt6ga_1 styles--padding-bottom-large--2vWCTk2s WSJTheme--border-bottom--s4hYCt0s "}):
    print(i)
The reason I'm using those classes in the find_all() function is that that's what I got from the WSJ site after inspecting the page. The page looks pretty simple, just a bunch of containers holding a topic, a headline, and a date. That's all I need, but when I run the code, it finds nothing.
Any feedback on this would be greatly appreciated.
Thanks!
Answer 0 (score: 2)
To get the information from the page, specify a User-Agent HTTP header. Without it, the server returns different HTML.
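To confirm that the header is what matters, here is a quick comparison (my own sketch, not part of the original answer) that fetches the page with and without the header and counts the <article> tags in each response:

import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/news/archive/2020/08/28'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}

# Fetch the archive page twice: once as a bare request, once with a browser-like User-Agent.
plain = requests.get(url)
with_ua = requests.get(url, headers=headers)

# Count how many <article> tags each response contains.
print(len(BeautifulSoup(plain.text, 'html.parser').find_all('article')))
print(len(BeautifulSoup(with_ua.text, 'html.parser').find_all('article')))

With the header in place, the archive entries can be parsed directly: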
import requests
from bs4 import BeautifulSoup
url = 'https://www.wsj.com/news/archive/2020/08/28'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for article in soup.select('article'):
    print(article.span.text)
    print(article.h2.text)
    print(article.p.text)
    print('-' * 80)
Prints:
Slideshow
Chadwick Boseman Played Black Icons, Found Fame With ‘Black Panther’
11:20 PM ET
--------------------------------------------------------------------------------
U.S.
George Floyd’s Death Likely Caused by Drug Overdose, Argue Derek Chauvin’s Lawyers
10:59 PM ET
--------------------------------------------------------------------------------
U.S.
Chadwick Boseman, Star of ‘Black Panther,’ Dies of Cancer at 43
10:39 PM ET
--------------------------------------------------------------------------------
Japan
Abe Will Resign as Japan’s Prime Minister, Citing His Health
10:17 PM ET
--------------------------------------------------------------------------------
Politics
Thousands March on National Mall, Continuing Racial-Justice Push
10:11 PM ET
--------------------------------------------------------------------------------
...and so on.
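Since the question already imports pandas, one possible follow-up (a sketch of my own, not part of the original answer) is to collect the topic, headline, and time into a DataFrame instead of printing them:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/news/archive/2020/08/28'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# Collect (topic, headline, time) for each archive entry into a list of rows.
rows = []
for article in soup.select('article'):
    rows.append({
        'topic': article.span.text if article.span else None,
        'headline': article.h2.text if article.h2 else None,
        'time': article.p.text if article.p else None,
    })

df = pd.DataFrame(rows)
print(df.head())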