I want to extract the URLs of all the news articles from a website. This is what I did:
from bs4 import BeautifulSoup
import requests
url1 = "https://www.wsj.com/search/term.html?KEYWORDS=apple&mod=searchresults_viewallresults"
r1 = requests.get(url1)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
coverpage_news = soup1.find_all("h3", class_="headline")
coverpage_news['href']
But it returns no results. Any help would be appreciated. Thanks.
Answer 0 (score: 2)
You can use Selenium with PhantomJS to load the page and then scrape it. (Note: PhantomJS has since been deprecated and removed from newer Selenium releases; a headless Chrome or Firefox driver can be used the same way.)
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

url = 'https://www.wsj.com/search/term.html?KEYWORDS=apple&mod=searchresults_viewallresults'
browser = webdriver.PhantomJS(executable_path='D:/Programowanie/phantomjs-2.1.1-windows/bin/phantomjs.exe')
browser.get(url)

# wait until at least one headline is present before reading the page source
WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.CLASS_NAME, "headline")))

html = browser.page_source
page_soup = BeautifulSoup(html, 'html5lib')

coverpage_news = page_soup.find_all("h3", class_="headline")
for news in coverpage_news:
    # the href lives on the <a> nested inside each <h3>
    for a in news.find_all('a', href=True):
        print(a['href'])
Result:
/articles/the-stock-market-is-a-strong-election-day-predictor-11599490800?mod=searchresults&page=1&pos=1
/articles/remote-schools-hidden-cost-parents-quit-work-to-teach-prompting-new-recession-woes-11599487201?mod=searchresults&page=1&pos=2
/articles/why-billy-porter-takes-breaks-from-the-news-11599481884?mod=searchresults&page=1&pos=3
/articles/sudden-volatility-in-tech-stocks-unnerves-investors-11599471001?mod=searchresults&page=1&pos=4
/articles/samsung-verizon-sign-6-65-billion-5g-contract-11599469883?mod=searchresults&page=1&pos=5
/articles/where-danger-lurks-in-the-big-tech-rally-11599397200?mod=searchresults&page=1&pos=6
/articles/fortnite-maker-asks-judge-again-to-return-game-to-apples-app-store-11599319938?mod=searchresults&page=1&pos=7
/articles/united-airlines-draftkings-facebook-stocks-that-defined-the-week-11599261143?mod=searchresults&page=1&pos=8
/articles/starved-for-sports-viewers-flock-to-nba-nhl-11599259896?mod=searchresults&page=1&pos=9
/articles/global-stock-markets-dow-update-9-04-2020-11599192206?mod=searchresults&page=1&pos=10
/articles/most-businesses-were-unprepared-for-covid-19-dominos-delivered-11599234424?mod=searchresults&page=1&pos=11
/articles/readers-favorite-summer-recipes-11599238648?mod=searchresults&page=1&pos=12
/articles/softbanks-bet-on-tech-giants-fueled-powerful-market-rally-11599232205?mod=searchresults&page=1&pos=13
/articles/juul-shelves-plan-for-feature-that-counts-puffs-11599211801?mod=searchresults&page=1&pos=14
/articles/apple-still-wears-the-market-crown-it-can-easily-slip-11599231617?mod=searchresults&page=1&pos=15
/articles/eat-a-peach-review-pressure-cooker-11599229971?mod=searchresults&page=1&pos=16
/articles/starting-can-be-the-hardest-part-11599229216?mod=searchresults&page=1&pos=17
/articles/bumbles-buzz-wont-sting-match-11599217202?mod=searchresults&page=1&pos=18
/articles/how-options-market-amateurs-might-have-tripped-up-big-tech-11599213817?mod=searchresults&page=1&pos=19
/articles/global-stock-markets-dow-update-9-03-2020-11599125940?mod=searchresults&page=1&pos=20
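As a side note, the printed hrefs are site-relative paths; to request one of them you would normally resolve it against the site root first. A minimal sketch using only the standard library (the base URL and sample path are taken from the output above):

```python
from urllib.parse import urljoin

base = "https://www.wsj.com"

# one of the relative paths printed above
path = "/articles/the-stock-market-is-a-strong-election-day-predictor-11599490800?mod=searchresults&page=1&pos=1"

# urljoin resolves the relative path against the site root
full_url = urljoin(base, path)
print(full_url)
# -> https://www.wsj.com/articles/the-stock-market-is-a-strong-election-day-predictor-11599490800?mod=searchresults&page=1&pos=1
```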
Answer 1 (score: 0)
The page displays the headlines even with JavaScript turned off in my web browser, so it can be scraped without Selenium, but your code has three problems/bugs.
First: this server checks the User-Agent header, so you have to send one that imitates a real browser:

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"
}

url = "https://www.wsj.com/search/term.html?KEYWORDS=apple&mod=searchresults_viewallresults"
r = requests.get(url, headers=headers)
Second: coverpage_news is a list (a bs4 ResultSet), so you have to iterate over it with a for loop:

for item in coverpage_news:
    print(item)
Third: the href is not on the <h3> but on the <a> nested inside the <h3>, so you have to go through .a:

for item in coverpage_news:
    print(item.a['href'])
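The difference between the two lookups can be seen on a self-contained snippet (the HTML fragment below is a made-up stand-in for the WSJ markup):

```python
from bs4 import BeautifulSoup

# hypothetical fragment mimicking the WSJ headline markup
html = '<h3 class="headline"><a href="/articles/demo">Demo headline</a></h3>'
soup = BeautifulSoup(html, "html.parser")

item = soup.find("h3", class_="headline")
print(item.get("href"))  # None - the <h3> itself has no href attribute
print(item.a["href"])    # /articles/demo - the nested <a> has it
```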
Minimal working code:
from bs4 import BeautifulSoup
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"
}
url = "https://www.wsj.com/search/term.html?KEYWORDS=apple&mod=searchresults_viewallresults"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')
print('--- using find ---')
coverpage_news = soup.find_all("h3", class_="headline")
for item in coverpage_news:
    print(item.a['href'])
print('--- using CSS selector ---')
coverpage_news = soup.select("h3.headline a")
for a in coverpage_news:
    print(a['href'])
By the way: I also tested the code with 'lxml' and 'html.parser' instead of 'html5lib', and they could not find the elements in the HTML. It shows that the parsers work differently, which can sometimes cause problems.
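You can see these parser differences without hitting the website by feeding the same deliberately invalid fragment to each parser. lxml and html5lib are optional third-party installs, so the sketch below simply skips any parser that is missing:

```python
from bs4 import BeautifulSoup

# a deliberately invalid fragment: </p> is closed but never opened,
# and <a> is never closed - each parser repairs this in its own way
fragment = "<a></p>"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(parser, "->", BeautifulSoup(fragment, parser))
    except Exception:
        print(parser, "is not installed")
```

Each installed parser prints a differently repaired tree (for example, html5lib wraps the fragment in a full html/head/body skeleton), which is why switching parsers can make previously findable elements disappear.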