I am scraping URLs from the web page below:
from bs4 import BeautifulSoup
import requests
url = "https://www.investing.com/search/?q=Axon&tab=news"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, "html.parser")
for s in soup.find_all('div', {'class': 'articleItem'}):
    for a in s.find_all('div', {'class': 'textDiv'}):
        for b in a.find_all('a', {'class': 'title'}):
            print(b.get('href'))
The output looks like this:
/news/stock-market-news/axovant-updates-on-parkinsons-candidate-axolentipd-1713474
/news/stock-market-news/digital-alley-up-24-on-axon-withdrawal-from-patent-challenge-1728115
/news/stock-market-news/axovant-sciences-misses-by-009-763209
/analysis/microns-mu-shares-gain-on-q3-earnings-beat-upbeat-guidance-200529289
/analysis/axon,-espr,-momo,-zyne-200182141
/analysis/factors-likely-to-impact-axon-enterprises-aaxn-q4-earnings-200391393
{{link}}
{{link}}
The problem is:
Is it possible to solve the two issues above?
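For reference, the printed hrefs are site-relative paths. A minimal sketch of turning them into absolute URLs with the standard library's urllib.parse.urljoin, assuming the site root from the question's URL, and skipping the unrendered {{link}} placeholders:

```python
from urllib.parse import urljoin

base = "https://www.investing.com"  # assumed site root, taken from the question's URL
paths = [
    "/news/stock-market-news/axovant-sciences-misses-by-009-763209",
    "/analysis/axon,-espr,-momo,-zyne-200182141",
    "{{link}}",  # unrendered template placeholder, as seen in the output above
]
# Keep only real relative paths; template placeholders don't start with "/".
absolute = [urljoin(base, p) for p in paths if p.startswith("/")]
for url in absolute:
    print(url)
```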
Answer 0: (score: 0)
That's because you are making a plain HTTP request, while the site renders its content with JavaScript. To parse JS-rendered content, you have to use a library that can make the request and then render the JavaScript. Try the requests_html module: pypi.org/project/requests-html
Answer 1: (score: 0)
One way to solve this is to use Selenium:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Once Selenium has scrolled to the bottom of the page, you read the page source, close Selenium, and parse the page source with BeautifulSoup. You can also do the parsing with Selenium itself.
First, Selenium plus bs4:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

PAUSE_TIME = 1

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get('https://www.investing.com/search/?q=Axon&tab=news')

# Keep scrolling until the page height stops growing.
lh = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(PAUSE_TIME)
    nh = driver.execute_script("return document.body.scrollHeight")
    if nh == lh:
        break
    lh = nh

pagesource = driver.page_source
driver.close()

soup = BeautifulSoup(pagesource, "html.parser")
for s in soup.find_all('div', {'class': 'articleItem'}):
    for a in s.find_all('div', {'class': 'textDiv'}):
        for b in a.find_all('a', {'class': 'title'}):
            print(b.get('href'))
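As an aside, the three nested find_all loops can be collapsed into one descendant CSS selector with BeautifulSoup's select. A sketch on a hypothetical static snippet that mimics the structure the loops assume:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same nesting the loops above expect.
html = """
<div class="articleItem"><div class="textDiv">
  <a class="title" href="/news/example-1">One</a>
</div></div>
<div class="articleItem"><div class="textDiv">
  <a class="title" href="/analysis/example-2">Two</a>
</div></div>
"""
soup = BeautifulSoup(html, "html.parser")
# One selector replaces the three nested find_all loops.
hrefs = [a.get("href") for a in soup.select("div.articleItem div.textDiv a.title")]
print(hrefs)
```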
Selenium-only version:
from selenium import webdriver
import time

PAUSE_TIME = 1

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get('https://www.investing.com/search/?q=Axon&tab=news')

# Keep scrolling until the page height stops growing.
lh = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(PAUSE_TIME)
    nh = driver.execute_script("return document.body.scrollHeight")
    if nh == lh:
        break
    lh = nh

for s in driver.find_elements_by_css_selector('div.articleItem'):
    for a in s.find_elements_by_css_selector('div.textDiv'):
        for b in a.find_elements_by_css_selector('a.title'):
            print(b.get_attribute('href'))
driver.close()
Note that you have to install selenium and download geckodriver to run this. If your geckodriver lives in a different location, change c:/program in:
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
to your geckodriver path.
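If successive scroll passes surface the same articles more than once, an order-preserving de-duplication of the collected hrefs may help. A purely illustrative sketch (the names and sample data are assumptions, not from the answer above):

```python
# Order-preserving de-duplication using dict.fromkeys, which keeps
# first-insertion order of keys and drops repeats.
def dedupe(links):
    return list(dict.fromkeys(links))

scraped = [
    "/news/example-1",
    "/analysis/example-2",
    "/news/example-1",  # repeat surfaced by a later scroll pass
]
unique = dedupe(scraped)
print(unique)
```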