Hi, I followed and understood this article on reading content from a website, and it works fine: geeksforgeeks.org: Reading selected webpage content using Python Web Scraping
But when I change the code to use it with another site, it returns nothing. I am trying to get the values Value1, Value2, etc., as shown below.
Please note: reading content from this webpage is legal.
import requests
from bs4 import BeautifulSoup

# the target we want to open
url = 'https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at'

# open with GET method
resp = requests.get(url)

# HTTP response 200 means OK status
if resp.status_code == 200:
    print("Successfully opened the web page")
    print("The news are as follow :-\n")

    # we need a parser; Python's built-in HTML parser is enough
    soup = BeautifulSoup(resp.text, 'html.parser')

    # l is the element which contains all the text, i.e. the news
    l = soup.find("tr", "spec-directory-entry daisy-table__row fade fade--show")

    # now we want to print only the text part of each anchor
    for i in l:
        print(i.text)
else:
    print("Error")
Here is the site's source code:
<tr class="spec-directory-entry daisy-table__row fade fade--show">
<a href="/livestream" class="daisy-link spec-profile-name">Value1</a>
<tr class="spec-directory-entry daisy-table__row fade fade--show">
<a href="/livestream" class="daisy-link spec-profile-name">Value2</a>
<tr class="spec-directory-entry daisy-table__row fade fade--show">
.
.
.
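For reference, once rendered HTML like the snippet above is actually present, the anchor texts can be pulled out with a CSS selector. A minimal sketch against static stand-in markup (the table wrapper and closing tags below are assumptions added to make the fragment parseable):

```python
from bs4 import BeautifulSoup

# Static stand-in for the rendered rows shown above (hypothetical wrapper).
html = """
<table>
<tr class="spec-directory-entry daisy-table__row fade fade--show">
  <td><a href="/livestream" class="daisy-link spec-profile-name">Value1</a></td>
</tr>
<tr class="spec-directory-entry daisy-table__row fade fade--show">
  <td><a href="/livestream" class="daisy-link spec-profile-name">Value2</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the profile-name anchors inside the matching table rows.
for a in soup.select("tr.spec-directory-entry a.spec-profile-name"):
    print(a.text)
# Value1
# Value2
```

This only works once the rows exist in the HTML, which is exactly what the answers below address.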
Answer 0 (score: 3)
The page content is rendered by JavaScript. Using the prerender.io service is an easy way to get the data you need from the page.
import requests
from bs4 import BeautifulSoup

# the target we want to open
# changed to use the prerender.io service
url = 'http://service.prerender.io/https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at'

# open with GET method
resp = requests.get(url)

# HTTP response 200 means OK status
if resp.status_code == 200:
    print("Successfully opened the web page")
    print("The news are as follow :-\n")

    # we need a parser; Python's built-in HTML parser is enough
    soup = BeautifulSoup(resp.text, 'html.parser')

    # l is the element which contains all the text, i.e. the news
    l = soup.find("tr", "spec-directory-entry daisy-table__row fade fade--show")

    # now we want to print only the text part of each anchor
    for i in l:
        print(i.text)
else:
    print("Error")
Data returned by the above code:
Successfully opened the web page
The news are as follow :-
LivestreamManaged
04 / 2019
73
$100
$150-$250
Edited: in reply to Ahmad's comment.
Here is the code to get only the values of the "Livestream" table row.
import requests
from bs4 import BeautifulSoup

# the target we want to open
# changed to use the prerender.io service
url = 'http://service.prerender.io/https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at'

# open with GET method
resp = requests.get(url)

# HTTP response 200 means OK status
if resp.status_code == 200:
    print("Successfully opened the web page")
    print("The news are as follow :-\n")

    # we need a parser; Python's built-in HTML parser is enough
    soup = BeautifulSoup(resp.text, 'html.parser')

    # l is the list which contains all "tr" tags
    l = soup.findAll("tr", "spec-directory-entry daisy-table__row fade fade--show")

    # looping through the list of table rows
    for i in l:
        # checking if the current row is for 'Livestream'
        if i.find('a').text == 'Livestream':
            # printing the row's values except the first "td" tag
            for e in i.findAll('td')[1:]:
                print(e.text)
else:
    print("Error")
Result:
Successfully opened the web page
The news are as follow :-
04 / 2019
73
$100
$150-$250
Answer 1 (score: 1)
It looks like the content is rendered onto the page by JavaScript. You can use Selenium together with Beautiful Soup to get the values.
from selenium import webdriver
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at")
time.sleep(5)  # give the JavaScript time to render the rows
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for a in soup.select("a.spec-profile-name[href='/livestream']"):
    print(a.text)
Answer 2 (score: 0)
Looking at what the request actually fetches, it appears this page relies on dynamic content. Your request returns the following text:
It looks like your JavaScript is disabled. To use HackerOne, enable JavaScript in your browser and refresh this page.
You get "TypeError: 'NoneType' object is not iterable" because without JavaScript there are no "tr" elements for BeautifulSoup to find and iterate over. You will have to use something like Selenium to simulate a browser running JavaScript in order to get the HTML you need.
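That TypeError can be reproduced without touching the network. A minimal sketch (the HTML string below is a stand-in mimicking the JavaScript-disabled response, not the site's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the body the server returns without JavaScript.
html = "<html><body><p>It looks like your JavaScript is disabled.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# No matching <tr> exists, so find() returns None...
row = soup.find("tr", "spec-directory-entry daisy-table__row fade fade--show")
print(row)  # None

# ...and iterating over None raises exactly the error from the question.
try:
    for i in row:
        print(i.text)
except TypeError as err:
    print(err)  # 'NoneType' object is not iterable
```

Checking the return value of `find()` before iterating makes this failure mode explicit instead of crashing.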