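I want to scrape the information at http://clists.nic.in/viewlist/search_result_final.php. The actual web page contains data, but when I try to scrape it, the request ends up on the previous page of the site instead. I don't know why; my guess is that the data is treated as more sensitive. I have not been able to find an exact solution, so could anyone advise and help me? Thanks in advance.
Here is my code.py: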
from bs4 import BeautifulSoup
import requests
page_link = 'http://clists.nic.in/viewlist/search_result_final.php'
# fetch the content from url
from user_agent import generate_user_agent
# generate a user agent
headers = {'User-Agent': generate_user_agent(device_type="desktop", os=('mac', 'linux'))}
print(headers)
# headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.63 Safari/537.36'}
page_response = requests.get(page_link, timeout=3, headers=headers)
print(page_response)
# parse html
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content)
# extract all html elements where price is stored
prices = page_content.find_all(class_='main_price')
print(prices)
# prices has a form:
# [<div class="main_price">Price: $66.68</div>,
# <div class="main_price">Price: $56.68</div>]
priceser = page_content.find(id='Head1')
print(priceser)
# check if the element with such id exists or not
if priceser is None:
    # NOTIFY! LOG IT, COUNT IT
    print("yes")
else:
    print("No")
try:
    page_response = requests.get(page_link, timeout=5)
    if page_response.status_code == 200:
        print("extracted")
    else:
        print(page_response.status_code)
        # notify, try again
except requests.Timeout as e:
    print("It is time to timeout")
    print(str(e))
except requests.RequestException as e:  # any other requests error
    # do something; at least report it instead of silently passing
    print(str(e))
# you can also access the main_price class by specifying the tag of the class
pricess = page_content.find_all('div', attrs={'class': 'main_price'})
print(pricess)
# proxies = {'http' : 'http://10.10.0.0:0000',
# 'https': 'http://120.10.0.0:0000'}
# page_response = requests.get(page_link, proxies=proxies, timeout=5)
This works fine for any other web page, but not for this link, because what I get back is not the actual page.
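
One way to narrow this down is to check whether the server really redirects the request, and where to. The following is only a minimal sketch, not a confirmed fix: `redirect_chain` and `was_redirected` are hypothetical helper names, and it relies only on the `history` and `url` attributes that `requests` sets on a response.

```python
import requests

# URL taken from the question above.
PAGE_LINK = 'http://clists.nic.in/viewlist/search_result_final.php'


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def was_redirected(response, requested_url):
    """True if the answer came from a different URL than the one requested."""
    return bool(response.history) or response.url != requested_url


if __name__ == '__main__':
    # A Session keeps cookies between requests, in case the site sets any.
    with requests.Session() as session:
        response = session.get(PAGE_LINK, timeout=5)
    for status, url in redirect_chain(response):
        print(status, url)
    print('redirected:', was_redirected(response, PAGE_LINK))
```

If the chain shows a 3xx hop back to the search page, the result page probably expects something a plain GET does not send, such as submitted form data or a session cookie; that is an assumption about this site, not something verified here. If the chain has no redirect but the HTML still sends a browser back, the redirect may be done by a script or a meta refresh tag in the page, which `requests` does not follow.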