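Below is my code. It works, but sometimes it does not. I would call it an intermittent problem; could it be caused by dynamic elements on the page? What is the solution for dynamic elements?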
import requests
from bs4 import BeautifulSoup


def collect_bottom_url(product_string):
    """
    collect_bottom_url:
    Accepts a product name as an argument, builds the search URL for that
    product, and collects all the URLs given at the bottom of the page.
    :return: list_of_urls
    """
    url = 'https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product_string
    # Download the main search page for the product
    webpage = requests.get(url)
    # Store the main URL of the product in a list
    list_of_urls = list()
    list_of_urls.append(url)
    # Parse the downloaded page using the lxml parser
    my_soup = BeautifulSoup(webpage.text, "lxml")
    # Find all elements with class "pagnLink" in the page
    urls_at_bottom = my_soup.find_all(class_='pagnLink')
    empty_list = list()
    for b_url in urls_at_bottom:
        empty_list.append(b_url.find('a')['href'])
    for item in empty_list:
        item = "https://www.amazon.in/" + item
        list_of_urls.append(item)
    print(list_of_urls)
    return list_of_urls


collect_bottom_url('book')
Here is output 1, which is correct:
['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book', 'https://www.amazon.in//book/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Abook', 'https://www.amazon.in//book/s?ie=UTF8&page=3&rh=i%3Aaps%2Ck%3Abook']
Here is output 2, which is incorrect:
['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book']
Answer 0 (score: 3)
The page is not dynamic. The site is asking for a CAPTCHA because you are sending the default user agent, so change it.
headers = {"User-Agent": 'Mozilla/5.0.............'}

def collect_bottom_url(product_string):
    .....
    webpage = requests.get(url, headers=headers)
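To tell a CAPTCHA response apart from a page that simply has no pagination links, it may help to check the downloaded HTML before parsing it. The following is a minimal sketch, not a definitive implementation: the helper names `looks_like_captcha` and `fetch_search_page`, the marker strings, and the example User-Agent value are all assumptions made for illustration and may not match what the site actually returns.

```python
import requests

# Illustrative placeholder only: replace it with the User-Agent string of your own browser.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (example; replace with your browser's User-Agent)"

# Assumed marker strings for a robot-check page; adjust them to what you actually receive.
ASSUMED_CAPTCHA_MARKERS = ("validateCaptcha", "Enter the characters you see below")


def looks_like_captcha(html, markers=ASSUMED_CAPTCHA_MARKERS):
    """Return True if any of the marker strings occurs in the HTML (case-insensitive)."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in markers)


def fetch_search_page(url, user_agent, timeout=10):
    """Download a page with an explicit User-Agent and fail loudly on a CAPTCHA page."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    if looks_like_captcha(response.text):
        raise RuntimeError("Received a CAPTCHA page instead of search results")
    return response.text
```

Inside `collect_bottom_url`, the returned HTML could be passed to `BeautifulSoup` in place of `webpage.text`, so an intermittent block shows up as an exception rather than a list containing only the first URL.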
For dynamic pages, use Selenium.
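For a page that really does build its links with JavaScript, a Selenium version might look like the sketch below. It assumes Selenium 4 with Chrome and a matching driver available locally; the function names and the `.pagnLink a` selector (taken from the class name used in the question) are illustrative assumptions, and the site's markup may differ.

```python
def unique_in_order(urls):
    """Drop empty values and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_links_with_selenium(url, css_selector=".pagnLink a", timeout=10):
    """Open the page in a real browser and wait until the pagination links exist."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Explicit wait: raises TimeoutException if no matching element appears in time.
        links = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        hrefs = [link.get_attribute("href") for link in links]
        return unique_in_order([url] + hrefs)
    finally:
        # Always close the browser, even when the wait times out.
        driver.quit()
```

The explicit wait is the part that addresses dynamic elements: the code does not read the links until the browser reports that they are present in the DOM.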