Selenium, BeautifulSoup, and requests in a loop

Time: 2018-08-25 02:49:23

Tags: python selenium python-requests

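I'm just curious, since I'm trying to learn Python. I'm extracting data from a website. What I do is scroll the page with Selenium, collect the titles that correspond to the URLs, and then, in a loop, request each URL and parse it with BeautifulSoup. Apparently it doesn't work: the HTML returned by the request prints like this: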

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

Here is the Python code:

import time
import requests

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup

chromedriver = "/Users/eduardfossas/Downloads/chromedriver"
driver = webdriver.Chrome(chromedriver)

my_url = driver.get("https://wodwell.com/wods/?sort=newest&category=none&feeds=736")
time.sleep(1)

elem = driver.find_element_by_tag_name("body")

no_of_pagedowns = 20

while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0)
    no_of_pagedowns-=1

containers = driver.find_elements_by_class_name('wod-title')

for container in containers:

    my_sub_url = "https://wodwell.com/wod/"
    my_sub_url = my_sub_url + container.text + '/'
    page_html = requests.get(my_sub_url).text

    page_soup = soup(page_html, "html.parser")

    main_text = page_soup.select('.workout-list')

Is there any way to make BeautifulSoup and requests work here? If not, I guess I'll try Scrapy.

What do you recommend?

Best regards,

1 answer:

Answer 0: (score: 0)

Never mind.

Maybe this is useful to someone: it worked as soon as I specified the request headers:

headers = {'user-agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
page_html = requests.get(my_sub_url, headers = headers).text
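
For anyone reusing this in the loop from the question, here is a minimal sketch of the same idea, assuming the 403 comes only from the missing User-Agent. It sets the header once on a `requests.Session`, raises on error responses instead of silently parsing the 403 page, and pulls the text out of `.workout-list`. The helper names `make_session`, `fetch_html`, and `extract_workout_text` and the 10-second timeout are my own choices for illustration, not part of either library.

```python
import requests
from bs4 import BeautifulSoup

# Same User-Agent string as above; it is an assumption that any
# realistic browser string is accepted by the site.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url):
    """Fetch a page; raise requests.HTTPError on 4xx/5xx (for example 403)."""
    response = session.get(url, timeout=10)  # hypothetical timeout value
    response.raise_for_status()
    return response.text


def extract_workout_text(page_html):
    """Return the stripped text of every element with class 'workout-list'."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    return [
        node.get_text(" ", strip=True)
        for node in page_soup.select(".workout-list")
    ]
```

Inside the `for container in containers:` loop this would be used as `page_html = fetch_html(session, my_sub_url)` followed by `main_text = extract_workout_text(page_html)`, with `session = make_session()` created once before the loop. If the site still answers 403, `raise_for_status()` makes that visible immediately.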