I'm trying to scrape a website that shows a DDoS-protection page with a 5-second delay before the real content appears. I'm using Python 3 and BeautifulSoup, and I think I need to introduce a time delay after sending the request and before reading the content.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
url = 'https://koinex.in/'
response = requests.get(url)
html = response.content
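For reference, the delay described above can be sketched as below (with a stand-in fetch function instead of a real request, so the example is self-contained). Note that sleeping after the response has arrived cannot change anything, because the content is fixed the moment the request completes — which is why a delay alone does not get past the protection page:

```python
import time

def fetch(url):
    # Stand-in for requests.get(url).content, to keep this self-contained;
    # a Cloudflare-protected site would return its interstitial page here
    return b'<html>DDoS protection page</html>'

html = fetch('https://koinex.in/')
time.sleep(1)  # sleeping here changes nothing: the bytes were fixed at request time
print(html.decode())
```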
Answer 0 (score: 1)
The page uses JavaScript to generate a value that is sent to https://koinex.in/cdn-cgi/l/chk_jschl to obtain the cookie cf_clearance, which the server then checks in order to skip the protection page. The code generates that value with different parameters and different methods on every request, so it is easier to use Selenium to get the data:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://koinex.in/')

# give the JavaScript challenge time to finish and the page time to render
time.sleep(8)

tables = driver.find_elements_by_tag_name('table')
for item in tables:
    print(item.text)
    #print(item.get_attribute("value"))
Result:
VOLUME PRICE/ETH
5.2310 64,300.00
0.0930 64,100.00
10.7670 64,025.01
0.0840 64,000.00
0.3300 63,800.00
0.2800 63,701.00
0.4880 63,700.00
0.7060 63,511.00
0.5020 63,501.00
0.1010 63,500.01
1.4850 63,500.00
1.0000 63,254.00
0.0300 63,253.00
VOLUME PRICE/ETH
1.0000 64,379.00
0.0940 64,380.00
0.9710 64,398.00
0.0350 64,399.00
0.7170 64,400.00
0.3000 64,479.00
5.1650 64,480.35
0.0020 64,495.00
0.2000 64,496.00
9.5630 64,500.00
0.4000 64,501.01
0.0400 64,550.00
0.5220 64,600.00
DATE VOLUME PRICE/ETH
31/12/2017, 12:19:29 0.2770 64,300.00
31/12/2017, 12:19:11 0.5000 64,300.00
31/12/2017, 12:18:28 0.3440 64,025.01
31/12/2017, 12:18:28 0.0750 64,026.00
31/12/2017, 12:17:50 0.0010 64,300.00
31/12/2017, 12:17:47 0.0150 64,300.00
31/12/2017, 12:15:45 0.6720 64,385.00
31/12/2017, 12:15:45 0.2000 64,300.00
31/12/2017, 12:15:45 0.0620 64,300.00
31/12/2017, 12:15:45 0.0650 64,199.97
31/12/2017, 12:15:45 0.0010 64,190.00
31/12/2017, 12:15:45 0.0030 64,190.00
31/12/2017, 12:15:25 0.0010 64,190.00
You can also get the HTML from Selenium and use it with BeautifulSoup:

soup = BeautifulSoup(driver.page_source, 'html.parser')

but Selenium can locate data using xpath, css selectors and other methods, so most of the time there is no need for BeautifulSoup. See the documentation: 4. Locating Elements
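As a sketch of the BeautifulSoup side (using a hard-coded snippet here as a stand-in for driver.page_source, so the example is self-contained):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; in practice you would pass the real page HTML
page_source = '''
<table>
  <tr><th>VOLUME</th><th>PRICE/ETH</th></tr>
  <tr><td>5.2310</td><td>64,300.00</td></tr>
</table>
'''

soup = BeautifulSoup(page_source, 'html.parser')
for row in soup.find_all('tr'):
    # get_text(' ', strip=True) joins the cell texts with a single space
    print(row.get_text(' ', strip=True))
```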
EDIT: This code loads the page with requests using the cookies taken from Selenium, and it has no problem with the DDoS page. The problem is that the page uses JavaScript to display the tables, so you can't get them with requests + BeautifulSoup. But perhaps you can find the URL that the JavaScript uses to fetch the tables' data; then requests may be useful.
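If such a URL is found in the browser's network tab, the response is typically JSON and needs no HTML parsing at all. The payload shape below is entirely hypothetical, for illustration only; the real URL and field names must be read from the network tab:

```python
import json

# Hypothetical example of what an order-book endpoint might return;
# in practice this string would come from s.get(api_url).text
payload = '{"bids": [["5.2310", "64300.00"], ["0.0930", "64100.00"]]}'

data = json.loads(payload)
for volume, price in data['bids']:
    print(volume, price)
```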
from selenium import webdriver
import time
# --- Selenium ---
url = 'https://koinex.in/'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(8)
#tables = driver.find_elements_by_tag_name('table')
#for item in tables:
# print(item.text)
# --- convert cookies/headers from Selenium to Requests ---
cookies = driver.get_cookies()
for item in cookies:
    print('name:', item['name'])
    print('value:', item['value'])
    print('path:', item['path'])
    print('domain:', item['domain'])
    print('expiry:', item['expiry'])
    print('secure:', item['secure'])
    print('httpOnly:', item['httpOnly'])
    print('----')
# convert list of dictionaries into dictionary
cookies = {c['name']: c['value'] for c in cookies}
# it has to be the full `User-Agent` string used by the browser/Selenium (a short 'Mozilla/5.0' will not work)
headers = {'User-Agent': driver.execute_script('return navigator.userAgent')}
# --- requests + BeautifulSoup ---
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.headers.update(headers)
s.cookies.update(cookies)
r = s.get(url)
print(r.text)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')
print('tables:', len(tables))
for item in tables:
    print(item.get_text())
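The cookie-conversion step above can be checked in isolation. A minimal sketch, using made-up cookie data in the same shape that driver.get_cookies() returns:

```python
# Made-up sample in the shape returned by driver.get_cookies():
# a list of dicts, one per cookie
selenium_cookies = [
    {'name': 'cf_clearance', 'value': 'abc123', 'path': '/', 'domain': '.koinex.in'},
    {'name': '__cfduid', 'value': 'xyz789', 'path': '/', 'domain': '.koinex.in'},
]

# requests expects a plain {name: value} mapping instead
cookies = {c['name']: c['value'] for c in selenium_cookies}
print(cookies)
```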