使用Python中的selenium刮取动态(AJAX)网站

时间:2018-06-13 15:52:19

标签: python ajax selenium web-scraping beautifulsoup

我有一个基于AJAX的网站https://stackshare.io/application_and_data。我试图在所有页面上刮掉技术堆栈的徽标。我使用selenium来find_element_by_class - 它返回一个空列表。在XHR请求中找到的JQuery没有我可以使用的URL。需要帮助对jQuery脚本进行反向工程。

我在网络数据中找到的其他网址似乎也失败了。我试图邮递员复制请求,但无法正确执行。

非常感谢任何帮助。

import time
import requests
from bs4 import BeautifulSoup
import urlparse
import os
from selenium import webdriver
from selenium.webdriver.common.by import By


driver = webdriver.Firefox(executable_path="/home/Documents/geckodriver")

driver.get("https://stackshare.io/application_and_data/")
content = driver.find_elements_by_class_name("btn btn-ss-alt btn-lg load-more-layer-stacks")

content_1 = driver.find_elements_by_class_name("div-center hidden-xs")

Content和content_1给出一个空列表。我该怎么办?或者我在这里犯了什么错?

以下是我尝试的逆向工程方法。

request_url = 'https://stackshare.io/application_and_data/load-more'
request_headers = {
'Accept' : '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language'   : 'en-GB,en;q=0.5',
'Connection'    : 'keep-alive',
'Content-Length'    : '128',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'cookie' :'_stackshare_production_session=cUNIOVlrV0h2dStCandILzJDWmVReGRlaWI1SjJHOWpYdDlEK3BzY2JEWjF3Lzd6Z0F6Zmg1RjUzNGo0U1dPNFg2WHdueDl5VEhCSHVtS2JiaVdNN0FvRWJMV0pBS0ZaZ0RWYW14bFFBcm1OaDV6RUptZlJMZ29TQlNOK1pKOFZ3NTVLbEdmdjFhQnRLZDl1d29rSHVnPT0tLWFzQlcrcy9iQndBNW15c0lHVHlJNkE9PQ%3D%3D--b0c41a10e8b0cf8cd020f7b07d6507894e50a9c5; ajs_user_id=null; ajs_group_id=null; ajs_anonymous_id=%224cf45ffc-a1ab-4048-94ba-d8c58063df95%22; wooTracker=Psbca0UX84Do; _ga=GA1.2.877065752.1528363377; amplitude_id_63407ddf709a227ea844317f20f7b56estackshare.io=eyJkZXZpY2VJZCI6IjcwYmNiMGQ3LTM1MjAtNDgzZi1iNWNlLTdmMTIzYzQxZGEyMVIiLCJ1c2VySWQiOm51bGwsIm9wdE91dCI6ZmFsc2UsInNlc3Npb25JZCI6MTUyODgwNTg2ODQ0NiwibGFzdEV2ZW50VGltZSI6MTUyODgwNjc0Nzk2OSwiZXZlbnRJZCI6ODUsImlkZW50aWZ5SWQiOjUsInNlcXVlbmNlTnVtYmVyIjo5MH0=; uvts=7an3MMNHYn0XBZYF; __atuvc=3%7C23; _gid=GA1.2.685188865.1528724539; amplitude_idundefinedstackshare.io=eyJvcHRPdXQiOmZhbHNlLCJzZXNzaW9uSWQiOm51bGwsImxhc3RFdmVudFRpbWUiOm51bGwsImV2ZW50SWQiOjAsImlkZW50aWZ5SWQiOjAsInNlcXVlbmNlTnVtYmVyIjowfQ==; _gat=1; _gali=wrap',
'Host'  :'stackshare.io',
'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
'Referer'   :'https://stackshare.io/application_and_data',
'X-CSRF-Token' : 'OEhhwcDju+WcpweukjB09hDFPDhwqX…nm+4fAgbMceRxnCz7gg4g//jDEg==',
'X-Requested-With'  : 'XMLHttpRequest'
}

payload = {}

response = requests.post(request_url, data=payload, headers=request_headers)

print response

观察:我得到了499响应代码。我需要提供多少有效载荷? 我检查了XHR请求,但找不到正确的URL,导致。

0 个答案:

没有答案