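This should be easy, but I can't manage it because I know nothing about web architecture, not even the basics.
I want to get the link to every course under https://www.coursera.org/browse/arts-and-humanities/history, with some filters applied (for example, language=english): Coursera history courses in English.
When this page loads, many courses do not appear until you scroll down. If I save the HTML file locally, I find only 58 instances of https://www.coursera.org/learn/, which is the prefix of every course URL, but I expect at least 128.
So how can I save a dynamically loaded web page using Chrome or Python?
Using @Rajat's code, the automated browser scrolls down to the bottom, but the saved HTML still does not contain all the courses: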
import os
from bs4 import BeautifulSoup
import time
from selenium import webdriver
current_dir=os.getcwd()
#download chromedriver for your operating system
driver = webdriver.Chrome(current_dir+'/chromedriver')
#place your url here
url="https://www.coursera.org/browse/arts-and-humanities/history?facets=skillNameMultiTag%2CjobTitleMultiTag%2CdifficultyLevelTag%2Clanguages%3AEnglish%2CentityTypeTag%2CpartnerMultiTag%2CcategoryMultiTag%2CsubcategoryMultiTag%3Ahistory&sortField="
driver.get(url)
count = 1200
step = 30
#scroll down in small steps so lazily loaded content has a chance to appear
for _ in range(count):
    driver.execute_script("window.scrollBy(0, {});".format(step))
    time.sleep(0.01)
with open("output.html", "w") as file:
    file.write(driver.page_source)
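One way to check how many distinct course links a saved page really contains is to search it for the `https://www.coursera.org/learn/` prefix mentioned above. The following is only a minimal sketch: the function name `find_course_links` is made up for illustration, and the assumption that a course slug consists of letters, digits, and hyphens is a guess rather than something the site documents.

```python
import re

# Prefix of every course page, as given in the question.
COURSE_PREFIX = "https://www.coursera.org/learn/"


def find_course_links(html):
    """Return the set of distinct course URLs found in an HTML string."""
    # Assumption: a course slug is made of letters, digits and hyphens only.
    pattern = re.escape(COURSE_PREFIX) + r"[A-Za-z0-9-]+"
    return set(re.findall(pattern, html))
```

Reading `output.html` into a string and taking `len(find_course_links(text))` gives the number of unique courses, which is easier to compare against the expected 128 than a raw count of occurrences, since the same link can appear several times in the markup.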
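Answer 0 (score: 1)
You should use the Selenium web driver; I use chromedriver for this job. It opens your web page and performs the scroll-down for you. You only need to decide on the condition that controls the scrolling.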
import os
import time
from bs4 import BeautifulSoup
from selenium import webdriver
current_dir=os.getcwd()
#download chromedriver for your operating system
#(Selenium 3 call style; Selenium 4 takes a Service object for the driver path)
driver = webdriver.Chrome(current_dir+'/chromedriver')
#place your url here
url="https://stackoverflow.com"
driver.get(url)
#scroll as many times as you need using a loop; 10 is an arbitrary example
for count in range(1, 11):
    driver.execute_script("window.scrollTo(0, {});".format(count*1400))
    time.sleep(2)
inner_html=driver.page_source
soup=BeautifulSoup(inner_html,'html.parser')
Here soup will contain all the HTML of this web page.
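A commonly used scrolling condition is "keep going until the page height stops growing". The sketch below illustrates that idea under stated assumptions: `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical names, and the only driver call it relies on is Selenium's `execute_script`. It is not confirmed to load every course on the Coursera page, which may fetch content on a different trigger.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object offering Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # this scroll loaded nothing new
        last_height = new_height
    return max_rounds  # gave up after the safety cap
```

Calling `scroll_until_stable(driver)` after `driver.get(url)` and before reading `driver.page_source` replaces a fixed scroll count with a stop condition. The `max_rounds` cap is there so an endlessly growing page cannot keep the loop running forever.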
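Answer 1 (score: 1)
It looks like they are using GraphQL to fetch the results, and there doesn't seem to be any authentication on the site. You can get the results with a simple POST call using any tool you like (Python, curl, Postman, etc.). Since your original code is in Python, here is a simple Python snippet: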
#!/usr/bin/env python
import requests
import json
import warnings
warnings.filterwarnings("ignore")
def getHeadersb345e918473d():
    # the endpoint expects a JSON request body
    result={}
    result['content-type']='application/json'
    return result
def json_data_e6084285():
    # graphqlBatch takes a list of operations; only one is sent here
    result=[]
    result_item0={}
    result_item0['query']='query catalogResultQuery($facets: [String!]!, $start: String!, $skip: Boolean = false, $sortField: String, $limit: Int) { CatalogResultsV2Resource { browseV2(facets: $facets, start: $start, limit: $limit, sortField: $sortField) @skip(if: $skip) { elements { label entries { id score courseId specializationId onDemandSpecializationId resourceName __typename } domainId subdomainId facets courses { elements { ...CourseFragment __typename } __typename } s12ns { elements { ...S12nFragment __typename } __typename } __typename } paging { total next __typename } __typename } __typename } } fragment CourseFragment on CoursesV1 { id slug name photoUrl s12nIds level workload courseDerivativesV2 { skillTags { skillName relevanceScore __typename } avgLearningHoursAdjusted commentCount averageFiveStarRating ratingCount __typename } partners { elements { name squareLogo classLogo logo __typename } __typename } __typename } fragment S12nFragment on OnDemandSpecializationsV1 { name id slug logo courseIds derivativeV2 { averageFiveStarRating avgLearningHoursAdjusted __typename } partners { elements { name squareLogo classLogo logo __typename } __typename } metadata { headerImage level __typename } courses { elements { courseDerivativesV2 { skillTags { skillName relevanceScore __typename } __typename } __typename } __typename } __typename } '
    variables={}
    variables['skip']=False
    # same filters as the facets parameter of the browse URL
    facets=[]
    facets.append('skillNameMultiTag')
    facets.append('jobTitleMultiTag')
    facets.append('difficultyLevelTag')
    facets.append('languages:English')
    facets.append('entityTypeTag')
    facets.append('partnerMultiTag')
    facets.append('categoryMultiTag')
    facets.append('subcategoryMultiTag:history')
    variables['facets']=facets
    variables['limit']=300
    variables['start']='0'
    variables['sortField']=''
    result_item0['variables']=variables
    result_item0['operationName']='catalogResultQuery'
    result.append(result_item0)
    return result
url='https://www.coursera.org/graphqlBatch'
r=requests.post(url, headers=getHeadersb345e918473d(), data=json.dumps(json_data_e6084285()), verify=False )
print(r.text)
You can modify the values of limit and start to get the results you need.
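Changing `start` by hand can be turned into a loop that keeps requesting pages until one comes back empty. This is a hedged sketch rather than a tested client: `collect_courses`, `fetch_page`, and `max_requests` are hypothetical names, the request itself is left to a callable you supply, and the idea that a course's `slug` field is the part of its URL after `/learn/` is an assumption based on the prefix given in the question.

```python
def collect_courses(fetch_page, max_requests=50):
    """Collect course elements page by page.

    `fetch_page(start)` is a hypothetical callable that performs one request
    with variables['start'] set to str(start) and returns the list of course
    elements found in the response.
    """
    courses = []
    start = 0
    for _ in range(max_requests):  # hard cap so the loop always terminates
        page = fetch_page(start)
        if not page:  # an empty page means there is nothing left
            break
        courses.extend(page)
        start += len(page)  # continue right after the items already received
    return courses


def course_url(course):
    """Build a course link from one course element."""
    # Assumption: 'slug' is the path component that follows /learn/.
    return "https://www.coursera.org/learn/" + course["slug"]
```

With the snippet above, `fetch_page` could set `variables['start']` in the payload, post it to the same URL, and return the course elements from the parsed JSON response. The query also requests `paging { total next }`, which may offer a more direct stop condition, but its exact format is not shown here.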
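Answer 2 (score: 1)
I took @Gautam's code and just restructured it.
The first request returns only 100 items (even with limit set to 300), so I use start to get the next 28 items.
By using json= instead of data=, I don't need headers= or json.dumps().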
#!/usr/bin/env python
import requests
import warnings
warnings.filterwarnings("ignore")
def display(data):
    #print('len:', len(data))
    #print('len:', len(data[0]['data']['CatalogResultsV2Resource']['browseV2']['elements']))
    print('>>> len:', len(data[0]['data']['CatalogResultsV2Resource']['browseV2']['elements'][0]['courses']['elements']))
    items = data[0]['data']['CatalogResultsV2Resource']['browseV2']['elements'][0]['courses']['elements']
    for item in items:
        print(item['name'])
        #for key, value in item.items():
        #    print(key, value)
        #print('---')
#-----------------------------------------------------------
json_data = [{
    'operationName': 'catalogResultQuery',
    'variables': {
        'skip': False,
        'limit': 300,
        'start': '0',
        'sortField': '',
        'facets': [
            'skillNameMultiTag',
            'jobTitleMultiTag',
            'difficultyLevelTag',
            'languages:English',
            'entityTypeTag',
            'partnerMultiTag',
            'categoryMultiTag',
            'subcategoryMultiTag:history'
        ]
    },
    'query': 'query catalogResultQuery($facets: [String!]!, $start: String!, $skip: Boolean = false, $sortField: String, $limit: Int) { CatalogResultsV2Resource { browseV2(facets: $facets, start: $start, limit: $limit, sortField: $sortField) @skip(if: $skip) { elements { label entries { id score courseId specializationId onDemandSpecializationId resourceName __typename } domainId subdomainId facets courses { elements { ...CourseFragment __typename } __typename } s12ns { elements { ...S12nFragment __typename } __typename } __typename } paging { total next __typename } __typename } __typename } } fragment CourseFragment on CoursesV1 { id slug name photoUrl s12nIds level workload courseDerivativesV2 { skillTags { skillName relevanceScore __typename } avgLearningHoursAdjusted commentCount averageFiveStarRating ratingCount __typename } partners { elements { name squareLogo classLogo logo __typename } __typename } __typename } fragment S12nFragment on OnDemandSpecializationsV1 { name id slug logo courseIds derivativeV2 { averageFiveStarRating avgLearningHoursAdjusted __typename } partners { elements { name squareLogo classLogo logo __typename } __typename } metadata { headerImage level __typename } courses { elements { courseDerivativesV2 { skillTags { skillName relevanceScore __typename } __typename } __typename } __typename } __typename } '
}]
url = 'https://www.coursera.org/graphqlBatch'
#headers = {'content-type': 'application/json'}
#r = requests.post(url, headers=headers, json=json_data, verify=False)
# --- it gives first 100 items ---
r = requests.post(url, json=json_data, verify=False)
data = r.json()
display(data)
# --- it gives next 28 items ---
json_data[0]['variables']['start'] = str(100) # it has to be string, not integer
r = requests.post(url, json=json_data, verify=False)
data = r.json()
display(data)
Start of the output:
>>> len: 100
Buddhism and Modern Psychology
English Composition I
Fashion as Design
The Modern World, Part One: Global History from 1760 to 1910
Indigenous Canada
Understanding Einstein: The Special Theory of Relativity
Terrorism and Counterterrorism: Comparing Theory and Practice
Magic in the Middle Ages
The Ancient Greeks
Introduction to Ancient Egypt and Its Civilization
End of the output:
>>> len: 28
Theatre and Globalization
ART of the MOOC: Arte Público y Pedagogía
The Music of the Rolling Stones, 1962-1974
Soul Beliefs: Causes and Consequences - Unit 2: Belief Systems
The Making of the US President: A Short History in Five Elections
Cities are back in town : sociologie urbaine pour un monde globalisé
Toledo: Deciphering Secrets of Medieval Spain
Russia and Nuclear Arms Control
Espace mondial, a French vision of Global studies
Religious Transformation in Early China: the Period of Division
Patrick Henry: Forgotten Founder
A la recherche du Grand Paris
Burgos: Deciphering Secrets of Medieval Spain
Journey Conversations: Weaving Knowledge and Action
Structuring Values in Modern China
Religion and Thought in Modern China: the Song, Jin, and Yuan
宇宙之旅:展现生命 (Journey of the Universe: The Unfolding of Life)
The Worldview of Thomas Berry: The Flourishing of the Earth Community
Science and Technology in the Silla Cultural Heritage
世界空间、法国视角下的国际研究
Fundamentals of the Chinese character writing (Part 1)
Understanding China, 1700-2000: A Data Analytic Approach, Part 2
"Espace mondial" الرؤية الفرنسية للدراسات العالمية
Searching for the Grand Paris
宇宙之旅:对话 (Journey of the Universe: Weaving Knowledge and Action)
Contemporary India
Thomas Berry的世界观:地球社区的繁荣 (The Worldview of Thomas Berry: The Flourishing of the Earth Community)
"Making" Progress Teach-Out