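I am trying to scrape 'https://www.kaggle.com/kernels' to return all the title names on the site, but the container that holds this detail, 'div data-reactroot', is not being pulled into the scraped data.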
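# urllib.request must be imported explicitly; a bare "import urllib" does not load it
import urllib.request
from bs4 import BeautifulSoup

kaggle = 'https://www.kaggle.com/kernels'
data = urllib.request.urlopen(kaggle).read()
htmlparse = BeautifulSoup(data, 'html.parser')
# Look for the divs that wrap each listing entry
print(htmlparse.find_all("div", {"class": "block-link block-link--bordered"}))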
Is there a mistake in my code, or is there some kind of block on the site that stops me from scraping this data?
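Answer 0: (score: 0)
As Elis Byberi wrote, the actual problem is that you are trying to get the data before it has been rendered from the back end. You can use PhantomJS to get the page content once the back end has done its work. You can find a short tutorial here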
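To illustrate why the original `find_all` call can come back empty, here is a minimal sketch that uses only the standard library. `STATIC_HTML` and `RENDERED_HTML` are hypothetical stand-ins (not the real Kaggle markup) for what the server sends and for what the DOM might look like after the scripts have run; `count_block_links` is a helper name made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the HTML the server sends before any JavaScript runs.
STATIC_HTML = """
<html><body>
<div id="app"></div>
<script>/* the listing is filled in here at run time */</script>
</body></html>
"""

# Hypothetical stand-in for the DOM after the scripts have rendered the listing.
RENDERED_HTML = """
<html><body>
<div data-reactroot="">
  <div class="block-link block-link--bordered">Example title</div>
</div>
</body></html>
"""


class BlockLinkCounter(HTMLParser):
    """Counts <div> tags whose class list contains 'block-link'."""

    def __init__(self):
        super().__init__()
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "block-link" in classes:
            self.matches += 1


def count_block_links(html):
    parser = BlockLinkCounter()
    parser.feed(html)
    return parser.matches


print(count_block_links(STATIC_HTML))    # 0: nothing to find in the raw response
print(count_block_links(RENDERED_HTML))  # 1: present only after rendering
```

A plain HTTP fetch only ever sees the first kind of document, which is why a tool that executes the page's scripts, or a direct request for the underlying data, is needed.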
Answer 1: (score: 0)
Every time the page is requested, JavaScript fetches the required data in JSON format. You can request it yourself from "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all", like this:
import requests

# Request the JSON endpoint that the page's JavaScript calls
source = requests.get("https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all")
json_obj = source.json()
# The response is a list of records; print each record's title
for a in json_obj:
    print(a["title"])
Output:
2004-2005 Landfalling Hurricanes animation
Visualization of StockData
Generating Sentences One Letter at a Time
Decoding the Sexiest Job of 21st Century!!
Novice to Grandmaster
Analysis on Pokemon Data
ROC Curve with k-Fold CV
Japan Bulgaria trade playground
Bootstrapping and CIs with Veteran Suicides
Replicating "Did I do that?" paper analyses with R
Social Progress Index and World Happiness Report
SVM+HOG On ColourCompositeImage
Low- level students
PyTorch Speech Recognition Challenge (WIP)
Loans -getting Insights
Exploring Youtube Trending Statistics EDA
3 Simple Steps (LB: .9878 with new data)
Titanic: Neural Network using Keras
Feature Engineering
Why do employees leave and what to do about it
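The only thing you need to change is the "after" query string parameter. In my request it is 439354, but you can set it to 0 to get the first records.
You can also change the number of records returned by changing the "pageSize" query string parameter, for example "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"
Output: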
Data ScienceTutorial for Beginners
Data visualization and investigation
Spooky NLP and Topic Modelling tutorial
20 Years Of Games Analysis
NYC Taxi EDA - Update: The fast & the curious
Or a urllib example:
import urllib.request
import json

kaggle = "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"
data = urllib.request.urlopen(kaggle).read()
# The body arrives as bytes; decode it before parsing the JSON
json_obj = json.loads(data.decode("utf-8"))
for a in json_obj:
    print(a["title"])
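Since only "pageSize" and "after" change between requests, the query string can be built rather than hand-edited. The following is a rough sketch, assuming the endpoint still accepts the parameters shown in this answer; `build_kernels_url` and `extract_titles` are hypothetical helper names, not part of any Kaggle API.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer above
BASE_URL = "https://www.kaggle.com/kernels.json"


def build_kernels_url(page_size=20, after=0):
    """Build the listing URL; only pageSize and after vary between calls."""
    params = {
        "sortBy": "hotness",
        "group": "everyone",
        "pageSize": page_size,
        "after": after,
        "language": "all",
        "outputType": "all",
    }
    return BASE_URL + "?" + urlencode(params)


def extract_titles(records):
    """Return the 'title' of every record that has one, skipping anything else."""
    return [r["title"] for r in records if isinstance(r, dict) and "title" in r]


print(build_kernels_url(page_size=5, after=0))
```

The resulting URL can be passed to `requests.get` or `urllib.request.urlopen` exactly as in the examples above, and the decoded JSON list handed to `extract_titles`.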