For educational purposes, I am trying to scrape every image on the 9gag.com/hot page; I am learning Python and web scraping. Here is my code in a very basic form:
import requests, os, bs4
url = 'https://9gag.com/hot'
os.makedirs('9gag', exist_ok=True)
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
find = soup.findAll("img")
print(find)
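This is the HTML file I am working with:
I have trouble understanding how findAll() and the other soup methods work: when I run this code it finds no <img> tags, even though the page contains plenty of them. I don't know how I should search for things: by tag, by tag and class, by parent, or some other way?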
Answer 0 (score: 0)
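You don't see any <img> tags because the page loads its content dynamically via AJAX. If you open the developer tools in Firefox or Chrome, you will see that the main content is loaded as JSON from a different URL: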
import requests, json
url = 'https://9gag.com/v1/featured-posts'
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
data = res.json()
print(json.dumps(data, indent=4))
Prints:
Downloading page https://9gag.com/v1/featured-posts...
{
"meta": {
"timestamp": 1562836411,
"status": "Success",
"sid": "9gVQ01EVjlHTUVkMMRVT1wEVFVTTn1TY"
},
"data": {
"items": [
{
"itemId": "27568",
"title": "The Corgi Who Plays Cheddar On Brooklyn Nine-Nine Has Passed Away",
"url": "https://9gag.com/gag/adLm8rZ",
"imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562834921.0526_hYra9u_300.jpg",
"upVoteCount": 19,
"commentsCount": 12
},
{
"itemId": "27566",
"title": "Star Wars Reveals First Look At Sith Trooper For 'The Rise Of Skywalker'",
"url": "https://9gag.com/gag/aZLGyEW",
"imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562833129.2422_NUTeny_300.jpg",
"upVoteCount": 21,
"commentsCount": 26
},
... and so on.
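To get from that JSON to image files on disk, something along these lines might work. It is only a sketch: it assumes the endpoint still returns the "data" / "items" / "imageURL" structure shown above, and the function names (extract_image_urls, download_images, main) are made up for illustration.

```python
import os


def extract_image_urls(payload):
    """Return the imageURL of every item in a featured-posts style payload."""
    items = payload.get("data", {}).get("items", [])
    return [item["imageURL"] for item in items if item.get("imageURL")]


def download_images(urls, folder="9gag"):
    """Save each URL into `folder`, named after the last path segment."""
    # Third-party package; imported here so the helper above works offline.
    import requests

    os.makedirs(folder, exist_ok=True)
    for url in urls:
        res = requests.get(url)
        res.raise_for_status()
        path = os.path.join(folder, os.path.basename(url))
        with open(path, "wb") as fh:
            fh.write(res.content)


def main():
    """Fetch the JSON feed and download every image it lists (uses the network)."""
    import requests

    res = requests.get("https://9gag.com/v1/featured-posts")
    res.raise_for_status()
    download_images(extract_image_urls(res.json()))
```

Calling main() performs the actual requests. Keeping the URL extraction in its own function makes it easy to check against a saved copy of the JSON before touching the network.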
Answer 1 (score: 0)
As mentioned, the content is loaded dynamically. Instead of requests you can use requests_html, which supports JavaScript.
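A minimal sketch of that approach, assuming the requests-html package is installed and that the rendered page exposes its images as ordinary <img> elements with a src attribute (not verified against the current 9gag markup). The helper names image_sources and fetch_rendered_image_sources are hypothetical.

```python
def image_sources(elements):
    """Collect the non-empty src attribute of each element."""
    return [el.attrs["src"] for el in elements if el.attrs.get("src")]


def fetch_rendered_image_sources(url):
    """Render the page with JavaScript enabled and return its <img> sources."""
    # Third-party package (pip install requests-html); imported lazily
    # because render() needs a Chromium download on first use.
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    r.html.render()  # executes the page's JavaScript in headless Chromium
    return image_sources(r.html.find("img"))
```

Calling fetch_rendered_image_sources('https://9gag.com/hot') would then return whatever image URLs exist after the initial render; content that only appears while scrolling may need extra arguments to render().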
It gives you something like: